
Mastering CUDA Programming: GPU Parallel Computing, Zero to Hero

Overview

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA. By harnessing the thousands of cores on a GPU, it can reach levels of computational throughput that a CPU alone cannot. CUDA has become a core technology across deep learning, scientific simulation, image processing, and many other fields.

This guide is aimed at everyone from developers new to CUDA to engineers pursuing advanced optimization, progressing step by step from GPU architecture fundamentals to writing custom deep learning operations.


1. Understanding GPU Architecture

1.1 CPU vs GPU Design Philosophy

A CPU is a general-purpose processor designed for complex control flow and low latency. Modern CPUs have 16 to 64 cores, each optimized for high single-thread performance. A GPU, by contrast, has thousands of simpler cores and operates under the SIMT (Single Instruction, Multiple Threads) model, applying the same operation to large amounts of data simultaneously.

Taking the NVIDIA A100 GPU as a reference point:

  • 6912 CUDA Cores (FP32)
  • 432 Tensor Cores (matrix-math acceleration)
  • 80GB HBM2e memory (2TB/s bandwidth)
  • 312 TFLOPS (FP16, Tensor Core)

1.2 NVIDIA GPU Structure

Streaming Multiprocessor (SM)

The SM is the GPU's basic execution unit. An A100 GPU has 108 SMs, and each SM contains:

  • 64 CUDA Cores (FP32)
  • 4 Tensor Cores (3rd generation)
  • 192KB combined L1 Cache / Shared Memory (configurable split)
  • 32 Load/Store Units
  • 4 Special Function Units (SFUs)
  • Up to 2048 concurrent threads

CUDA Core vs Tensor Core

A CUDA Core executes a single FP32 or INT32 operation per cycle. Tensor Cores accelerate matrix multiply-accumulate (MMA) operations in hardware. The A100's third-generation Tensor Cores process a 4x4 FP16 matrix operation (16 FMAs) per cycle, delivering more than 8x the throughput of the CUDA Cores.

1.3 The Ampere Architecture (A100)

A100 GPU structure
┌─────────────────────────────────────────┐
│ GPC (Graphics Processing Cluster) x 8   │
│  ┌─────────────────────────────────┐    │
│  │ TPC (Texture Processing Cluster)│    │
│  │  ┌──────────┐  ┌──────────┐    │    │
│  │  │    SM    │  │    SM    │    │    │
│  │  │ 64 CUDA  │  │ 64 CUDA  │    │    │
│  │  │  4 TC    │  │  4 TC    │    │    │
│  │  └──────────┘  └──────────┘    │    │
│  └─────────────────────────────────┘    │
│                                         │
│ L2 Cache: 40MB                          │
│ HBM2e: 80GB @ 2TB/s                     │
└─────────────────────────────────────────┘

Key features of the A100:

  • MIG (Multi-Instance GPU): partitions one GPU into up to 7 independent instances
  • NVLink 3.0: 600GB/s bidirectional bandwidth between GPUs
  • TF32: a numeric format combining FP32 range with FP16-class precision
  • Structural Sparsity: a 2:4 sparsity pattern that doubles Tensor Core throughput

1.4 The Hopper Architecture (H100)

The H100, announced in 2022, is NVIDIA's flagship data-center GPU:

  • 132 SMs with 128 CUDA Cores each
  • 4th-generation Tensor Cores: up to 6x A100 performance with FP8 support
  • Transformer Engine: automatic per-layer FP8/FP16 switching for transformer models
  • NVLink 4.0: 900GB/s bidirectional bandwidth
  • HBM3: 80GB @ 3.35TB/s

1.5 Thread Hierarchy

CUDA uses a three-level thread hierarchy:

Grid (one kernel launch)
  └── Block (runs on a single SM; its threads share shared memory)
        └── Thread (individual execution unit)

Each level can be organized in up to three dimensions:

// Example: a 2D grid of 2D blocks
dim3 gridDim(32, 32, 1);    // 32x32 = 1024 blocks
dim3 blockDim(16, 16, 1);   // 16x16 = 256 threads per block
// Total threads: 1024 * 256 = 262,144

Warp

A Warp is a group of 32 threads and the SM's basic unit of execution. All threads in a Warp execute the same instruction at the same time (SIMT). The Warp is central to the CUDA thread model, and much of CUDA optimization starts from understanding behavior at Warp granularity.

1.6 Memory Hierarchy

Latency (low → high)
Register File   : ~1 cycle,    size: 256KB/SM, up to 255 registers per thread
Shared Memory   : ~5 cycles,   size: up to 164KB/SM (Ampere)
L1 Cache        : ~30 cycles,  size: unified with shared memory
L2 Cache        : ~200 cycles, size: 40MB (A100)
Global Memory   : ~600 cycles, size: 80GB (HBM2e)

Register: private to each thread and the fastest storage. If a kernel needs more registers than are available, values spill to Local Memory (which physically lives in Global Memory).

Shared Memory: shared by all threads in a Block. It must be managed explicitly and physically occupies the same on-chip storage as the L1 Cache.

Global Memory: the main memory, accessible from every thread. Its latency is high, so optimizing the access pattern (coalescing) is essential.


2. Setting Up the CUDA Development Environment

2.1 Installing the CUDA Toolkit

Download the package for your operating system from NVIDIA's official site (https://developer.nvidia.com/cuda-toolkit).

On Ubuntu 22.04:

# Install the CUDA keyring
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb

# Update the package list and install CUDA
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-3

# Set environment variables (add to ~/.bashrc or ~/.zshrc)
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Verify the installation:

nvcc --version
# nvcc: NVIDIA (R) Cuda compiler driver
# Copyright (c) 2005-2023 NVIDIA Corporation
# Cuda compilation tools, release 12.3, V12.3.107

nvidia-smi
# +-----------------------------------------------------------------------------+
# | NVIDIA-SMI 545.23.08    Driver Version: 545.23.08    CUDA Version: 12.3     |
# |-------------------------------+----------------------+----------------------+
# | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
# | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |

2.2 The nvcc Compiler

nvcc is NVIDIA's compiler driver for CUDA C/C++ code. Internally, host code is compiled with g++/clang++, while device code is compiled to PTX and then to SASS (the actual GPU instruction set).

# Basic compilation
nvcc -o hello_cuda hello_cuda.cu

# With optimization flags
nvcc -O3 -arch=sm_80 -o matmul matmul.cu

# Debug build
nvcc -G -g -o debug_kernel debug_kernel.cu

# Targeting multiple GPU architectures
nvcc -gencode arch=compute_75,code=sm_75 \
     -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_86,code=sm_86 \
     -o multi_arch multi_arch.cu

# Key flags
# -arch=sm_XX : target architecture (sm_80 = Ampere A100)
# -O3         : optimization level
# -G          : device-code debug info
# -lineinfo   : line info for profiling
# --ptxas-options=-v : print register/memory usage

2.3 Using nvidia-smi

# Real-time GPU monitoring
nvidia-smi dmon -s u  # GPU utilization
nvidia-smi dmon -s m  # memory utilization

# Detailed info for a specific GPU
nvidia-smi -q -d MEMORY,UTILIZATION

# Pin GPU clocks (for benchmarking)
sudo nvidia-smi -pm 1  # enable persistence mode
sudo nvidia-smi --lock-gpu-clocks=1410  # lock the clocks

3. CUDA Programming Basics

3.1 Hello CUDA World

#include <cuda_runtime.h>
#include <stdio.h>

// __global__ : runs on the GPU, called from the CPU
__global__ void helloKernel() {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    printf("Hello from Thread %d (Block %d, Thread %d)\n",
           tid, blockIdx.x, threadIdx.x);
}

int main() {
    // 2 blocks, 4 threads per block = 8 total threads
    helloKernel<<<2, 4>>>();

    // Wait for the GPU to finish
    cudaDeviceSynchronize();

    return 0;
}

CUDA function qualifiers:

  • __global__: runs on the GPU, called from the CPU (a kernel)
  • __device__: runs on the GPU, callable only from GPU code
  • __host__: runs on the CPU, called from the CPU (the default)
  • __host__ __device__: compiled for both CPU and GPU

3.2 Thread Indexing

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void indexingDemo() {
    // 1D indexing
    int global_tid_1d = threadIdx.x + blockIdx.x * blockDim.x;

    // 2D indexing
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int global_tid_2d = row * (gridDim.x * blockDim.x) + col;

    // 3D indexing
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int z = threadIdx.z + blockIdx.z * blockDim.z;
    int width  = gridDim.x * blockDim.x;
    int height = gridDim.y * blockDim.y;
    int global_tid_3d = z * (width * height) + y * width + x;

    if (global_tid_1d == 0) {
        printf("Block dims: (%d, %d, %d)\n",
               blockDim.x, blockDim.y, blockDim.z);
        printf("Grid dims:  (%d, %d, %d)\n",
               gridDim.x, gridDim.y, gridDim.z);
    }
}

int main() {
    dim3 block(4, 4, 1);
    dim3 grid(2, 2, 1);
    indexingDemo<<<grid, block>>>();
    cudaDeviceSynchronize();
    return 0;
}

3.3 Vector Addition

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define CUDA_CHECK(err) \
    do { \
        cudaError_t e = (err); \
        if (e != cudaSuccess) { \
            fprintf(stderr, "CUDA error %s:%d: %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(e)); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

// Vector addition kernel
__global__ void vectorAdd(const float* A, const float* B, float* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Bounds check (in case N is not a multiple of blockDim.x)
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    const int N = 1 << 20;  // 1M elements
    const size_t bytes = N * sizeof(float);

    // Allocate host memory
    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);

    // Initialize
    for (int i = 0; i < N; i++) {
        h_A[i] = (float)rand() / RAND_MAX;
        h_B[i] = (float)rand() / RAND_MAX;
    }

    // Allocate device memory
    float *d_A, *d_B, *d_C;
    CUDA_CHECK(cudaMalloc(&d_A, bytes));
    CUDA_CHECK(cudaMalloc(&d_B, bytes));
    CUDA_CHECK(cudaMalloc(&d_C, bytes));

    // Copy host -> device
    CUDA_CHECK(cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice));

    // Launch the kernel
    int blockSize = 256;
    int gridSize = (N + blockSize - 1) / blockSize;
    vectorAdd<<<gridSize, blockSize>>>(d_A, d_B, d_C, N);

    // Check for errors
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    // Copy device -> host
    CUDA_CHECK(cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost));

    // Validate the result
    float maxError = 0.0f;
    for (int i = 0; i < N; i++) {
        maxError = fmaxf(maxError, fabsf(h_C[i] - (h_A[i] + h_B[i])));
    }
    printf("Max error: %e\n", maxError);

    // Free memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}

3.4 Matrix Multiplication (Naive Version)

#include <cuda_runtime.h>
#include <stdio.h>

#define N 1024  // matrix size N x N

// Naive matrix multiplication: C = A * B
__global__ void matMulNaive(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++) {
            sum += A[row * n + k] * B[k * n + col];
        }
        C[row * n + col] = sum;
    }
}

int main() {
    int n = N;
    size_t bytes = n * n * sizeof(float);

    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);

    // Initialize the matrices
    for (int i = 0; i < n * n; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 1.0f;
    }

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // Use 16x16 thread blocks
    dim3 blockDim(16, 16);
    dim3 gridDim((n + blockDim.x - 1) / blockDim.x,
                 (n + blockDim.y - 1) / blockDim.y);

    // Timing with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    matMulNaive<<<gridDim, blockDim>>>(d_A, d_B, d_C, n);
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);

    printf("Matrix multiplication (%dx%d): %.3f ms\n", n, n, milliseconds);

    // Theoretical FLOP count: 2 * N^3
    double flops = 2.0 * n * n * n;
    double gflops = flops / (milliseconds * 1e6);
    printf("Performance: %.2f GFLOPS\n", gflops);

    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    // Validate the result (C[0][0] == N)
    printf("C[0][0] = %.1f (expected %.1f)\n", h_C[0], (float)n);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);

    return 0;
}

4. CUDA Memory Management

4.1 cudaMalloc / cudaFree / cudaMemcpy

// Basic memory operations
float *d_ptr;
cudaMalloc(&d_ptr, size);           // allocate GPU memory
cudaMemset(d_ptr, 0, size);         // zero-initialize GPU memory
cudaMemcpy(d_ptr, h_ptr, size, cudaMemcpyHostToDevice);    // H->D
cudaMemcpy(h_ptr, d_ptr, size, cudaMemcpyDeviceToHost);    // D->H
cudaMemcpy(d_dst, d_src, size, cudaMemcpyDeviceToDevice);  // D->D
cudaFree(d_ptr);                    // free GPU memory

// 2D memory (useful for matrices)
float *d_matrix;
size_t pitch;
// pitch: actual bytes per row (including alignment padding)
cudaMallocPitch(&d_matrix, &pitch, width * sizeof(float), height);
cudaMemcpy2D(d_matrix, pitch,
             h_matrix, width * sizeof(float),
             width * sizeof(float), height,
             cudaMemcpyHostToDevice);

4.2 Unified Memory

Unified Memory is a memory model in which the CPU and GPU share the same virtual address space.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void incrementKernel(int* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        data[idx] += 1;
    }
}

int main() {
    const int N = 1 << 20;
    int *data;

    // Allocate Unified Memory - accessible from both CPU and GPU
    cudaMallocManaged(&data, N * sizeof(int));

    // Initialize on the CPU (no cudaMemcpy needed)
    for (int i = 0; i < N; i++) {
        data[i] = i;
    }

    // Process on the GPU
    incrementKernel<<<(N + 255) / 256, 256>>>(data, N);
    cudaDeviceSynchronize();

    // Read the result directly on the CPU (no cudaMemcpy needed)
    printf("data[0] = %d (expected 1)\n", data[0]);
    printf("data[N-1] = %d (expected %d)\n", data[N-1], N);

    cudaFree(data);
    return 0;
}

Unified Memory's main advantage is simpler code, but on-demand page migration can degrade performance. Prefetch and placement hints (available since Pascal) let you optimize data placement:

// Prefetch data to the GPU
cudaMemPrefetchAsync(data, N * sizeof(int), deviceId);
// Prefetch data back to the CPU
cudaMemPrefetchAsync(data, N * sizeof(int), cudaCpuDeviceId);
// Memory access hint
cudaMemAdvise(data, N * sizeof(int),
              cudaMemAdviseSetPreferredLocation, deviceId);

4.3 Tiled Matrix Multiplication with Shared Memory

Using Shared Memory dramatically reduces the number of Global Memory accesses.

#include <cuda_runtime.h>
#include <stdio.h>

#define TILE_SIZE 16

__global__ void matMulTiled(const float* A, const float* B, float* C, int n) {
    // Shared-memory tiles
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Loop over tiles
    for (int t = 0; t < (n + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Load the tile (each thread loads one element)
        int aCol = t * TILE_SIZE + threadIdx.x;
        int bRow = t * TILE_SIZE + threadIdx.y;

        tileA[threadIdx.y][threadIdx.x] = (row < n && aCol < n)
            ? A[row * n + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] = (bRow < n && col < n)
            ? B[bRow * n + col] : 0.0f;

        // Wait until every thread has finished loading
        __syncthreads();

        // Multiply within the tile
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }

        // Synchronize before the next tile is loaded
        __syncthreads();
    }

    if (row < n && col < n) {
        C[row * n + col] = sum;
    }
}

Why the tiled version is faster:

  • Naive version: each thread makes 2N Global Memory accesses
  • Tiled version: N/TILE_SIZE tile loads into Shared Memory, with each loaded element reused TILE_SIZE times
  • Global Memory traffic drops by a factor of TILE_SIZE (16x)

4.4 Pinned Memory (Page-Locked Memory)

#include <cuda_runtime.h>

// Compare transfer bandwidth: ordinary malloc vs pinned memory
void compareBandwidth(int N) {
    size_t bytes = N * sizeof(float);
    float *h_pageable, *h_pinned;
    float *d_data;

    // Ordinary pageable memory
    h_pageable = (float*)malloc(bytes);
    // Pinned (page-locked) memory
    cudaMallocHost(&h_pinned, bytes);
    cudaMalloc(&d_data, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time pageable transfers
    cudaEventRecord(start);
    for (int i = 0; i < 10; i++) {
        cudaMemcpy(d_data, h_pageable, bytes, cudaMemcpyHostToDevice);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float pageable_ms;
    cudaEventElapsedTime(&pageable_ms, start, stop);

    // Time pinned transfers
    cudaEventRecord(start);
    for (int i = 0; i < 10; i++) {
        cudaMemcpy(d_data, h_pinned, bytes, cudaMemcpyHostToDevice);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float pinned_ms;
    cudaEventElapsedTime(&pinned_ms, start, stop);

    printf("Pageable: %.2f GB/s\n",
           (bytes * 10.0) / (pageable_ms * 1e6));
    printf("Pinned:   %.2f GB/s\n",
           (bytes * 10.0) / (pinned_ms * 1e6));

    free(h_pageable);
    cudaFreeHost(h_pinned);
    cudaFree(d_data);
}

Pinned Memory is locked into physical RAM so the OS cannot page it out; DMA transfers then proceed without CPU involvement, making them roughly 2-3x faster. Overusing it, however, can hurt overall system memory performance.


5. Advanced CUDA Optimization

5.1 Memory Coalescing

Coalescing of Global Memory accesses is one of the most important optimizations. When the 32 threads of a Warp access consecutive memory addresses, the hardware services them with a single transaction.

// Bad: strided access (non-coalesced)
__global__ void badAccess(float* data, int N, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N / stride) {
        data[idx * stride] *= 2.0f;  // access with a stride
    }
}

// Good: sequential access (coalesced)
__global__ void goodAccess(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        data[idx] *= 2.0f;  // consecutive access
    }
}

// A coalesced matrix transpose
// Input: row-major order
// Output: the transposed matrix
#define TILE 32
#define PADDING 1  // avoids shared-memory bank conflicts

__global__ void transposeCoalesced(const float* in, float* out,
                                    int rows, int cols) {
    __shared__ float tile[TILE][TILE + PADDING];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Read input: coalesced
    if (x < cols && y < rows) {
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    }
    __syncthreads();

    // Write to the transposed position
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;

    // Write output: coalesced
    if (x < rows && y < cols) {
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
    }
}

5.2 Shared Memory Bank Conflicts

Shared Memory is divided into 32 banks, interleaved in 4-byte words. When threads of the same Warp access different addresses that fall in the same bank, the accesses are serialized.

// Bank-conflict demo: stride 1 is conflict-free; stride 2 causes a 2-way conflict
__global__ void bankConflictDemo() {
    __shared__ float shared[32];

    int tid = threadIdx.x;

    // No conflict: thread i reads shared[i] (each hits a different bank)
    float val1 = shared[tid];

    // 2-way bank conflict: e.g. threads 0 and 16 both hit bank 0
    float val2 = shared[tid * 2 % 32];

    // Every thread reads shared[0]: a broadcast, so no conflict
    float val3 = shared[0];

    (void)val1; (void)val2; (void)val3;
}

// Eliminating bank conflicts with padding
__global__ void noBankConflict(float* data) {
    // One padding element per row prevents conflicts on column access
    __shared__ float tile[32][33];  // 33 = 32 + 1 padding

    int tid = threadIdx.x;
    int bid = threadIdx.y;

    tile[bid][tid] = data[bid * 32 + tid];
    __syncthreads();

    // Column-wise access is now also conflict-free
    data[tid * 32 + bid] = tile[bid][tid];
}

5.3 Minimizing Warp Divergence

// Bad: branching within a Warp
__global__ void warpDivergenceBad(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        if (idx % 2 == 0) {         // only half the threads in the Warp take this path
            data[idx] = data[idx] * 2.0f;
        } else {
            data[idx] = data[idx] + 1.0f;  // the other half take this one
        }
        // the two paths execute one after the other → roughly 2x slower
    }
}

// Good: branch-free processing
__global__ void noWarpDivergence(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        float mask = (float)(idx % 2);  // 0.0 or 1.0
        // every thread performs the same computation
        data[idx] = data[idx] * (2.0f - mask) + mask;
    }
}

// Alternatively, rearrange the data so branches align with Warp boundaries
__global__ void warpAlignedBranch(float* even_data, float* odd_data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = idx / 32;

    if (warp_id % 2 == 0) {
        // every thread in this Warp takes the same branch → no divergence
        if (idx < N) even_data[idx] *= 2.0f;
    } else {
        if (idx < N) odd_data[idx] += 1.0f;
    }
}

5.4 Occupancy Optimization

Occupancy is the number of Warps concurrently active on an SM divided by the maximum number of Warps the SM supports.

#include <cuda_runtime.h>
#include <cuda_occupancy.h>

// Occupancy calculation example
void checkOccupancy() {
    // Kernel function pointer
    void* kernelFunc = (void*)matMulTiled;

    int blockSize;
    int minGridSize;

    // Automatically compute the optimal block size
    cudaOccupancyMaxPotentialBlockSize(
        &minGridSize,
        &blockSize,
        kernelFunc,
        0,    // dynamic shared memory
        0     // block size limit (0 = no limit)
    );

    printf("Optimal block size: %d\n", blockSize);
    printf("Minimum grid size: %d\n", minGridSize);

    // Compute the theoretical occupancy
    int maxActiveBlocks;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxActiveBlocks,
        kernelFunc,
        blockSize,
        0
    );

    int device;
    cudaGetDevice(&device);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    float occupancy = (maxActiveBlocks * blockSize / prop.warpSize)
                    / (float)(prop.maxThreadsPerMultiProcessor / prop.warpSize);

    printf("Theoretical occupancy: %.1f%%\n", occupancy * 100);
}

Register usage and Shared Memory consumption per block directly affect Occupancy:

// Try to raise Occupancy by limiting per-thread register usage
__global__ __launch_bounds__(256, 4)  // maxThreadsPerBlock=256, minBlocksPerSM=4
void optimizedKernel(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        data[idx] *= 2.0f;
    }
}

5.5 CUDA Streams and Asynchronous Execution

#include <cuda_runtime.h>

#define NUM_STREAMS 4
#define CHUNK_SIZE (1 << 20)

void asyncProcessing(float* h_in, float* h_out, int N) {
    // NOTE: h_in/h_out should be pinned (cudaMallocHost) for the copies to truly overlap
    cudaStream_t streams[NUM_STREAMS];
    float *d_in[NUM_STREAMS], *d_out[NUM_STREAMS];

    // Create streams and allocate per-stream buffers
    for (int i = 0; i < NUM_STREAMS; i++) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&d_in[i],  CHUNK_SIZE * sizeof(float));
        cudaMalloc(&d_out[i], CHUNK_SIZE * sizeof(float));
    }

    // Pipelined, chunk-by-chunk processing
    int numChunks = (N + CHUNK_SIZE - 1) / CHUNK_SIZE;

    for (int chunk = 0; chunk < numChunks; chunk++) {
        int streamId = chunk % NUM_STREAMS;
        int offset = chunk * CHUNK_SIZE;
        int size = min(CHUNK_SIZE, N - offset);

        // Async copy and kernel launch (overlapped across streams)
        cudaMemcpyAsync(d_in[streamId], h_in + offset,
                        size * sizeof(float),
                        cudaMemcpyHostToDevice, streams[streamId]);

        // Launch the kernel asynchronously in this stream
        int blockSize = 256;
        int gridSize = (size + blockSize - 1) / blockSize;
        // (the actual kernel launch would go here)

        cudaMemcpyAsync(h_out + offset, d_out[streamId],
                        size * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[streamId]);
    }

    // Wait for all streams to finish
    for (int i = 0; i < NUM_STREAMS; i++) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
        cudaFree(d_in[i]);
        cudaFree(d_out[i]);
    }
}

Managing dependencies between streams:

// Synchronizing two streams with a cudaEvent
cudaEvent_t event;
cudaEventCreate(&event);

// Record an event in stream 0
cudaEventRecord(event, stream0);

// Make stream 1 wait for the event
cudaStreamWaitEvent(stream1, event, 0);

// Stream 1 now runs only after stream 0 reaches the recorded point

5.6 Measuring Performance with cudaEvent

#include <cuda_runtime.h>

float measureKernelTime(void (*kernel)(float*, int),
                         float* d_data, int N, int runs) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up run
    kernel(d_data, N);
    cudaDeviceSynchronize();

    // Measure
    cudaEventRecord(start);
    for (int i = 0; i < runs; i++) {
        kernel(d_data, N);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float totalMs = 0;
    cudaEventElapsedTime(&totalMs, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return totalMs / runs;  // average execution time (ms)
}

6. cuBLAS - Linear Algebra Library

cuBLAS is NVIDIA's high-performance implementation of BLAS (Basic Linear Algebra Subprograms). It uses Tensor Cores automatically and delivers far higher performance than hand-written kernels.

Reference: https://docs.nvidia.com/cuda/cublas/

6.1 Basic Setup

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define CUBLAS_CHECK(err) \
    do { \
        cublasStatus_t e = (err); \
        if (e != CUBLAS_STATUS_SUCCESS) { \
            fprintf(stderr, "cuBLAS error: %d at %s:%d\n", \
                    e, __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

int main() {
    cublasHandle_t handle;
    CUBLAS_CHECK(cublasCreate(&handle));

    // cuBLAS uses column-major order (Fortran style)
    // Take care when passing C/C++ row-major matrices

    // ... perform operations ...

    CUBLAS_CHECK(cublasDestroy(handle));
    return 0;
}

6.2 SGEMM - Single-Precision Matrix Multiplication

Because cuBLAS uses column-major storage, a row-major product C = A * B is computed as C^T = B^T * A^T in column-major terms.

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

// C = alpha * A * B + beta * C
// A: m x k, B: k x n, C: m x n (row-major)
void cublasMatMul(cublasHandle_t handle,
                   int m, int n, int k,
                   const float* d_A, const float* d_B, float* d_C) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;

    // cuBLAS is column-major: compute C = A*B as C^T = B^T * A^T
    // Actual call: C(col) = B(col) * A(col), i.e. (A*B) seen in row-major
    cublasSgemm(handle,
                CUBLAS_OP_N,  // do not transpose B
                CUBLAS_OP_N,  // do not transpose A
                n, m, k,       // output: n x m (n rows, m cols, column-major)
                &alpha,
                d_B, n,        // B: n x k (column-major view)
                d_A, k,        // A: k x m (column-major view)
                &beta,
                d_C, n);       // C: n x m (column-major view)
}

int main() {
    int m = 1024, n = 1024, k = 1024;
    size_t bytesA = m * k * sizeof(float);
    size_t bytesB = k * n * sizeof(float);
    size_t bytesC = m * n * sizeof(float);

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytesA);
    cudaMalloc(&d_B, bytesB);
    cudaMalloc(&d_C, bytesC);

    // Initialize (a real program would use cudaMemcpy)
    cudaMemset(d_A, 0, bytesA);
    cudaMemset(d_B, 0, bytesB);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Allow Tensor Core use (automatic)
    cublasSetMathMode(handle, CUBLAS_DEFAULT_MATH);

    // Force FP16 Tensor Core math (legacy setting)
    // cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cublasMatMul(handle, m, n, k, d_A, d_B, d_C);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    double tflops = 2.0 * m * n * k / (ms * 1e9);
    printf("cuBLAS SGEMM: %.2f ms, %.2f TFLOPS\n", ms, tflops);

    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}

6.3 Batched GEMM

Processes many small matrix multiplications in a single call:

// Batched SGEMM: one GEMM per matrix in the batch
void batchedMatMul(cublasHandle_t handle,
                    int m, int n, int k, int batchCount,
                    float** d_Aarray, float** d_Barray, float** d_Carray) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;

    cublasSgemmBatched(handle,
                       CUBLAS_OP_N, CUBLAS_OP_N,
                       n, m, k,
                       &alpha,
                       (const float**)d_Barray, n,
                       (const float**)d_Aarray, k,
                       &beta,
                       d_Carray, n,
                       batchCount);
}

// Strided batched GEMM (when the batch is contiguous in memory)
void stridedBatchedMatMul(cublasHandle_t handle,
                           int m, int n, int k, int batchCount,
                           float* d_A, float* d_B, float* d_C) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;

    long long strideA = (long long)m * k;
    long long strideB = (long long)k * n;
    long long strideC = (long long)m * n;

    cublasSgemmStridedBatched(handle,
                               CUBLAS_OP_N, CUBLAS_OP_N,
                               n, m, k,
                               &alpha,
                               d_B, n, strideB,
                               d_A, k, strideA,
                               &beta,
                               d_C, n, strideC,
                               batchCount);
}

7. cuDNN - Deep Learning Primitives

cuDNN provides high-performance implementations of deep learning primitives (convolution, pooling, normalization, and more). Most deep learning frameworks, including PyTorch and TensorFlow, use cuDNN as their backend.

Reference: https://docs.nvidia.com/deeplearning/cudnn/

7.1 cuDNN Convolution

#include <cudnn.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define CUDNN_CHECK(err) \
    do { \
        cudnnStatus_t e = (err); \
        if (e != CUDNN_STATUS_SUCCESS) { \
            fprintf(stderr, "cuDNN error: %s at %s:%d\n", \
                    cudnnGetErrorString(e), __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

int main() {
    cudnnHandle_t handle;
    CUDNN_CHECK(cudnnCreate(&handle));

    // Input tensor: N=1, C=3, H=224, W=224 (NCHW layout)
    int N = 1, C = 3, H = 224, W = 224;
    // Filter: K=64 output channels, C=3, R=3, S=3
    int K = 64, R = 3, S = 3;
    int pad_h = 1, pad_w = 1;
    int stride_h = 1, stride_w = 1;
    int dilation_h = 1, dilation_w = 1;

    // Tensor descriptors
    cudnnTensorDescriptor_t inputDesc, outputDesc;
    cudnnFilterDescriptor_t filterDesc;
    cudnnConvolutionDescriptor_t convDesc;

    CUDNN_CHECK(cudnnCreateTensorDescriptor(&inputDesc));
    CUDNN_CHECK(cudnnCreateTensorDescriptor(&outputDesc));
    CUDNN_CHECK(cudnnCreateFilterDescriptor(&filterDesc));
    CUDNN_CHECK(cudnnCreateConvolutionDescriptor(&convDesc));

    // Configure the descriptors
    CUDNN_CHECK(cudnnSetTensor4dDescriptor(
        inputDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, N, C, H, W));

    CUDNN_CHECK(cudnnSetFilter4dDescriptor(
        filterDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, K, C, R, S));

    CUDNN_CHECK(cudnnSetConvolution2dDescriptor(
        convDesc, pad_h, pad_w, stride_h, stride_w,
        dilation_h, dilation_w, CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT));

    // Enable Tensor Core math
    CUDNN_CHECK(cudnnSetConvolutionMathType(convDesc, CUDNN_TENSOR_OP_MATH));

    // Compute the output dimensions
    int out_N, out_C, out_H, out_W;
    CUDNN_CHECK(cudnnGetConvolution2dForwardOutputDim(
        convDesc, inputDesc, filterDesc,
        &out_N, &out_C, &out_H, &out_W));
    printf("Output: %d x %d x %d x %d\n", out_N, out_C, out_H, out_W);

    CUDNN_CHECK(cudnnSetTensor4dDescriptor(
        outputDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
        out_N, out_C, out_H, out_W));

    // Pick the fastest algorithm
    cudnnConvolutionFwdAlgoPerf_t perfResults[10];
    int returnedCount;
    CUDNN_CHECK(cudnnFindConvolutionForwardAlgorithm(
        handle, inputDesc, filterDesc, convDesc, outputDesc,
        10, &returnedCount, perfResults));

    cudnnConvolutionFwdAlgo_t algo = perfResults[0].algo;
    printf("Best algorithm: %d (%.3f ms)\n",
           algo, perfResults[0].time);

    // Query the workspace size and allocate it
    size_t workspaceSize;
    CUDNN_CHECK(cudnnGetConvolutionForwardWorkspaceSize(
        handle, inputDesc, filterDesc, convDesc, outputDesc,
        algo, &workspaceSize));

    void* workspace;
    cudaMalloc(&workspace, workspaceSize);

    // Allocate device memory
    float *d_input, *d_filter, *d_output;
    cudaMalloc(&d_input,  N * C * H * W * sizeof(float));
    cudaMalloc(&d_filter, K * C * R * S * sizeof(float));
    cudaMalloc(&d_output, out_N * out_C * out_H * out_W * sizeof(float));

    // Convolution forward pass
    float alpha = 1.0f, beta = 0.0f;
    CUDNN_CHECK(cudnnConvolutionForward(
        handle, &alpha,
        inputDesc, d_input,
        filterDesc, d_filter,
        convDesc, algo, workspace, workspaceSize,
        &beta,
        outputDesc, d_output));

    printf("Convolution completed successfully!\n");

    // Cleanup
    cudaFree(d_input); cudaFree(d_filter);
    cudaFree(d_output); cudaFree(workspace);
    cudnnDestroyTensorDescriptor(inputDesc);
    cudnnDestroyTensorDescriptor(outputDesc);
    cudnnDestroyFilterDescriptor(filterDesc);
    cudnnDestroyConvolutionDescriptor(convDesc);
    cudnnDestroy(handle);

    return 0;
}

8. Mixed-Precision Training

8.1 Comparing Numeric Formats

Format   Bits   Exponent   Mantissa   Dynamic range   Precision
FP32     32     8          23         ~1e38           ~7 digits
FP16     16     5          10         ~65504          ~3 digits
BF16     16     8          7          ~1e38           ~2 digits
TF32     19     8          10         ~1e38           ~3 digits
FP8      8      5/4        2/3        limited         low

FP16 halves memory usage and unlocks Tensor Cores, but its narrow dynamic range makes overflow and underflow a real risk. BF16 shares FP32's exponent range and is therefore more stable.

8.2 Using FP16 in CUDA

#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <stdio.h>

// FP16 vector addition
__global__ void fp16VectorAdd(const __half* A, const __half* B,
                               __half* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        C[idx] = __hadd(A[idx], B[idx]);  // __half addition
    }
}

// FP16 vector addition (vectorized: two elements per thread)
__global__ void fp16VectorAddVec2(const __half2* A, const __half2* B,
                                   __half2* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N / 2) {
        C[idx] = __hadd2(A[idx], B[idx]);  // two additions at once
    }
}

// Conversions between FP32 and FP16
__global__ void convertF32toF16(const float* f32, __half* f16, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        f16[idx] = __float2half(f32[idx]);
    }
}

__global__ void convertF16toF32(const __half* f16, float* f32, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        f32[idx] = __half2float(f16[idx]);
    }
}

8.3 PyTorch AMP (Automatic Mixed Precision)

import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

# Model definition
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 10)
).cuda()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# GradScaler: loss scaling guards against FP16 underflow
scaler = GradScaler()

def train_step(x, y):
    optimizer.zero_grad()

    # autocast: ops run in FP16 where it is safe to do so
    with autocast(dtype=torch.float16):
        output = model(x)
        loss = criterion(output, y)

    # Scaled backward pass
    scaler.scale(loss).backward()

    # Unscale first, then clip gradients (optional)
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # scaler.step skips the update if gradients contain inf/NaN
    scaler.step(optimizer)
    scaler.update()  # adjust the scale factor

    return loss.item()

# BF16 variant (Ampere or newer)
def train_step_bf16(x, y):
    optimizer.zero_grad()
    with autocast(dtype=torch.bfloat16):
        output = model(x)
        loss = criterion(output, y)
    loss.backward()  # BF16 does not need loss scaling
    optimizer.step()
    return loss.item()

# Training loop
batch_size = 128
for epoch in range(10):
    x = torch.randn(batch_size, 1024).cuda()
    y = torch.randint(0, 10, (batch_size,)).cuda()
    loss = train_step(x, y)
    if epoch % 2 == 0:
        print(f"Epoch {epoch}: loss = {loss:.4f}, "
              f"scale = {scaler.get_scale():.0f}")

8.4 Loss Scaling

FP16's narrow dynamic range (5-bit exponent; smallest normal value around 6e-5) means small gradients can flush to zero (underflow). Loss scaling multiplies the loss by a large constant so gradients stay representable, then divides them back out before the optimizer update.

# Manual loss scaling (for illustration)
scale_factor = 2**15  # 32768

with autocast():
    loss = criterion(output, y)

# Scale된 loss로 backward
(loss * scale_factor).backward()

# Gradient unscale
for param in model.parameters():
    if param.grad is not None:
        param.grad /= scale_factor

# Check for inf/NaN before stepping
valid = all(
    torch.isfinite(p.grad).all()
    for p in model.parameters()
    if p.grad is not None
)

if valid:
    optimizer.step()
else:
    print("Skipping step: invalid gradients")

GradScaler automates this and adjusts the scale factor dynamically (doubling it after a stretch of successful steps, halving it whenever inf/NaN gradients appear).
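The dynamic adjustment can be sketched in plain Python; the constants below mirror GradScaler's defaults (growth_factor=2.0, backoff_factor=0.5, growth_interval=2000):

```python
def update_scale(scale, good_steps, found_inf,
                 growth_factor=2.0, backoff_factor=0.5, growth_interval=2000):
    """One GradScaler-style update: back off on inf/NaN,
    grow after growth_interval consecutive clean steps."""
    if found_inf:
        return scale * backoff_factor, 0      # halve and reset the streak
    good_steps += 1
    if good_steps == growth_interval:
        return scale * growth_factor, 0       # double after enough clean steps
    return scale, good_steps

scale, streak = 65536.0, 0
scale, streak = update_scale(scale, streak, found_inf=True)
print(scale)  # 32768.0 -> halved after an overflowed step
for _ in range(2000):
    scale, streak = update_scale(scale, streak, found_inf=False)
print(scale)  # 65536.0 -> doubled back after 2000 clean steps
```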


9. CUDA Profiling and Debugging

9.1 Nsight Systems

Nsight Systems is a system-wide profiling tool:

# Command-line profiling
nsys profile --trace=cuda,nvtx,osrt \
             --output=profile_output \
             ./my_application

# Open the result file (GUI)
nsys-ui profile_output.nsys-rep

# Generate a text report
nsys stats profile_output.nsys-rep

9.2 Nsight Compute

Detailed kernel-level analysis:

# Profile a specific kernel
ncu --set full \
    --kernel-name matMulTiled \
    --launch-count 1 \
    --output kernel_profile \
    ./my_application

# Collect only selected metrics
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
l1tex__t_sector_hit_rate.pct,\
lts__t_sector_hit_rate.pct \
    ./my_application

9.3 NVTX Markers for Profiling Ranges

#include <nvtx3/nvToolsExt.h>

void profiledFunction(float* d_data, const float* h_data,
                      float* h_output, int N) {
    // Named ranges show up on the Nsight timeline
    nvtxRangePushA("Data Transfer H2D");
    cudaMemcpy(d_data, h_data, N * sizeof(float),
               cudaMemcpyHostToDevice);
    nvtxRangePop();

    nvtxRangePushA("Kernel Execution");
    myKernel<<<(N + 255) / 256, 256>>>(d_data, N);
    cudaDeviceSynchronize();
    nvtxRangePop();

    nvtxRangePushA("Data Transfer D2H");
    cudaMemcpy(h_output, d_data, N * sizeof(float),
               cudaMemcpyDeviceToHost);
    nvtxRangePop();
}

9.4 Compute Sanitizer

The legacy cuda-memcheck tool has been replaced by compute-sanitizer:

# Detect memory errors
compute-sanitizer --tool memcheck ./my_application

# Detect race conditions
compute-sanitizer --tool racecheck ./my_application

# Detect uninitialized memory access
compute-sanitizer --tool initcheck ./my_application

# Detect synchronization issues
compute-sanitizer --tool synccheck ./my_application

10. Custom CUDA Operations for Real-World Deep Learning

10.1 PyTorch Custom CUDA Extension

Implementing complex fused operations yourself minimizes global-memory round-trips. Reference: https://pytorch.org/tutorials/advanced/cpp_extension.html
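As a back-of-the-envelope check of why fusion pays off (illustrative sizes; the small bias vector is ignored):

```python
batch, features, elem_bytes = 128, 4096, 4  # FP32 elements
n = batch * features

# Unfused: bias-add reads x and writes a temporary t,
# then ReLU reads t and writes y -> four full passes over the tensor
unfused_bytes = 4 * n * elem_bytes

# Fused: a single kernel reads x and writes y -> two passes
fused_bytes = 2 * n * elem_bytes

print(unfused_bytes / fused_bytes)  # 2.0 -> half the global-memory traffic
```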

File layout:

custom_op/
  setup.py
  fused_ops.cpp
  fused_ops_kernel.cu

fused_ops_kernel.cu:

#include <cuda_runtime.h>
#include <cuda_fp16.h>

// Fused Bias + ReLU kernel
// One fused pass instead of separate bias-add and ReLU kernels
__global__ void fusedBiasReluKernel(
    float* output,
    const float* input,
    const float* bias,
    int batch_size,
    int features
) {
    int b = blockIdx.x;
    int f = blockIdx.y * blockDim.x + threadIdx.x;

    if (b < batch_size && f < features) {
        float val = input[b * features + f] + bias[f];
        output[b * features + f] = fmaxf(val, 0.0f);  // ReLU
    }
}

// Fused Bias + GELU kernel
// GELU(x) = x * Phi(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
__device__ float gelu_approx(float x) {
    const float sqrt_2_over_pi = 0.7978845608f;
    const float coeff = 0.044715f;
    float x3 = x * x * x;
    float inner = sqrt_2_over_pi * (x + coeff * x3);
    return 0.5f * x * (1.0f + tanhf(inner));
}

__global__ void fusedBiasGeluKernel(
    float* output,
    const float* input,
    const float* bias,
    int batch_size,
    int features
) {
    int b = blockIdx.x;
    int f = blockIdx.y * blockDim.x + threadIdx.x;

    if (b < batch_size && f < features) {
        float val = input[b * features + f] + bias[f];
        output[b * features + f] = gelu_approx(val);
    }
}

// Layer Normalization kernel
// Each block handles one sample: compute mean and variance, then normalize
__global__ void layerNormKernel(
    float* output,
    const float* input,
    const float* gamma,
    const float* beta,
    int batch_size,
    int features,
    float eps
) {
    int b = blockIdx.x;

    if (b >= batch_size) return;

    __shared__ float shared_mean;
    __shared__ float shared_var;

    // The shared accumulators must be zeroed before the atomicAdds below
    if (threadIdx.x == 0) {
        shared_mean = 0.0f;
        shared_var  = 0.0f;
    }
    __syncthreads();

    // Step 1: mean (reduction)
    float sum = 0.0f;
    for (int f = threadIdx.x; f < features; f += blockDim.x) {
        sum += input[b * features + f];
    }

    // Intra-warp reduction
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        sum += __shfl_down_sync(0xffffffff, sum, offset);
    }

    if (threadIdx.x % warpSize == 0) {
        atomicAdd(&shared_mean, sum);
    }
    __syncthreads();

    float mean = shared_mean / features;

    // Step 2: variance
    float var_sum = 0.0f;
    for (int f = threadIdx.x; f < features; f += blockDim.x) {
        float diff = input[b * features + f] - mean;
        var_sum += diff * diff;
    }

    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        var_sum += __shfl_down_sync(0xffffffff, var_sum, offset);
    }

    if (threadIdx.x % warpSize == 0) {
        atomicAdd(&shared_var, var_sum);
    }
    __syncthreads();

    float var = shared_var / features;
    float inv_std = rsqrtf(var + eps);

    // Step 3: normalize, then scale and shift
    for (int f = threadIdx.x; f < features; f += blockDim.x) {
        float normalized = (input[b * features + f] - mean) * inv_std;
        output[b * features + f] = gamma[f] * normalized + beta[f];
    }
}

// Host-side launchers, callable from fused_ops.cpp
// (the <<< >>> launch syntax requires nvcc, so it lives in this .cu file)
void launchFusedBiasRelu(float* output, const float* input, const float* bias,
                         int batch_size, int features) {
    dim3 block(256);
    dim3 grid(batch_size, (features + 255) / 256);
    fusedBiasReluKernel<<<grid, block>>>(output, input, bias,
                                         batch_size, features);
}

void launchFusedBiasGelu(float* output, const float* input, const float* bias,
                         int batch_size, int features) {
    dim3 block(256);
    dim3 grid(batch_size, (features + 255) / 256);
    fusedBiasGeluKernel<<<grid, block>>>(output, input, bias,
                                         batch_size, features);
}

fused_ops.cpp:

#include <torch/extension.h>
#include <cuda_runtime.h>

// Launchers defined in fused_ops_kernel.cu
// (a .cpp file is compiled by the host compiler, which cannot parse the
//  <<< >>> launch syntax, so kernels are launched through these wrappers)
void launchFusedBiasRelu(float* output, const float* input, const float* bias,
                         int batch_size, int features);
void launchFusedBiasGelu(float* output, const float* input, const float* bias,
                         int batch_size, int features);

// Fused Bias + ReLU PyTorch wrapper
torch::Tensor fused_bias_relu(torch::Tensor input, torch::Tensor bias) {
    TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
    TORCH_CHECK(bias.is_cuda(), "bias must be a CUDA tensor");
    TORCH_CHECK(input.is_contiguous(), "input must be contiguous");

    int batch_size = input.size(0);
    int features = input.size(1);
    auto output = torch::empty_like(input);

    launchFusedBiasRelu(
        output.data_ptr<float>(),
        input.data_ptr<float>(),
        bias.data_ptr<float>(),
        batch_size, features
    );

    return output;
}

// Fused Bias + GELU wrapper
torch::Tensor fused_bias_gelu(torch::Tensor input, torch::Tensor bias) {
    TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
    TORCH_CHECK(bias.is_cuda(), "bias must be a CUDA tensor");

    int batch_size = input.size(0);
    int features = input.size(1);
    auto output = torch::empty_like(input);

    launchFusedBiasGelu(
        output.data_ptr<float>(),
        input.data_ptr<float>(),
        bias.data_ptr<float>(),
        batch_size, features
    );

    return output;
}

// Python bindings
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("fused_bias_relu", &fused_bias_relu,
          "Fused Bias + ReLU (CUDA)");
    m.def("fused_bias_gelu", &fused_bias_gelu,
          "Fused Bias + GELU (CUDA)");
}

setup.py:

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='fused_ops',
    ext_modules=[
        CUDAExtension(
            name='fused_ops',
            sources=[
                'fused_ops.cpp',
                'fused_ops_kernel.cu',
            ],
            extra_compile_args={
                'cxx': ['-O3'],
                'nvcc': [
                    '-O3',
                    '-arch=sm_80',
                    '--use_fast_math',
                    '-Xptxas', '-v',  # print register usage
                ],
            }
        )
    ],
    cmdclass={
        'build_ext': BuildExtension
    }
)

Build and use:

# Build
cd custom_op
python setup.py install

# Or compile just-in-time (JIT)
python -c "
import torch
from torch.utils.cpp_extension import load

fused_ops = load(
    name='fused_ops',
    sources=['fused_ops.cpp', 'fused_ops_kernel.cu'],
    extra_cuda_cflags=['-O3', '-arch=sm_80']
)
print('Custom ops loaded!')
"
import torch
import fused_ops

# Use the custom ops
batch_size, features = 128, 4096
x = torch.randn(batch_size, features, device='cuda')
bias = torch.randn(features, device='cuda')

# PyTorch baseline
y_ref = torch.relu(x + bias)

# Custom fused implementation
y_custom = fused_ops.fused_bias_relu(x, bias)

# Verify results
print(f"Max diff: {(y_ref - y_custom).abs().max():.6f}")

# Performance comparison
import time

def benchmark(fn, *args, warmup=10, runs=100):
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(runs):
        fn(*args)
    torch.cuda.synchronize()
    end = time.perf_counter()
    return (end - start) / runs * 1000  # ms

ref_time = benchmark(lambda x, b: torch.relu(x + b), x, bias)
custom_time = benchmark(fused_ops.fused_bias_relu, x, bias)

print(f"PyTorch: {ref_time:.3f} ms")
print(f"Custom (fused): {custom_time:.3f} ms")
print(f"Speedup: {ref_time / custom_time:.2f}x")

10.2 Flash Attention: A Conceptual Implementation

Flash Attention reduces the memory footprint of attention from O(N²) to O(N):

// Flash Attention forward pass (simplified sketch)
// For a real implementation, see Tri Dao's flash-attention library
__global__ void flashAttentionForward(
    float* output,       // [B, H, N, D]
    const float* Q,      // [B, H, N, D]
    const float* K,      // [B, H, N, D]
    const float* V,      // [B, H, N, D]
    float* L,            // [B, H, N] - log-sum-exp
    int B, int H, int N, int D,
    int BLOCK_N          // tile size
) {
    // Each block processes BLOCK_N rows of Q
    int b = blockIdx.z;
    int h = blockIdx.y;
    int q_block = blockIdx.x;

    int q_start = q_block * BLOCK_N;

    extern __shared__ float smem[];
    float* q_tile = smem;                // BLOCK_N x D
    float* k_tile = smem + BLOCK_N * D;  // BLOCK_N x D
    float* v_tile = k_tile + BLOCK_N * D;

    // Load the Q tile
    for (int i = threadIdx.x; i < BLOCK_N * D; i += blockDim.x) {
        int qi = i / D, di = i % D;
        int q_idx = q_start + qi;
        q_tile[i] = (q_idx < N)
            ? Q[((b * H + h) * N + q_idx) * D + di] : 0.0f;
    }

    // Running statistics (simplified: assumes BLOCK_N <= 32, D <= 64)
    float m_prev[32];                    // running max
    float l_prev[32] = {};               // running sum
    float o_acc[32 * 64] = {};           // output accumulator
    for (int i = 0; i < 32; i++) m_prev[i] = -INFINITY;

    // Iterate over the K, V tiles
    for (int kv_block = 0; kv_block < (N + BLOCK_N - 1) / BLOCK_N; kv_block++) {
        int kv_start = kv_block * BLOCK_N;

        // Load the K tile
        for (int i = threadIdx.x; i < BLOCK_N * D; i += blockDim.x) {
            int ki = i / D, di = i % D;
            int k_idx = kv_start + ki;
            k_tile[i] = (k_idx < N)
                ? K[((b * H + h) * N + k_idx) * D + di] : 0.0f;
        }
        __syncthreads();

        // Each thread handles one row of Q
        int tid = threadIdx.x;
        if (tid < BLOCK_N && (q_start + tid) < N) {
            // Compute QK^T for this tile and track the running max
            float m_new = m_prev[tid];
            for (int k = 0; k < BLOCK_N && (kv_start + k) < N; k++) {
                float dot = 0.0f;
                for (int d = 0; d < D; d++) {
                    dot += q_tile[tid * D + d] * k_tile[k * D + d];
                }
                dot /= sqrtf((float)D);
                m_new = fmaxf(m_new, dot);
            }

            // Numerically stable online-softmax update
            float l_new = expf(m_prev[tid] - m_new) * l_prev[tid];
            for (int k = 0; k < BLOCK_N && (kv_start + k) < N; k++) {
                // ... accumulate the weighted sum with V (omitted for brevity)
            }
        }
        __syncthreads();
    }
}
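The online-softmax bookkeeping the kernel sketches above (running max m, running sum l, and a rescaled output accumulator) can be checked end-to-end in NumPy. This is a sketch of the algorithm under the same tiling idea, not any library's API:

```python
import numpy as np

def flash_attention_ref(Q, K, V, block=16):
    """Tiled attention: O(N) extra state (m, l) instead of the N x N score matrix."""
    N, D = Q.shape
    O = np.zeros((N, D))
    m = np.full(N, -np.inf)   # running max of scores per query row
    l = np.zeros(N)           # running sum of exp(score - m)
    scale = 1.0 / np.sqrt(D)
    for s in range(0, N, block):
        Kb, Vb = K[s:s+block], V[s:s+block]
        S = (Q @ Kb.T) * scale                    # scores against this K/V tile
        m_new = np.maximum(m, S.max(axis=1))      # updated running max
        corr = np.exp(m - m_new)                  # rescale the old accumulators
        p = np.exp(S - m_new[:, None])
        l = l * corr + p.sum(axis=1)
        O = O * corr[:, None] + p @ Vb
        m = m_new
    return O / l[:, None]

# Cross-check against the naive O(N^2)-memory attention
np.random.seed(0)
Q, K, V = (np.random.randn(8, 4) for _ in range(3))
S = (Q @ K.T) / np.sqrt(4)
P = np.exp(S - S.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
assert np.allclose(flash_attention_ref(Q, K, V, block=4), P @ V)
```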

11. Performance Optimization Checklist

11.1 Memory Optimization

  1. Verify global memory coalescing: threads within a warp should access contiguous addresses
  2. Use shared memory: cache repeatedly accessed data in shared memory
  3. Eliminate bank conflicts: add padding to shared memory access patterns
  4. Avoid register spilling: check register usage with --ptxas-options=-v
  5. Minimize transfers: batch H2D/D2H copies and make them asynchronous

11.2 Execution Optimization

  1. Maximize occupancy: use cudaOccupancyMaxPotentialBlockSize
  2. Eliminate warp divergence: align branches to warp boundaries
  3. Pipeline with streams: overlap data transfers with computation
  4. Fuse kernels: merge multiple stages into one kernel to cut memory round-trips
  5. Use mixed precision: FP16/BF16 where Tensor Cores can be engaged
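Occupancy (item 1 above) can be estimated with simple resource arithmetic. A rough model under assumed A100-style per-SM limits (2048 resident threads, a 65536-register file), ignoring shared-memory and per-SM block-count limits:

```python
def occupancy(block_size, regs_per_thread, max_threads=2048, reg_file=65536):
    """Fraction of the SM's thread slots that can actually be resident."""
    blocks_by_threads = max_threads // block_size
    blocks_by_regs = reg_file // (regs_per_thread * block_size)
    active_threads = min(blocks_by_threads, blocks_by_regs) * block_size
    return active_threads / max_threads

print(occupancy(256, 32))  # 1.0 -> registers exactly cover 2048 threads
print(occupancy(256, 64))  # 0.5 -> doubling register pressure halves occupancy
```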

11.3 Use the Libraries

  1. cuBLAS: always prefer cuBLAS over hand-written matrix kernels
  2. cuDNN: use cuDNN for deep learning primitives
  3. cuSPARSE: sparse matrix operations
  4. Thrust: parallel algorithms (sort, reduce, scan, ...)
  5. CUB: block- and warp-level collective primitives

12. Benchmark Reference Numbers

Reference figures on an A100 80GB GPU:

Operation              Size         Time     Performance
Vector add (FP32)      1M elements  0.1 ms   80 GB/s
MatMul naive           1024x1024    8 ms     2.6 TFLOPS
MatMul tiled           1024x1024    2 ms     10.4 TFLOPS
cuBLAS SGEMM           1024x1024    0.3 ms   72 TFLOPS
cuBLAS HGEMM           4096x4096    1.5 ms   183 TFLOPS
Convolution (cuDNN)    224x224x64   0.5 ms   -


CUDA Programming Complete Guide: GPU Parallel Computing Zero to Hero

Overview

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA. It enables developers to harness the thousands of cores in a GPU to achieve computational throughput impossible with CPU alone. CUDA has become a foundational technology across deep learning, scientific simulation, image processing, and many other domains.

This guide targets both developers new to CUDA and engineers seeking advanced optimization techniques, walking through GPU architecture, kernel development, memory hierarchy, and real-world custom deep learning operators step by step.


1. GPU Architecture

1.1 CPU vs GPU Design Philosophy

A CPU is a general-purpose processor designed for complex control flow and low latency. Modern CPUs have 16–64 cores, each optimized for high-performance single-threaded execution. A GPU, by contrast, has thousands of simpler cores and operates under the SIMT (Single Instruction, Multiple Threads) model — applying the same operation to large volumes of data simultaneously.

For an NVIDIA A100 GPU:

  • 6912 CUDA Cores (FP32 operations)
  • 432 Tensor Cores (matrix operation acceleration)
  • 80 GB HBM2e memory (2 TB/s bandwidth)
  • 312 TFLOPS (FP16 Tensor Core peak)

1.2 NVIDIA GPU Architecture

Streaming Multiprocessor (SM)

The SM is the fundamental execution unit of a GPU. The A100 has 108 SMs, each containing:

  • 64 CUDA Cores (FP32)
  • 4 Tensor Cores (3rd generation)
  • 256 KB L1 Cache / Shared Memory (configurable)
  • 32 Load/Store Units
  • 4 Special Function Units (SFUs)
  • Up to 2048 concurrent threads

CUDA Core vs Tensor Core

A CUDA Core processes a single FP32 or INT32 operation per cycle. A Tensor Core accelerates matrix-multiply-accumulate (MMA) operations in hardware. The 3rd-generation Tensor Cores in A100 process a 4x4 FP16 matrix operation (16 FMAs) per cycle, delivering more than 8x the throughput of a CUDA Core.

1.3 Ampere Architecture (A100)

A100 GPU Structure
┌─────────────────────────────────────────┐
│  GPC (Graphics Processing Cluster) x 8  │
│  ┌─────────────────────────────────┐    │
│  │ TPC (Texture Processing Cluster)│    │
│  │  ┌──────────┐  ┌──────────┐     │    │
│  │  │    SM    │  │    SM    │     │    │
│  │  │ 64 CUDA  │  │ 64 CUDA  │     │    │
│  │  │  4 TC    │  │  4 TC    │     │    │
│  │  └──────────┘  └──────────┘     │    │
│  └─────────────────────────────────┘    │
│                                         │
│  L2 Cache: 40 MB                        │
│  HBM2e: 80 GB @ 2 TB/s                  │
└─────────────────────────────────────────┘

A100 highlights:

  • MIG (Multi-Instance GPU): Partition a single GPU into up to 7 independent instances
  • NVLink 3.0: 600 GB/s bidirectional GPU-to-GPU bandwidth
  • TF32: New numeric format combining FP32 range with FP16 precision
  • Structural Sparsity: 2:4 sparsity pattern doubles Tensor Core throughput

1.4 Hopper Architecture (H100)

H100, released in 2022, is NVIDIA's latest data-center GPU:

  • 132 SMs, each with 128 CUDA Cores
  • 4th-gen Tensor Cores: FP8 support, up to 6x A100 performance
  • Transformer Engine: Per-layer automatic FP8/FP16 switching for transformer models
  • NVLink 4.0: 900 GB/s bidirectional bandwidth
  • HBM3: 80 GB @ 3.35 TB/s

1.5 Thread Hierarchy

CUDA uses a three-level thread hierarchy:

Grid  (entire kernel launch)
  └── Block (runs on one SM; threads share shared memory)
        └── Thread (individual execution unit)

Each level is three-dimensional:

// Example: 2D Grid of 2D Blocks
dim3 gridDim(32, 32, 1);    // 32 x 32 = 1024 blocks
dim3 blockDim(16, 16, 1);   // 16 x 16 = 256 threads per block
// Total threads: 1024 * 256 = 262,144

Warp

A warp is the SM's execution unit comprising 32 threads. All threads within a warp execute the same instruction simultaneously (SIMT). Much of CUDA optimization stems from understanding warp-level behavior.

1.6 Memory Hierarchy

Latency (low to high)
Register File   : ~1 cycle,    size: 64 KB/SM, up to 255 regs/thread
Shared Memory   : ~5 cycles,   size: up to 164 KB/SM (Ampere)
L1 Cache        : ~30 cycles,  size: unified with shared memory
L2 Cache        : ~200 cycles, size: 40 MB (A100)
Global Memory   : ~600 cycles, size: 80 GB (HBM2e)

Registers: Exclusively owned by each thread. Fastest. Excess register usage causes spilling to Local Memory (physically global memory).

Shared Memory: Shared among threads in the same block. Must be managed explicitly; physically shares space with L1 Cache.

Global Memory: Accessible by all threads. High latency makes coalesced access patterns essential.


2. CUDA Development Environment

2.1 Installing CUDA Toolkit

Download the appropriate package from https://developer.nvidia.com/cuda-toolkit.

On Ubuntu 22.04:

# Install CUDA keyring
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb

# Update and install
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-3

# Environment variables (add to ~/.bashrc or ~/.zshrc)
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Verify installation:

nvcc --version
# nvcc: NVIDIA (R) Cuda compiler driver
# Cuda compilation tools, release 12.3, V12.3.107

nvidia-smi
# Shows driver version, CUDA version, and GPU status

2.2 nvcc Compiler

nvcc is NVIDIA's CUDA C/C++ compiler. It compiles host code with g++/clang++ and device code to PTX, then to SASS (actual GPU instructions).

# Basic compilation
nvcc -o hello_cuda hello_cuda.cu

# With optimization flags
nvcc -O3 -arch=sm_80 -o matmul matmul.cu

# Debug build
nvcc -G -g -o debug_kernel debug_kernel.cu

# Multi-architecture support
nvcc -gencode arch=compute_75,code=sm_75 \
     -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_86,code=sm_86 \
     -o multi_arch multi_arch.cu

# Key flags
# -arch=sm_XX : target architecture (sm_80 = Ampere A100)
# -O3         : optimization level
# -G          : device-side debug info
# -lineinfo   : line info for profiling
# --ptxas-options=-v : print register/memory usage

2.3 nvidia-smi

# Real-time GPU monitoring
nvidia-smi dmon -s u   # GPU utilization
nvidia-smi dmon -s m   # memory utilization

# Detailed info for a specific GPU
nvidia-smi -q -d MEMORY,UTILIZATION

# Lock GPU clocks for benchmarking
sudo nvidia-smi -pm 1              # Enable persistence mode
sudo nvidia-smi --lock-gpu-clocks=1410

3. CUDA Programming Fundamentals

3.1 Hello CUDA World

#include <cuda_runtime.h>
#include <stdio.h>

// __global__: runs on GPU, called from CPU
__global__ void helloKernel() {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    printf("Hello from Thread %d (Block %d, Thread %d)\n",
           tid, blockIdx.x, threadIdx.x);
}

int main() {
    // 2 blocks, 4 threads per block = 8 total threads
    helloKernel<<<2, 4>>>();
    cudaDeviceSynchronize();
    return 0;
}

CUDA function qualifiers:

  • __global__: runs on GPU, called from CPU (kernel)
  • __device__: runs on GPU, called from GPU only
  • __host__: runs on CPU, called from CPU (default)
  • __host__ __device__: can run on both

3.2 Thread Indexing

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void indexingDemo() {
    // 1D indexing
    int global_tid_1d = threadIdx.x + blockIdx.x * blockDim.x;

    // 2D indexing
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int global_tid_2d = row * (gridDim.x * blockDim.x) + col;

    // 3D indexing
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int z = threadIdx.z + blockIdx.z * blockDim.z;
    int width  = gridDim.x * blockDim.x;
    int height = gridDim.y * blockDim.y;
    int global_tid_3d = z * (width * height) + y * width + x;

    if (global_tid_1d == 0) {
        printf("Block dims: (%d, %d, %d)\n",
               blockDim.x, blockDim.y, blockDim.z);
        printf("Grid dims:  (%d, %d, %d)\n",
               gridDim.x, gridDim.y, gridDim.z);
    }
}

int main() {
    dim3 block(4, 4, 1);
    dim3 grid(2, 2, 1);
    indexingDemo<<<grid, block>>>();
    cudaDeviceSynchronize();
    return 0;
}

3.3 Vector Addition

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define CUDA_CHECK(err) \
    do { \
        cudaError_t e = (err); \
        if (e != cudaSuccess) { \
            fprintf(stderr, "CUDA error %s:%d: %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(e)); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

__global__ void vectorAdd(const float* A, const float* B, float* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    const int N = 1 << 20;  // 1M elements
    const size_t bytes = N * sizeof(float);

    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);

    for (int i = 0; i < N; i++) {
        h_A[i] = (float)rand() / RAND_MAX;
        h_B[i] = (float)rand() / RAND_MAX;
    }

    float *d_A, *d_B, *d_C;
    CUDA_CHECK(cudaMalloc(&d_A, bytes));
    CUDA_CHECK(cudaMalloc(&d_B, bytes));
    CUDA_CHECK(cudaMalloc(&d_C, bytes));

    CUDA_CHECK(cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice));

    int blockSize = 256;
    int gridSize = (N + blockSize - 1) / blockSize;
    vectorAdd<<<gridSize, blockSize>>>(d_A, d_B, d_C, N);

    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost));

    float maxError = 0.0f;
    for (int i = 0; i < N; i++) {
        maxError = fmaxf(maxError, fabsf(h_C[i] - (h_A[i] + h_B[i])));
    }
    printf("Max error: %e\n", maxError);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}

3.4 Matrix Multiplication (Naive)

#include <cuda_runtime.h>
#include <stdio.h>

#define N 1024

__global__ void matMulNaive(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++) {
            sum += A[row * n + k] * B[k * n + col];
        }
        C[row * n + col] = sum;
    }
}

int main() {
    int n = N;
    size_t bytes = n * n * sizeof(float);

    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);

    for (int i = 0; i < n * n; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 1.0f;
    }

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    dim3 blockDim(16, 16);
    dim3 gridDim((n + blockDim.x - 1) / blockDim.x,
                 (n + blockDim.y - 1) / blockDim.y);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    matMulNaive<<<gridDim, blockDim>>>(d_A, d_B, d_C, n);
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);

    printf("Matrix multiplication (%dx%d): %.3f ms\n", n, n, milliseconds);

    double flops = 2.0 * n * n * n;
    double gflops = flops / (milliseconds * 1e6);
    printf("Performance: %.2f GFLOPS\n", gflops);

    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    printf("C[0][0] = %.1f (expected %.1f)\n", h_C[0], (float)n);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}

4. CUDA Memory Management

4.1 cudaMalloc / cudaFree / cudaMemcpy

// Basic memory operations
float *d_ptr;
cudaMalloc(&d_ptr, size);
cudaMemset(d_ptr, 0, size);
cudaMemcpy(d_ptr, h_ptr, size, cudaMemcpyHostToDevice);
cudaMemcpy(h_ptr, d_ptr, size, cudaMemcpyDeviceToHost);
cudaMemcpy(d_dst, d_src, size, cudaMemcpyDeviceToDevice);
cudaFree(d_ptr);

// 2D memory (useful for matrices)
float *d_matrix;
size_t pitch;
cudaMallocPitch(&d_matrix, &pitch, width * sizeof(float), height);
cudaMemcpy2D(d_matrix, pitch,
             h_matrix, width * sizeof(float),
             width * sizeof(float), height,
             cudaMemcpyHostToDevice);

4.2 Unified Memory

Unified Memory lets CPU and GPU share the same virtual address space.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void incrementKernel(int* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        data[idx] += 1;
    }
}

int main() {
    const int N = 1 << 20;
    int *data;

    // Unified Memory allocation — accessible from both CPU and GPU
    cudaMallocManaged(&data, N * sizeof(int));

    // Initialize on CPU (no cudaMemcpy needed)
    for (int i = 0; i < N; i++) {
        data[i] = i;
    }

    // Process on GPU
    incrementKernel<<<(N + 255) / 256, 256>>>(data, N);
    cudaDeviceSynchronize();

    // Check results on CPU (no cudaMemcpy needed)
    printf("data[0] = %d (expected 1)\n", data[0]);
    printf("data[N-1] = %d (expected %d)\n", data[N-1], N);

    cudaFree(data);
    return 0;
}

On Ampere and later, use prefetch hints to improve performance:

// Move data to GPU ahead of kernel launch
cudaMemPrefetchAsync(data, N * sizeof(int), deviceId);
// Move data back to CPU
cudaMemPrefetchAsync(data, N * sizeof(int), cudaCpuDeviceId);
// Memory access hint
cudaMemAdvise(data, N * sizeof(int),
              cudaMemAdviseSetPreferredLocation, deviceId);

4.3 Tiled Matrix Multiplication with Shared Memory

Using shared memory dramatically reduces global memory accesses.

#include <cuda_runtime.h>

#define TILE_SIZE 16

__global__ void matMulTiled(const float* A, const float* B, float* C, int n) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (n + TILE_SIZE - 1) / TILE_SIZE; t++) {
        int aCol = t * TILE_SIZE + threadIdx.x;
        int bRow = t * TILE_SIZE + threadIdx.y;

        tileA[threadIdx.y][threadIdx.x] = (row < n && aCol < n)
            ? A[row * n + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] = (bRow < n && col < n)
            ? B[bRow * n + col] : 0.0f;

        __syncthreads();

        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }

        __syncthreads();
    }

    if (row < n && col < n) {
        C[row * n + col] = sum;
    }
}

Performance improvement:

  • Naive: 2N global memory accesses per thread
  • Tiled: tiles loaded N/TILE_SIZE times, each reused TILE_SIZE times
  • Result: TILE_SIZE-fold reduction in global memory traffic (16x for TILE_SIZE=16)
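The claimed reduction can be sanity-checked by counting loads (in elements):

```python
N, TILE = 1024, 16

# Naive kernel: every thread walks a full row of A and a full column of B
naive_loads = 2 * N**3

# Tiled kernel: each A/B element fetched from global memory is reused
# TILE times out of shared memory
tiled_loads = 2 * N**3 // TILE

print(naive_loads // tiled_loads)  # 16 -> the TILE_SIZE-fold reduction
```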

4.4 Pinned Memory

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

void compareBandwidth(int N) {
    size_t bytes = N * sizeof(float);
    float *h_pageable, *h_pinned;
    float *d_data;

    h_pageable = (float*)malloc(bytes);
    cudaMallocHost(&h_pinned, bytes);  // Page-locked allocation
    cudaMalloc(&d_data, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Measure pageable bandwidth
    cudaEventRecord(start);
    for (int i = 0; i < 10; i++) {
        cudaMemcpy(d_data, h_pageable, bytes, cudaMemcpyHostToDevice);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float pageable_ms;
    cudaEventElapsedTime(&pageable_ms, start, stop);

    // Measure pinned bandwidth
    cudaEventRecord(start);
    for (int i = 0; i < 10; i++) {
        cudaMemcpy(d_data, h_pinned, bytes, cudaMemcpyHostToDevice);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float pinned_ms;
    cudaEventElapsedTime(&pinned_ms, start, stop);

    printf("Pageable: %.2f GB/s\n",
           (bytes * 10.0) / (pageable_ms * 1e6));
    printf("Pinned:   %.2f GB/s\n",
           (bytes * 10.0) / (pinned_ms * 1e6));

    free(h_pageable);
    cudaFreeHost(h_pinned);
    cudaFree(d_data);
}

Pinned memory is locked in physical RAM, preventing OS page-outs. This enables direct DMA transfers without CPU involvement, yielding 2–3x better transfer rates. Excessive use can impact overall system memory performance.


5. Advanced CUDA Optimization

5.1 Memory Coalescing

When all 32 threads in a warp access contiguous memory addresses, the hardware merges them into a single transaction. Non-contiguous (strided) access requires multiple transactions.

// Bad: strided access (non-coalesced)
__global__ void badAccess(float* data, int N, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N / stride) {
        data[idx * stride] *= 2.0f;
    }
}

// Good: sequential access (coalesced)
__global__ void goodAccess(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        data[idx] *= 2.0f;
    }
}

// Coalesced matrix transpose using shared memory
#define TILE 32
#define PADDING 1  // Prevent bank conflicts

__global__ void transposeCoalesced(const float* in, float* out,
                                    int rows, int cols) {
    __shared__ float tile[TILE][TILE + PADDING];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Coalesced read from input
    if (x < cols && y < rows) {
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    }
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;

    // Coalesced write to output
    if (x < rows && y < cols) {
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
    }
}

5.2 Shared Memory Bank Conflicts

Shared memory is divided into 32 banks, interleaved at 4-byte granularity. When multiple threads in a warp access the same bank, accesses are serialized.
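The bank-mapping rule can be modeled in a few lines of Python (a sketch: word i maps to bank i % 32, distinct words in one bank serialize, identical words broadcast):

```python
def max_conflict_degree(indices, num_banks=32):
    """Worst-case serialization factor for one warp's shared-memory access.

    4-byte word i lives in bank i % num_banks. Accesses to distinct words
    in the same bank serialize; identical words broadcast for free.
    """
    bank_words = {}
    for idx in indices:
        bank_words.setdefault(idx % num_banks, set()).add(idx)
    return max(len(words) for words in bank_words.values())

print(max_conflict_degree(list(range(32))))             # 1: conflict-free
print(max_conflict_degree([i * 2 for i in range(32)]))  # 2: two-way conflict
print(max_conflict_degree([0] * 32))                    # 1: broadcast
```

Stride-2 access maps words 0 and 32 to the same bank, halving effective shared-memory bandwidth, while all 32 threads reading the same word costs nothing extra.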

// Stride-1 access is conflict-free; stride-2 causes a 2-way conflict
__global__ void bankConflictDemo() {
    __shared__ float shared[64];

    int tid = threadIdx.x;

    // No conflict: thread i accesses shared[i] (one word per bank)
    float val1 = shared[tid];

    // 2-way bank conflict: threads 0 and 16 hit different words in bank 0
    // (shared[0] and shared[32] both map to bank 0)
    float val2 = shared[tid * 2];

    // Broadcast (no conflict even though all threads read shared[0])
    float val3 = shared[0];

    (void)val1; (void)val2; (void)val3;
}

// Solve bank conflicts with padding
__global__ void noBankConflict(float* data) {
    __shared__ float tile[32][33];  // 33 = 32 + 1 padding

    int tid = threadIdx.x;
    int bid = threadIdx.y;

    tile[bid][tid] = data[bid * 32 + tid];
    __syncthreads();

    data[tid * 32 + bid] = tile[bid][tid];
}

5.3 Minimizing Warp Divergence

// Bad: divergence within a warp
__global__ void warpDivergenceBad(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        if (idx % 2 == 0) {
            data[idx] = data[idx] * 2.0f;
        } else {
            data[idx] = data[idx] + 1.0f;
        }
        // Both branches execute sequentially — 2x slower
    }
}

// Good: branch-free
__global__ void noWarpDivergence(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        float mask = (float)(idx % 2);
        data[idx] = data[idx] * (2.0f - mask) + mask;
    }
}
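The identity behind the branch-free version is easy to sanity-check on the host (a quick Python sketch, not CUDA):

```python
def branchy(x, idx):
    # Even indices double, odd indices add one — two divergent paths
    return x * 2.0 if idx % 2 == 0 else x + 1.0

def branch_free(x, idx):
    mask = float(idx % 2)            # 0.0 for even, 1.0 for odd
    return x * (2.0 - mask) + mask   # even: x*2 + 0; odd: x*1 + 1

for i in range(8):
    assert branchy(3.5, i) == branch_free(3.5, i)
print("branch-free identity holds")
```

Every thread in the warp now executes the same instruction sequence, trading a cheap extra multiply-add for the 2x slowdown of serialized branches.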

// Align branch boundaries to warp boundaries
__global__ void warpAlignedBranch(float* even_data, float* odd_data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = idx / 32;

    if (warp_id % 2 == 0) {
        // All threads in this warp take the same branch — no divergence
        if (idx < N) even_data[idx] *= 2.0f;
    } else {
        if (idx < N) odd_data[idx] += 1.0f;
    }
}

5.4 Occupancy Optimization

Occupancy is the ratio of active warps on an SM to the maximum number of warps that SM supports.
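Occupancy is bounded by whichever per-SM resource runs out first: thread slots, registers, or shared memory. The arithmetic can be sketched in Python (resource limits below are A100-like assumptions for illustration; the CUDA occupancy API accounts for more factors):

```python
def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          max_threads_per_sm=2048, regs_per_sm=65536,
                          smem_per_sm=164 * 1024, max_blocks_per_sm=32):
    """Estimate occupancy from per-block resource usage (illustrative only)."""
    by_threads = max_threads_per_sm // threads_per_block
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks_per_sm
    blocks = min(by_threads, by_regs, by_smem, max_blocks_per_sm)
    return blocks * threads_per_block / max_threads_per_sm

# 256 threads/block, 32 registers/thread, 8 KB shared memory per block
print(f"{theoretical_occupancy(256, 32, 8 * 1024):.0%}")
```

Doubling register use from 32 to 64 per thread halves the resident blocks in this model, which is exactly the kind of trade-off `__launch_bounds__` lets you steer.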

#include <cuda_runtime.h>
#include <cuda_occupancy.h>

void checkOccupancy() {
    void* kernelFunc = (void*)matMulTiled;

    int blockSize;
    int minGridSize;

    // Automatically compute optimal block size
    cudaOccupancyMaxPotentialBlockSize(
        &minGridSize, &blockSize,
        kernelFunc, 0, 0
    );

    printf("Optimal block size: %d\n", blockSize);
    printf("Minimum grid size:  %d\n", minGridSize);

    int maxActiveBlocks;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxActiveBlocks, kernelFunc, blockSize, 0);

    int device;
    cudaGetDevice(&device);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    float occupancy =
        (maxActiveBlocks * blockSize / prop.warpSize)
        / (float)(prop.maxThreadsPerMultiProcessor / prop.warpSize);

    printf("Theoretical occupancy: %.1f%%\n", occupancy * 100);
}

// Use __launch_bounds__ to guide compiler register allocation
__global__ __launch_bounds__(256, 4)
void optimizedKernel(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        data[idx] *= 2.0f;
    }
}

5.5 CUDA Streams and Asynchronous Execution

#include <cuda_runtime.h>

#define NUM_STREAMS 4
#define CHUNK_SIZE (1 << 20)

void asyncProcessing(float* h_in, float* h_out, int N) {
    cudaStream_t streams[NUM_STREAMS];
    float *d_in[NUM_STREAMS], *d_out[NUM_STREAMS];

    for (int i = 0; i < NUM_STREAMS; i++) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&d_in[i],  CHUNK_SIZE * sizeof(float));
        cudaMalloc(&d_out[i], CHUNK_SIZE * sizeof(float));
    }

    int numChunks = (N + CHUNK_SIZE - 1) / CHUNK_SIZE;

    for (int chunk = 0; chunk < numChunks; chunk++) {
        int streamId = chunk % NUM_STREAMS;
        int offset = chunk * CHUNK_SIZE;
        int size = min(CHUNK_SIZE, N - offset);

        // Overlapping transfers and kernel execution across streams
        cudaMemcpyAsync(d_in[streamId], h_in + offset,
                        size * sizeof(float),
                        cudaMemcpyHostToDevice, streams[streamId]);

        // Kernel launched asynchronously in stream
        int blockSize = 256;
        int gridSize = (size + blockSize - 1) / blockSize;
        // myKernel<<<gridSize, blockSize, 0, streams[streamId]>>>(...);

        cudaMemcpyAsync(h_out + offset, d_out[streamId],
                        size * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[streamId]);
    }

    for (int i = 0; i < NUM_STREAMS; i++) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
        cudaFree(d_in[i]);
        cudaFree(d_out[i]);
    }
}

Note that cudaMemcpyAsync only overlaps with computation when the host buffers are pinned (cudaMallocHost); with pageable memory such as h_in and h_out above, the copy silently degrades to synchronous behavior.

Cross-stream synchronization:

cudaEvent_t event;
cudaEventCreate(&event);

// Record event in stream0
cudaEventRecord(event, stream0);

// stream1 waits for the event
cudaStreamWaitEvent(stream1, event, 0);

5.6 Performance Measurement with cudaEvent

// `kernel` here is a host-side launcher wrapping the actual <<<...>>> launch
float measureKernelTime(void (*kernel)(float*, int),
                         float* d_data, int N, int runs) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up
    kernel(d_data, N);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < runs; i++) {
        kernel(d_data, N);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float totalMs = 0;
    cudaEventElapsedTime(&totalMs, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return totalMs / runs;
}

6. cuBLAS — Linear Algebra Library

cuBLAS is NVIDIA's high-performance BLAS implementation. It automatically leverages Tensor Cores for far superior performance compared to hand-written kernels.

Reference: https://docs.nvidia.com/cuda/cublas/

6.1 Basic Setup

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define CUBLAS_CHECK(err) \
    do { \
        cublasStatus_t e = (err); \
        if (e != CUBLAS_STATUS_SUCCESS) { \
            fprintf(stderr, "cuBLAS error: %d at %s:%d\n", \
                    e, __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

int main() {
    cublasHandle_t handle;
    CUBLAS_CHECK(cublasCreate(&handle));

    // cuBLAS uses column-major order (Fortran-style)
    // Be careful when working with C++ row-major matrices

    CUBLAS_CHECK(cublasDestroy(handle));
    return 0;
}

6.2 SGEMM — Single-Precision Matrix Multiplication

cuBLAS uses column-major storage. For a row-major C = A * B, compute C^T = B^T * A^T in column-major terms.
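The trick can be verified with numpy: a row-major buffer reinterpreted as column-major is exactly the transpose, so swapping the operands gives the desired result with no explicit transposition:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 4, 5, 3
A = rng.standard_normal((m, k)).astype(np.float32)
B = rng.standard_normal((k, n)).astype(np.float32)

# A row-major (m, k) buffer read as column-major is exactly A^T, so asking
# a column-major GEMM for C^T = B^T @ A^T leaves row-major C = A @ B in memory.
C_T = B.T @ A.T
assert np.allclose(C_T.T, A @ B)
print("row-major C = A @ B recovered from column-major B^T @ A^T")
```

This is why the cublasSgemm call below passes d_B first and swaps m and n.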

// C = alpha * A * B + beta * C
// A: m x k,  B: k x n,  C: m x n
void cublasMatMul(cublasHandle_t handle,
                   int m, int n, int k,
                   const float* d_A, const float* d_B, float* d_C) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;

    cublasSgemm(handle,
                CUBLAS_OP_N,
                CUBLAS_OP_N,
                n, m, k,
                &alpha,
                d_B, n,
                d_A, k,
                &beta,
                d_C, n);
}

int main() {
    int m = 1024, n = 1024, k = 1024;
    size_t bytesA = m * k * sizeof(float);
    size_t bytesB = k * n * sizeof(float);
    size_t bytesC = m * n * sizeof(float);

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytesA);
    cudaMalloc(&d_B, bytesB);
    cudaMalloc(&d_C, bytesC);
    cudaMemset(d_A, 0, bytesA);
    cudaMemset(d_B, 0, bytesB);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetMathMode(handle, CUBLAS_DEFAULT_MATH);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cublasMatMul(handle, m, n, k, d_A, d_B, d_C);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    double tflops = 2.0 * m * n * k / (ms * 1e9);
    printf("cuBLAS SGEMM: %.2f ms, %.2f TFLOPS\n", ms, tflops);

    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}

6.3 Batched GEMM

Perform multiple small matrix multiplications in a single call:

void batchedMatMul(cublasHandle_t handle,
                    int m, int n, int k, int batchCount,
                    float** d_Aarray, float** d_Barray, float** d_Carray) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;

    cublasSgemmBatched(handle,
                       CUBLAS_OP_N, CUBLAS_OP_N,
                       n, m, k,
                       &alpha,
                       (const float**)d_Barray, n,
                       (const float**)d_Aarray, k,
                       &beta,
                       d_Carray, n,
                       batchCount);
}

// Strided batched (contiguous memory layout)
void stridedBatchedMatMul(cublasHandle_t handle,
                           int m, int n, int k, int batchCount,
                           float* d_A, float* d_B, float* d_C) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;

    long long strideA = (long long)m * k;
    long long strideB = (long long)k * n;
    long long strideC = (long long)m * n;

    cublasSgemmStridedBatched(handle,
                               CUBLAS_OP_N, CUBLAS_OP_N,
                               n, m, k,
                               &alpha,
                               d_B, n, strideB,
                               d_A, k, strideA,
                               &beta,
                               d_C, n, strideC,
                               batchCount);
}
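In numpy terms (an illustrative analogy, not the cuBLAS API), the strided layout is just a contiguous (batch, m, k) array, and the whole batched call is one broadcasted matmul:

```python
import numpy as np

batch, m, n, k = 4, 2, 3, 5
A = np.arange(batch * m * k, dtype=np.float32).reshape(batch, m, k)
B = np.ones((batch, k, n), dtype=np.float32)

# Element strides between consecutive matrices in each contiguous buffer,
# corresponding to the strideA/strideB/strideC arguments above
strideA, strideB, strideC = m * k, k * n, m * n

C = A @ B  # one call multiplies every pair: C[i] = A[i] @ B[i]
assert C.shape == (batch, m, n)
assert np.allclose(C[1], A[1] @ B[1])
print(strideA, strideB, strideC)
```

One strided-batched launch amortizes kernel-launch overhead that would dominate if the small matrices were multiplied one call at a time.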

7. cuDNN — Deep Learning Primitives

cuDNN provides GPU-accelerated implementations of convolution, pooling, normalization, and other deep learning operations. PyTorch, TensorFlow, and most deep learning frameworks use cuDNN as their backend.

Reference: https://docs.nvidia.com/deeplearning/cudnn/

7.1 cuDNN Convolution

#include <cudnn.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define CUDNN_CHECK(err) \
    do { \
        cudnnStatus_t e = (err); \
        if (e != CUDNN_STATUS_SUCCESS) { \
            fprintf(stderr, "cuDNN error: %s at %s:%d\n", \
                    cudnnGetErrorString(e), __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

int main() {
    cudnnHandle_t handle;
    CUDNN_CHECK(cudnnCreate(&handle));

    // Input: N=1, C=3, H=224, W=224 (NCHW format)
    int N = 1, C = 3, H = 224, W = 224;
    // Filter: K=64, C=3, R=3, S=3
    int K = 64, R = 3, S = 3;
    int pad_h = 1, pad_w = 1;
    int stride_h = 1, stride_w = 1;
    int dilation_h = 1, dilation_w = 1;

    cudnnTensorDescriptor_t inputDesc, outputDesc;
    cudnnFilterDescriptor_t filterDesc;
    cudnnConvolutionDescriptor_t convDesc;

    CUDNN_CHECK(cudnnCreateTensorDescriptor(&inputDesc));
    CUDNN_CHECK(cudnnCreateTensorDescriptor(&outputDesc));
    CUDNN_CHECK(cudnnCreateFilterDescriptor(&filterDesc));
    CUDNN_CHECK(cudnnCreateConvolutionDescriptor(&convDesc));

    CUDNN_CHECK(cudnnSetTensor4dDescriptor(
        inputDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, N, C, H, W));

    CUDNN_CHECK(cudnnSetFilter4dDescriptor(
        filterDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, K, C, R, S));

    CUDNN_CHECK(cudnnSetConvolution2dDescriptor(
        convDesc, pad_h, pad_w, stride_h, stride_w,
        dilation_h, dilation_w,
        CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT));

    CUDNN_CHECK(cudnnSetConvolutionMathType(convDesc, CUDNN_TENSOR_OP_MATH));

    // Compute output dimensions
    int out_N, out_C, out_H, out_W;
    CUDNN_CHECK(cudnnGetConvolution2dForwardOutputDim(
        convDesc, inputDesc, filterDesc,
        &out_N, &out_C, &out_H, &out_W));
    printf("Output: %d x %d x %d x %d\n", out_N, out_C, out_H, out_W);

    CUDNN_CHECK(cudnnSetTensor4dDescriptor(
        outputDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
        out_N, out_C, out_H, out_W));

    // Find the best algorithm
    cudnnConvolutionFwdAlgoPerf_t perfResults[10];
    int returnedCount;
    CUDNN_CHECK(cudnnFindConvolutionForwardAlgorithm(
        handle, inputDesc, filterDesc, convDesc, outputDesc,
        10, &returnedCount, perfResults));

    cudnnConvolutionFwdAlgo_t algo = perfResults[0].algo;
    printf("Best algorithm: %d (%.3f ms)\n", algo, perfResults[0].time);

    // Allocate workspace
    size_t workspaceSize;
    CUDNN_CHECK(cudnnGetConvolutionForwardWorkspaceSize(
        handle, inputDesc, filterDesc, convDesc, outputDesc,
        algo, &workspaceSize));

    void* workspace;
    cudaMalloc(&workspace, workspaceSize);

    float *d_input, *d_filter, *d_output;
    cudaMalloc(&d_input,  N * C * H * W * sizeof(float));
    cudaMalloc(&d_filter, K * C * R * S * sizeof(float));
    cudaMalloc(&d_output, out_N * out_C * out_H * out_W * sizeof(float));

    float alpha = 1.0f, beta = 0.0f;
    CUDNN_CHECK(cudnnConvolutionForward(
        handle, &alpha,
        inputDesc, d_input,
        filterDesc, d_filter,
        convDesc, algo, workspace, workspaceSize,
        &beta,
        outputDesc, d_output));

    printf("Convolution completed successfully!\n");

    cudaFree(d_input); cudaFree(d_filter);
    cudaFree(d_output); cudaFree(workspace);
    cudnnDestroyTensorDescriptor(inputDesc);
    cudnnDestroyTensorDescriptor(outputDesc);
    cudnnDestroyFilterDescriptor(filterDesc);
    cudnnDestroyConvolutionDescriptor(convDesc);
    cudnnDestroy(handle);

    return 0;
}

8. Mixed Precision Training

8.1 Numeric Format Comparison

Format | Bits | Exponent | Mantissa | Dynamic Range | Precision
FP32   | 32   | 8        | 23       | ~1e38         | 7 digits
FP16   | 16   | 5        | 10       | ~65504        | 3 digits
BF16   | 16   | 8        | 7        | ~1e38         | 2 digits
TF32   | 19   | 8        | 10       | ~1e38         | 3 digits
FP8    | 8    | 5/4      | 2/3      | limited       | low

FP16 halves memory usage and enables Tensor Cores but has a narrow range prone to overflow/underflow. BF16 matches FP32's exponent range for better numerical stability.
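numpy's float16 makes the overflow/underflow behavior easy to see on the CPU:

```python
import numpy as np

print(np.float16(65504))   # largest finite FP16 value
print(np.float16(70000))   # inf — overflow past the FP16 range
print(np.float16(1e-10))   # 0.0 — underflow: a small gradient simply vanishes

# FP32 represents both values without trouble
print(np.float32(70000.0), np.float32(1e-10))
```

The underflow case is what loss scaling (Section 8.4) exists to prevent: gradients that round to zero in FP16 are unrecoverable.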

8.2 FP16 in CUDA

#include <cuda_fp16.h>
#include <cuda_runtime.h>

// FP16 vector addition
__global__ void fp16VectorAdd(const __half* A, const __half* B,
                               __half* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        C[idx] = __hadd(A[idx], B[idx]);
    }
}

// Vectorized FP16: process two elements at once (assumes N is even)
__global__ void fp16VectorAddVec2(const __half2* A, const __half2* B,
                                   __half2* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N / 2) {
        C[idx] = __hadd2(A[idx], B[idx]);
    }
}

// FP32 <-> FP16 conversion
__global__ void convertF32toF16(const float* f32, __half* f16, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        f16[idx] = __float2half(f32[idx]);
    }
}

__global__ void convertF16toF32(const __half* f16, float* f32, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        f32[idx] = __half2float(f16[idx]);
    }
}

8.3 PyTorch AMP (Automatic Mixed Precision)

import torch
import torch.nn as nn
from torch.amp import autocast, GradScaler

model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 10)
).cuda()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# GradScaler prevents FP16 gradient underflow via loss scaling
scaler = GradScaler()

def train_step(x, y):
    optimizer.zero_grad()

    # autocast automatically casts to FP16/BF16
    with autocast(device_type='cuda', dtype=torch.float16):
        output = model(x)
        loss = criterion(output, y)

    # Scaled backward pass
    scaler.scale(loss).backward()

    # Unscale before gradient clipping
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # step() only updates if gradients are finite
    scaler.step(optimizer)
    scaler.update()

    return loss.item()

# BF16 variant (Ampere and later — no loss scaling needed)
def train_step_bf16(x, y):
    optimizer.zero_grad()
    with autocast(device_type='cuda', dtype=torch.bfloat16):
        output = model(x)
        loss = criterion(output, y)
    loss.backward()
    optimizer.step()
    return loss.item()

batch_size = 128
for epoch in range(10):
    x = torch.randn(batch_size, 1024).cuda()
    y = torch.randint(0, 10, (batch_size,)).cuda()
    loss = train_step(x, y)
    if epoch % 2 == 0:
        print(f"Epoch {epoch}: loss = {loss:.4f}, "
              f"scale = {scaler.get_scale():.0f}")

8.4 Loss Scaling

FP16's 10-bit mantissa means gradients can flush to zero (underflow). Loss scaling multiplies the loss by a large factor before the backward pass, then divides before the optimizer step.

# Manual loss scaling (conceptual illustration)
scale_factor = 2**15  # 32768

with autocast(device_type='cuda'):
    loss = criterion(output, y)

(loss * scale_factor).backward()

for param in model.parameters():
    if param.grad is not None:
        param.grad /= scale_factor

valid = all(
    torch.isfinite(p.grad).all()
    for p in model.parameters()
    if p.grad is not None
)

if valid:
    optimizer.step()
else:
    print("Skipping step: invalid gradients detected")

GradScaler automates this and dynamically adjusts the scale factor (doubles on success, halves on overflow).
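That policy can be sketched as a small state machine (the growth/backoff factors mirror PyTorch's documented defaults; `growth_interval` is shortened here for illustration):

```python
class DynamicLossScaler:
    """Minimal sketch of GradScaler's scale-adjustment policy."""

    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf):
        if found_inf:
            # Overflow: the optimizer step was skipped — halve the scale
            self.scale *= self.backoff_factor
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                self.scale *= self.growth_factor
                self._good_steps = 0

scaler = DynamicLossScaler(init_scale=8.0, growth_interval=3)
for inf in [False, False, False, True, False]:
    scaler.update(inf)
print(scaler.scale)  # 8 -> grows to 16 after 3 good steps -> halves to 8
```

The scale thus hovers just below the largest value that keeps gradients finite, maximizing the mantissa bits spent on small gradients.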


9. CUDA Profiling and Debugging

9.1 Nsight Systems

Nsight Systems provides system-level profiling:

# Command-line profiling
nsys profile --trace=cuda,nvtx,osrt \
             --output=profile_output \
             ./my_application

# Open in GUI
nsys-ui profile_output.nsys-rep

# Generate text report
nsys stats profile_output.nsys-rep

9.2 Nsight Compute

Kernel-level deep analysis:

# Profile a specific kernel
ncu --set full \
    --kernel-name matMulTiled \
    --launch-count 1 \
    --output kernel_profile \
    ./my_application

# Collect selected metrics
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
l1tex__t_sector_hit_rate.pct,\
l2cache__hit_rate.pct \
    ./my_application

9.3 NVTX Markers for Profiling Ranges

#include <nvtx3/nvToolsExt.h>

void profiledFunction(float* d_data, int N) {
    nvtxRangePushA("Data Transfer H2D");
    cudaMemcpy(d_data, h_data, N * sizeof(float),
               cudaMemcpyHostToDevice);
    nvtxRangePop();

    nvtxRangePushA("Kernel Execution");
    myKernel<<<gridDim, blockDim>>>(d_data, N);
    cudaDeviceSynchronize();
    nvtxRangePop();

    nvtxRangePushA("Data Transfer D2H");
    cudaMemcpy(h_output, d_data, N * sizeof(float),
               cudaMemcpyDeviceToHost);
    nvtxRangePop();
}

9.4 CUDA Memory Checking

# Detect out-of-bounds and use-after-free
compute-sanitizer --tool memcheck ./my_application

# Detect race conditions
compute-sanitizer --tool racecheck ./my_application

# Detect uninitialized memory access
compute-sanitizer --tool initcheck ./my_application

# Detect synchronization errors
compute-sanitizer --tool synccheck ./my_application

10. Custom CUDA Deep Learning Operators

10.1 PyTorch Custom CUDA Extension

Writing fused kernels reduces memory round-trips and unlocks performance beyond standard library composition. Reference: https://pytorch.org/tutorials/advanced/cpp_extension.html

Project layout:

custom_op/
  setup.py
  fused_ops.cpp
  fused_ops_kernel.cu

fused_ops_kernel.cu:

#include <cuda_runtime.h>
#include <cuda_fp16.h>

// Fused Bias + ReLU
__global__ void fusedBiasReluKernel(
    float* output,
    const float* input,
    const float* bias,
    int batch_size,
    int features
) {
    int b = blockIdx.x;
    int f = blockIdx.y * blockDim.x + threadIdx.x;

    if (b < batch_size && f < features) {
        float val = input[b * features + f] + bias[f];
        output[b * features + f] = fmaxf(val, 0.0f);
    }
}

// GELU approximation
// GELU(x) = x * Phi(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
__device__ float gelu_approx(float x) {
    const float sqrt_2_over_pi = 0.7978845608f;
    const float coeff = 0.044715f;
    float x3 = x * x * x;
    float inner = sqrt_2_over_pi * (x + coeff * x3);
    return 0.5f * x * (1.0f + tanhf(inner));
}

// Fused Bias + GELU
__global__ void fusedBiasGeluKernel(
    float* output,
    const float* input,
    const float* bias,
    int batch_size,
    int features
) {
    int b = blockIdx.x;
    int f = blockIdx.y * blockDim.x + threadIdx.x;

    if (b < batch_size && f < features) {
        float val = input[b * features + f] + bias[f];
        output[b * features + f] = gelu_approx(val);
    }
}

// Layer Normalization kernel
__global__ void layerNormKernel(
    float* output,
    const float* input,
    const float* gamma,
    const float* beta,
    int batch_size,
    int features,
    float eps
) {
    int b = blockIdx.x;
    if (b >= batch_size) return;

    __shared__ float shared_mean;
    __shared__ float shared_var;

    // Shared accumulators must be zeroed before atomicAdd
    if (threadIdx.x == 0) {
        shared_mean = 0.0f;
        shared_var  = 0.0f;
    }
    __syncthreads();

    // Step 1: Compute mean (reduction)
    float sum = 0.0f;
    for (int f = threadIdx.x; f < features; f += blockDim.x) {
        sum += input[b * features + f];
    }
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        sum += __shfl_down_sync(0xffffffff, sum, offset);
    }
    if (threadIdx.x % warpSize == 0) {
        atomicAdd(&shared_mean, sum);
    }
    __syncthreads();

    float mean = shared_mean / features;

    // Step 2: Compute variance
    float var_sum = 0.0f;
    for (int f = threadIdx.x; f < features; f += blockDim.x) {
        float diff = input[b * features + f] - mean;
        var_sum += diff * diff;
    }
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        var_sum += __shfl_down_sync(0xffffffff, var_sum, offset);
    }
    if (threadIdx.x % warpSize == 0) {
        atomicAdd(&shared_var, var_sum);
    }
    __syncthreads();

    float inv_std = rsqrtf(shared_var / features + eps);

    // Step 3: Normalize and apply scale/shift
    for (int f = threadIdx.x; f < features; f += blockDim.x) {
        float normalized = (input[b * features + f] - mean) * inv_std;
        output[b * features + f] = gamma[f] * normalized + beta[f];
    }
}

// Host-side launchers, callable from fused_ops.cpp (compiled as plain C++):
// the <<<...>>> launch syntax requires nvcc, so launches live in this file
void launchFusedBiasRelu(float* output, const float* input, const float* bias,
                         int batch_size, int features) {
    dim3 block(256);
    dim3 grid(batch_size, (features + 255) / 256);
    fusedBiasReluKernel<<<grid, block>>>(output, input, bias,
                                         batch_size, features);
}

void launchFusedBiasGelu(float* output, const float* input, const float* bias,
                         int batch_size, int features) {
    dim3 block(256);
    dim3 grid(batch_size, (features + 255) / 256);
    fusedBiasGeluKernel<<<grid, block>>>(output, input, bias,
                                         batch_size, features);
}

fused_ops.cpp:

#include <torch/extension.h>
#include <cuda_runtime.h>

// Kernel launchers defined in fused_ops_kernel.cu — the <<<...>>> launch
// syntax requires nvcc, so launches cannot appear in this .cpp file
void launchFusedBiasRelu(float* output, const float* input, const float* bias,
                         int batch_size, int features);
void launchFusedBiasGelu(float* output, const float* input, const float* bias,
                         int batch_size, int features);

torch::Tensor fused_bias_relu(torch::Tensor input, torch::Tensor bias) {
    TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
    TORCH_CHECK(bias.is_cuda(), "bias must be a CUDA tensor");
    TORCH_CHECK(input.is_contiguous(), "input must be contiguous");

    int batch_size = input.size(0);
    int features = input.size(1);
    auto output = torch::empty_like(input);

    launchFusedBiasRelu(
        output.data_ptr<float>(),
        input.data_ptr<float>(),
        bias.data_ptr<float>(),
        batch_size, features
    );

    return output;
}

torch::Tensor fused_bias_gelu(torch::Tensor input, torch::Tensor bias) {
    TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
    TORCH_CHECK(bias.is_cuda(), "bias must be a CUDA tensor");

    int batch_size = input.size(0);
    int features = input.size(1);
    auto output = torch::empty_like(input);

    launchFusedBiasGelu(
        output.data_ptr<float>(),
        input.data_ptr<float>(),
        bias.data_ptr<float>(),
        batch_size, features
    );

    return output;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("fused_bias_relu", &fused_bias_relu, "Fused Bias + ReLU (CUDA)");
    m.def("fused_bias_gelu", &fused_bias_gelu, "Fused Bias + GELU (CUDA)");
}

setup.py:

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='fused_ops',
    ext_modules=[
        CUDAExtension(
            name='fused_ops',
            sources=[
                'fused_ops.cpp',
                'fused_ops_kernel.cu',
            ],
            extra_compile_args={
                'cxx': ['-O3'],
                'nvcc': [
                    '-O3',
                    '-arch=sm_80',
                    '--use_fast_math',
                    '-Xptxas', '-v',
                ],
            }
        )
    ],
    cmdclass={
        'build_ext': BuildExtension
    }
)

Build and use:

cd custom_op
python setup.py install

Then, from Python:

import torch
import fused_ops

batch_size, features = 128, 4096
x = torch.randn(batch_size, features, device='cuda')
bias = torch.randn(features, device='cuda')

# Reference: separate ops
y_ref = torch.relu(x + bias)

# Custom fused op
y_custom = fused_ops.fused_bias_relu(x, bias)

print(f"Max diff: {(y_ref - y_custom).abs().max():.6f}")

import time

def benchmark(fn, *args, warmup=10, runs=100):
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        fn(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000

ref_time    = benchmark(lambda x, b: torch.relu(x + b), x, bias)
custom_time = benchmark(fused_ops.fused_bias_relu, x, bias)

print(f"PyTorch baseline:  {ref_time:.3f} ms")
print(f"Custom fused op:   {custom_time:.3f} ms")
print(f"Speedup:           {ref_time / custom_time:.2f}x")

10.2 Flash Attention Concept

Flash Attention reduces attention memory usage from O(N²) to O(N) by tiling the computation to keep the attention matrix in shared memory rather than global memory:
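The core trick is the online (running) softmax: the max and normalizer are updated tile by tile, with previously accumulated values rescaled whenever the max grows, so the full N×N score matrix is never materialized. A numpy sketch of the update, checked against the full softmax (here the scores are kept only to verify the result; the real kernel rescales its output accumulator instead of storing them):

```python
import numpy as np

def online_softmax(scores, block=4):
    """Softmax computed one tile at a time, Flash-Attention style."""
    m_run = -np.inf   # running max
    l_run = 0.0       # running sum of exp(score - m_run)
    for start in range(0, len(scores), block):
        tile = scores[start:start + block]
        m_new = max(m_run, tile.max())
        # Rescale the old normalizer to the new max, then add this tile
        l_run = l_run * np.exp(m_run - m_new) + np.exp(tile - m_new).sum()
        m_run = m_new
    return np.exp(scores - m_run) / l_run

x = np.random.default_rng(0).standard_normal(10)
full = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), full)
print("online softmax matches full softmax")
```

Because each tile only needs the running (m, l) pair, the attention scores for a query block can live entirely in shared memory.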

// Simplified Flash Attention Forward (illustrative)
__global__ void flashAttentionForward(
    float* output,
    const float* Q,
    const float* K,
    const float* V,
    float* L,
    int B, int H, int N, int D,
    int BLOCK_N
) {
    int b = blockIdx.z;
    int h = blockIdx.y;
    int q_block = blockIdx.x;
    int q_start = q_block * BLOCK_N;

    extern __shared__ float smem[];
    float* q_tile = smem;
    float* k_tile = smem + BLOCK_N * D;
    float* v_tile = k_tile + BLOCK_N * D;

    // Load Q tile
    for (int i = threadIdx.x; i < BLOCK_N * D; i += blockDim.x) {
        int qi = i / D, di = i % D;
        int q_idx = q_start + qi;
        q_tile[i] = (q_idx < N)
            ? Q[((b * H + h) * N + q_idx) * D + di] : 0.0f;
    }

    float m_i = -INFINITY;   // running max for this thread's query row
    float l_i = 0.0f;        // running softmax denominator

    // Iterate over K/V tiles
    for (int kv_block = 0; kv_block < (N + BLOCK_N - 1) / BLOCK_N; kv_block++) {
        int kv_start = kv_block * BLOCK_N;

        for (int i = threadIdx.x; i < BLOCK_N * D; i += blockDim.x) {
            int ki = i / D, di = i % D;
            int k_idx = kv_start + ki;
            k_tile[i] = (k_idx < N)
                ? K[((b * H + h) * N + k_idx) * D + di] : 0.0f;
        }
        __syncthreads();

        int tid = threadIdx.x;
        if (tid < BLOCK_N && (q_start + tid) < N) {
            float m_new = m_i;
            for (int k = 0; k < BLOCK_N && (kv_start + k) < N; k++) {
                float dot = 0.0f;
                for (int d = 0; d < D; d++) {
                    dot += q_tile[tid * D + d] * k_tile[k * D + d];
                }
                dot /= sqrtf((float)D);
                m_new = fmaxf(m_new, dot);
            }
            // The numerically stable running-softmax update would continue
            // here: rescale l_i (and the output accumulator) by
            // expf(m_i - m_new), then accumulate this tile's contribution.
            m_i = m_new;
        }
        __syncthreads();
    }
}

For production use, see the open-source Flash Attention implementation at https://github.com/Dao-AILab/flash-attention.


11. Optimization Checklist

11.1 Memory

  1. Verify global memory coalescing: warp threads should access contiguous addresses
  2. Use shared memory: cache frequently reused data
  3. Eliminate bank conflicts: add padding to shared memory arrays
  4. Prevent register spilling: check with --ptxas-options=-v
  5. Batch H2D/D2H transfers: overlap with computation using streams

11.2 Execution

  1. Maximize occupancy: use cudaOccupancyMaxPotentialBlockSize
  2. Eliminate warp divergence: align branch boundaries with warp boundaries
  3. Stream pipelining: overlap data transfers with kernel execution
  4. Kernel fusion: merge multi-step operations to reduce memory round-trips
  5. Mixed precision: use FP16/BF16 to activate Tensor Cores

11.3 Library Usage

  1. cuBLAS: always prefer cuBLAS for matrix operations over manual kernels
  2. cuDNN: use for all standard deep learning primitives
  3. cuSPARSE: sparse matrix operations
  4. Thrust: parallel algorithms (sort, reduce, scan)
  5. CUB: block/warp-level collective operation primitives

12. Reference Benchmarks

A100 80 GB GPU, approximate figures:

Operation           | Size        | Time   | Performance
Vector add (FP32)   | 1M elements | 0.1 ms | 80 GB/s
Naive matmul        | 1024x1024   | 8 ms   | 2.6 TFLOPS
Tiled matmul        | 1024x1024   | 2 ms   | 10.4 TFLOPS
cuBLAS SGEMM        | 1024x1024   | 0.3 ms | 72 TFLOPS
cuBLAS HGEMM        | 4096x4096   | 1.5 ms | 183 TFLOPS
Convolution (cuDNN) | 224x224x64  | 0.5 ms | —

References