Split View: cuDNN 완전 해부: 딥러닝 연산이 GPU에서 빛처럼 빠른 이유

cuDNN 완전 해부: 딥러닝 연산이 GPU에서 빛처럼 빠른 이유

들어가며: cuDNN이 없다면 무슨 일이 벌어지는가
1. cuDNN이 왜 필요한가: Raw CUDA vs 최적화 라이브러리
2. 컨볼루션의 4가지 알고리즘: cuDNN의 핵심
3. cuDNN Auto-Tuner: benchmark 모드의 진실
4. 배치 정규화와 커널 퓨전: 메모리 트래픽 감소의 마법
5. Attention 메커니즘과 FlashAttention: cuDNN의 한계를 넘어서
6. 메모리 레이아웃: NCHW vs NHWC
7. TensorRT: cuDNN의 다음 단계
8. 실제 LLM에서 cuDNN 호출 추적
9. 정리: cuDNN이 만드는 성능 차이의 원천

들어가며: cuDNN이 없다면 무슨 일이 벌어지는가

PyTorch에서 torch.nn.Conv2d를 호출할 때 내부에서 무슨 일이 벌어지는지 아는가? 단순히 "GPU에서 컨볼루션을 실행한다"가 전부가 아니다. cuDNN이 알아서 알고리즘을 선택하고, 메모리 레이아웃을 조정하고, 하드웨어에 맞는 최적화 커널을 실행한다.

Raw CUDA 커널로 직접 컨볼루션을 구현하면 cuDNN 대비 10~100배 느리다. 그 이유를 정확히 이해하는 것이 이 글의 목표다.

1. cuDNN이 왜 필요한가: Raw CUDA vs 최적화 라이브러리

CUDA는 범용 병렬 프로그래밍 모델이다. 어떤 연산이든 병렬화할 수 있지만, 딥러닝의 특정 연산(컨볼루션, 배치 정규화, 어텐션)을 위한 전문적인 최적화는 담겨 있지 않다.

cuDNN(CUDA Deep Neural Network library)은 이런 연산들을 위한 수작업으로 튜닝된 커널 라이브러리다. 같은 연산을 수행해도:

Naive CUDA 컨볼루션:       ~5 TFLOPS 달성 (H100 이론값의 0.5%)
cuDNN 최적화 컨볼루션:     ~900 TFLOPS 달성 (H100 이론값의 90%)
차이: ~180배

왜 이런 격차가 생기는가:

알고리즘 선택: 입력 크기에 따라 Direct, im2col+GEMM, Winograd, FFT 중 최적을 자동 선택
메모리 레이아웃 최적화: Tensor Core를 활성화하는 NHWC 레이아웃 사용
커널 퓨전: 여러 연산을 하나의 커널로 합쳐 메모리 왕복 횟수 감소
워프 수준 최적화: 메모리 코얼레싱, 레지스터 재사용 등 HW 특성 완전 활용

PyTorch와 TensorFlow는 cuDNN을 기반 라이브러리로 사용한다:

import torch

# 이 코드는 내부적으로 cuDNN을 호출
conv = torch.nn.Conv2d(64, 128, kernel_size=3, padding=1).cuda()
x = torch.randn(32, 64, 56, 56, device='cuda')
y = conv(x)  # → cudnnConvolutionForward() 호출

# cuDNN 사용 여부 확인
print(torch.backends.cudnn.enabled)   # True
print(torch.backends.cudnn.version())  # 예: 8906

2. 컨볼루션의 4가지 알고리즘: cuDNN의 핵심

컨볼루션은 딥러닝에서 가장 연산 집약적인 연산이다. cuDNN은 4가지 알고리즘을 상황에 맞게 선택한다.

알고리즘 A: Direct Convolution (Naive)

가장 직관적인 구현. 출력의 각 원소를 독립적으로 계산한다:

출력 픽셀 (i, j)에 대해:
  for c_out in range(C_out):
    for c_in in range(C_in):
      for kh in range(K):
        for kw in range(K):
          output[c_out, i, j] += input[c_in, i+kh, j+kw] * filter[c_out, c_in, kh, kw]

총 연산 수: N × C_out × C_in × K^2 × H_out × W_out

장점: 단순함. 단점: 메모리 접근 패턴이 비효율적. 대규모 입력에서 매우 느리다.

알고리즘 B: im2col + GEMM (cuDNN 기본값)

컨볼루션을 행렬 곱셈으로 변환하는 핵심 트릭. cuBLAS의 고도로 최적화된 GEMM을 활용할 수 있다.

im2col 변환:

입력 특징맵 [3×3 이미지, 3×3 커널]:

원본 입력:            im2col 변환 결과:
┌─────────┐           ┌────────────────────────────┐
│ 1  2  3 │  im2col   │ 1  2  4  5  (패치 0, 위치 0,0) │
│ 4  5  6 │ ────────→ │ 2  3  5  6  (패치 1, 위치 0,1) │
│ 7  8  9 │           │ 4  5  7  8  (패치 2, 위치 1,0) │
└─────────┘           │ 5  6  8  9  (패치 3, 위치 1,1) │
                      └────────────────────────────┘
각 행 = 수용 영역(receptive field) 하나의 모든 픽셀값

그 다음: output = filter_matrix × im2col_matrix
          (표준 GEMM → Tensor Core 완전 활용 가능!)

im2col의 단점: 입력 데이터를 재배열하므로 추가 메모리 필요 (입력의 K^2배).

# im2col이 어떻게 동작하는지 Python으로 보는 예시
import numpy as np

def im2col(input, kernel_h, kernel_w, stride=1, pad=0):
    N, C, H, W = input.shape
    out_h = (H + 2*pad - kernel_h) // stride + 1
    out_w = (W + 2*pad - kernel_w) // stride + 1

    # 패딩 적용
    img = np.pad(input, [(0,0),(0,0),(pad,pad),(pad,pad)], mode='constant')

    # im2col 행렬 생성: (C*kernel_h*kernel_w) × (N*out_h*out_w)
    col = np.zeros((N, C, kernel_h, kernel_w, out_h, out_w))
    for j in range(kernel_h):
        jj = j + stride * np.arange(out_h)
        for i in range(kernel_w):
            ii = i + stride * np.arange(out_w)
            col[:, :, j, i, :, :] = img[:, :, jj[:, None], ii[None, :]]

    col = col.transpose(0, 4, 5, 1, 2, 3).reshape(N*out_h*out_w, -1)
    return col
# cuDNN의 실제 im2col은 이보다 훨씬 최적화되어 있음

알고리즘 C: Winograd 알고리즘 (소형 커널 최강자)

3×3 컨볼루션처럼 소형 커널에서 cuDNN이 기본으로 선택하는 알고리즘. Winograd의 최소 필터링 알고리즘(1980)을 딥러닝에 적용한 것이다.

핵심 아이디어: 선형대수 변환으로 곱셈 횟수를 대폭 줄인다.

일반 3×3 컨볼루션 (2×2 출력):
- 입력 패치: 4×4 = 16개 원소
- 커널: 3×3 = 9개 원소
- 출력: 2×2 = 4개 원소
- 필요한 곱셈 수 (naive): 4 × 9 = 36번

Winograd F(2×2, 3×3) 변환 후:
- 변환된 입력: 4×4 = 16개 원소 (선형 변환, 곱셈 없음)
- 원소별 곱셈: 16번 (4×4)
- 역변환: 4×4 → 2×2 (선형 변환, 곱셈 없음)
- 필요한 곱셈 수: 16번
- 절감: 36 → 16 = 2.25배 감소

수식으로 표현하면:

Y = A^T [(G × g × G^T) ⊙ (B^T × d × B)] A

여기서:
d = 입력 타일 (4×4)
g = 3×3 커널
B, G, A = 고정된 변환 행렬 (사전 계산됨)
⊙ = 원소별 곱셈 (하다마르 곱)

핵심: G × g × G^T는 커널에 대해 한 번만 계산 (추론 시 미리 계산 가능)
      B^T × d × B는 각 입력 타일에 대해 계산
      둘을 원소별로 곱하면 36번 대신 16번의 곱셈으로 동일한 결과

cuDNN은 3×3, stride=1 컨볼루션에서 자동으로 Winograd를 선택한다. ResNet, VGG 등 대부분의 CNN이 3×3 컨볼루션을 주로 사용하기 때문에 실전에서 매우 중요하다.

알고리즘 D: FFT 기반 컨볼루션 (대형 커널)

7×7, 11×11처럼 대형 커널에는 주파수 영역 컨볼루션이 효율적이다.

시간 영역 컨볼루션:   O(N × K^2)  (N = 출력 크기, K = 커널 크기)
주파수 영역 컨볼루션: O(N × log N) (FFT 후 원소별 곱셈)

K가 클수록 FFT 방식이 유리:
K=3:  9 vs log(N) ≈ 7  → 큰 차이 없음
K=11: 121 vs log(N) ≈ 7 → FFT가 17배 적은 곱셈

3. cuDNN Auto-Tuner: benchmark 모드의 진실

# 이 한 줄이 실제로 하는 일은?
torch.backends.cudnn.benchmark = True

benchmark=False (기본값):

cuDNN이 입력 크기 기반으로 알고리즘을 휴리스틱으로 선택
첫 실행부터 빠르지만 최적이 아닐 수 있음

benchmark=True:

첫 번째 forward pass에서 cudnnFindConvolutionForwardAlgorithm()을 호출
현재 입력 크기에 대해 사용 가능한 모든 알고리즘을 실제로 실행하고 벤치마킹
가장 빠른 알고리즘을 선택해 캐시에 저장
이후 같은 입력 크기에서는 캐시된 알고리즘 사용

# benchmark=True의 효과를 직접 확인
import torch
import time

torch.backends.cudnn.benchmark = False
model = resnet50().cuda()
x = torch.randn(32, 3, 224, 224).cuda()

# 첫 번째 실행 (알고리즘 휴리스틱 선택)
t0 = time.time()
for _ in range(10): y = model(x)
torch.cuda.synchronize()
print(f"benchmark=False: {(time.time()-t0)*1000:.0f}ms")

torch.backends.cudnn.benchmark = True
# 첫 번째 실행에서 벤치마킹 (느림!)
t0 = time.time()
y = model(x)  # 이 한 번이 오래 걸림 (~수 초)
torch.cuda.synchronize()
print(f"첫 번째 실행 (벤치마킹 포함): {(time.time()-t0)*1000:.0f}ms")

# 이후 실행 (캐시된 최적 알고리즘 사용)
t0 = time.time()
for _ in range(10): y = model(x)
torch.cuda.synchronize()
print(f"benchmark=True (워밍업 후): {(time.time()-t0)*1000:.0f}ms")
# 보통 20-40% 빠름

주의: 입력 크기가 매 배치마다 달라지면 benchmark=True가 오히려 느릴 수 있다. 매번 새로운 벤치마킹이 필요하기 때문. 학습 데이터의 입력 크기가 고정된 경우에만 사용하라.

4. 배치 정규화와 커널 퓨전: 메모리 트래픽 감소의 마법

배치 정규화(Batch Normalization)는 겉보기에 단순하다:

y = gamma * (x - mean(x)) / sqrt(var(x) + eps) + beta

하지만 이것을 별도의 커널로 구현하면 치명적으로 비효율적이다:

Naive 구현의 메모리 왕복:
1. x 로드 (HBM에서 읽기)
2. mean 계산 → HBM에 저장
3. x 로드 + mean 로드 (2회 HBM 읽기)
4. variance 계산 → HBM에 저장
5. x 로드 + mean 로드 + var 로드 (3회 HBM 읽기)
6. 정규화 계산 → HBM에 저장
7. ReLU: HBM 읽기 → 계산 → HBM 쓰기

총: HBM 접근 약 10회

cuDNN은 BN + ReLU를 단일 커널로 퓨전한다:

Fused BN+ReLU 커널:
1. x를 레지스터/공유 메모리에 로드 (HBM 읽기 1회)
2. 레지스터에서 mean, variance 계산 (HBM 접근 없음)
3. 레지스터에서 정규화 (HBM 접근 없음)
4. 레지스터에서 ReLU (HBM 접근 없음)
5. 결과를 HBM에 저장 (HBM 쓰기 1회)

총: HBM 접근 2회 (5배 감소!)

# PyTorch에서 fused BN+ReLU 사용
import torch.nn as nn

# 기본 방식: Conv → BN → ReLU (3개 별도 커널)
class NaiveBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Conv2d(c, c, 3, padding=1)
        self.bn = nn.BatchNorm2d(c)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)    # 별도 cuDNN 커널
        x = self.relu(x)  # 별도 elementwise 커널
        return x

# cuDNN fused: torch.backends.cudnn.benchmark=True 환경에서
# cuDNN이 자동으로 conv+bn+relu를 하나의 커널로 최적화

최신 cuDNN (v8 이상)은 그래프 API를 통해 더 복잡한 퓨전도 지원한다:

# cuDNN Graph API (PyTorch 2.0+ torch.compile과 연계)
# Conv + BN + ReLU + Residual Add를 단일 커널로 처리
# torch.compile()이 자동으로 이런 퓨전을 수행
model = torch.compile(model)  # 내부적으로 cuDNN Graph API 활용

5. Attention 메커니즘과 FlashAttention: cuDNN의 한계를 넘어서

Transformer의 어텐션 연산을 표준적으로 구현하면:

# Standard Attention — 메모리 비효율적
def standard_attention(Q, K, V, scale):
    # S = Q × K^T : (N, N) 크기의 어텐션 점수 행렬 생성
    S = torch.matmul(Q, K.transpose(-2, -1)) * scale  # N^2 메모리!
    # 소프트맥스
    P = torch.softmax(S, dim=-1)
    # V와 곱하기
    O = torch.matmul(P, V)
    return O

# 문제: S는 O(N^2) 메모리를 차지함
# N=8192 (문서 길이), FP16 기준:
# S 크기 = 8192 * 8192 * 2 bytes = 128MB (헤드 1개당!)

FlashAttention은 cuDNN을 우회하는 완전히 새로운 CUDA 커널로 이 문제를 해결한다.

FlashAttention의 핵심 아이디어: HBM에 N^2 행렬을 쓰지 않는다

표준 어텐션 메모리 흐름:
Q, K, V → HBM 읽기
S = QK^T → HBM 쓰기 (N^2 크기!)
softmax(S) → HBM 쓰기
P × V → HBM 쓰기
총 HBM 접근: O(N^2)

FlashAttention 메모리 흐름:
Q, K, V를 타일 단위로 로드 → 공유 메모리
공유 메모리에서 어텐션 부분합 계산 (N^2 행렬 생성 없음)
최종 출력만 HBM에 저장
총 HBM 접근: O(N)  ← N^2에서 N으로!

타일링 트릭의 핵심:

FlashAttention 타일링:

Q를 블록 Q_1, Q_2, ... Q_Tc로 분할
K와 V를 블록 K_1, V_1, ... K_Tr, V_Tr로 분할

for i in range(Tc):
  Q_i를 SRAM에 로드
  O_i = 0, l_i = 0, m_i = -inf  (소프트맥스 통계 초기화)

  for j in range(Tr):
    K_j, V_j를 SRAM에 로드

    # SRAM에서만 계산 (HBM 접근 없음)
    S_ij = Q_i × K_j^T  (타일 크기의 어텐션 점수)
    m_ij = max(m_i, rowmax(S_ij))  (수치 안정성)
    P_ij = exp(S_ij - m_ij)

    # 온라인 소프트맥스 업데이트 (Numerically stable!)
    O_i = diag(exp(m_i - m_ij)) × O_i + P_ij × V_j
    l_i = exp(m_i - m_ij) × l_i + rowsum(P_ij)
    m_i = m_ij

  # 최종 정규화 후 HBM에 저장
  O_i = diag(l_i)^(-1) × O_i → HBM 쓰기 1회

결과:

메모리 사용량: O(N^2) → O(N) (N=8192 기준 수백 MB 절약)
속도: 2~4배 빠름 (HBM 트래픽 감소 덕분)

6. 메모리 레이아웃: NCHW vs NHWC

딥러닝 텐서의 메모리 레이아웃은 성능에 결정적인 영향을 준다.

NCHW 레이아웃 (batch × channel × height × width):
배치=1, 채널=3(RGB), 4×4 이미지:

메모리 배치:
[R00 R01 R02 R03 | R10 R11 ... | R30 R31 R32 R33 |
 G00 G01 G02 G03 | G10 G11 ... | G30 G31 G32 G33 |
 B00 B01 B02 B03 | B10 B11 ... | B30 B31 B32 B33]

→ 같은 채널의 픽셀들이 연속적으로 배치

NHWC 레이아웃 (batch × height × width × channel):
메모리 배치:
[R00 G00 B00 | R01 G01 B01 | R02 G02 B02 | R03 G03 B03 |
 R10 G10 B10 | R11 G11 B11 | ...
 R30 G30 B30 | R31 G31 B31 | R32 G32 B32 | R33 G33 B33]

→ 같은 공간 위치의 모든 채널이 연속적으로 배치

왜 Tensor Core는 NHWC를 선호하는가:

Tensor Core는 16×16 행렬 타일을 처리한다. NCHW에서 컨볼루션을 위한 16×16 타일을 구성하면 채널 방향으로 불연속적인 메모리 접근이 발생한다. NHWC에서는 채널이 연속적이라 타일 로딩이 연속 메모리 읽기가 된다.

# PyTorch에서 NHWC 사용
x_nchw = torch.randn(32, 64, 56, 56, device='cuda')

# NHWC로 변환 (channels_last 포맷)
x_nhwc = x_nchw.to(memory_format=torch.channels_last)

# channels_last 형식으로 모델 변환
model = model.to(memory_format=torch.channels_last)

# 이 상태에서 forward pass는 cuDNN이 NHWC 커널을 자동 선택
output = model(x_nhwc)

# 벤치마크 비교
import time
model_nchw = resnet50().cuda()
model_nhwc = resnet50().cuda().to(memory_format=torch.channels_last)
x = torch.randn(64, 3, 224, 224).cuda()
x_nhwc = x.to(memory_format=torch.channels_last)

# 실측 결과 (H100 기준):
# NCHW: ~12ms/batch
# NHWC: ~9ms/batch  (약 25% 빠름)

7. TensorRT: cuDNN의 다음 단계

TensorRT는 cuDNN 위에 구축된 추론 최적화 엔진이다. 학습 완료된 모델을 배포 환경에서 최대 성능으로 실행한다.

TensorRT의 최적화 파이프라인:

원본 모델 (ONNX/PyTorch)
      ↓
  그래프 분석
      ↓
  레이어 퓨전
  ┌─────────────────────────────────────┐
  │ Conv → BN → ReLU → Conv → BN → ReLU│
  │       ↓ 퓨전 후                     │
  │     단일 최적화 커널                 │
  └─────────────────────────────────────┘
      ↓
  정밀도 선택 (FP32 → FP16 → INT8)
      ↓
  커널 자동선택 (입력 크기별 벤치마킹)
      ↓
  최적화된 실행 엔진

정밀도 단계별 성능 비교 (ResNet-50, batch=32, H100 기준):

정밀도	레이턴시	처리량	정확도 손실
FP32	4.2ms	7,600 img/s	기준점
FP16	1.8ms	17,800 img/s	무시할 수준
INT8	0.9ms	35,500 img/s	~0.1% Top-1

INT8 양자화 캘리브레이션:

import tensorrt as trt

# INT8 캘리브레이터 설정
class MyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, data_loader, cache_file):
        super().__init__()
        self.data_loader = iter(data_loader)
        self.cache_file = cache_file
        self.batch_allocation = None

    def get_batch_size(self):
        return 32

    def get_batch(self, names):
        try:
            batch = next(self.data_loader)[0].numpy()
            # 대표 데이터셋으로 activation 범위 캘리브레이션
            # TensorRT가 각 레이어의 FP32 범위를 INT8로 매핑
            if self.batch_allocation is None:
                self.batch_allocation = cuda.mem_alloc(batch.nbytes)
            cuda.memcpy_htod(self.batch_allocation, batch)
            return [int(self.batch_allocation)]
        except StopIteration:
            return None

# 빌더 설정
builder = trt.Builder(logger)
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)  # INT8 활성화
config.int8_calibrator = MyCalibrator(calibration_loader, 'cache.bin')

8. 실제 LLM에서 cuDNN 호출 추적

GPT-2 / LLaMA를 실행할 때 내부에서 어떤 커널들이 호출되는지 추적해보자:

# PyTorch Profiler로 LLM 연산 추적
import torch
from torch.profiler import profile, ProfilerActivity

model = GPT2Model.from_pretrained('gpt2').cuda()
input_ids = torch.randint(0, 50257, (1, 512)).cuda()

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True
) as prof:
    output = model(input_ids)

# 상위 CUDA 커널 출력
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

실제 실행되는 커널들:

GPT-2 Forward Pass 내부 CUDA 커널 (512 토큰, batch=1):

연산              커널                          시간   비율
─────────────────────────────────────────────────────────
Linear (QKV)     cublasSgemm (Tensor Core)      2.1ms  35%
Attention        flash_attn_fwd_kernel           1.4ms  23%
Linear (output)  cublasSgemm (Tensor Core)      0.9ms  15%
LayerNorm        layer_norm_kernel               0.3ms   5%
GELU             vectorized_elementwise_kernel   0.2ms   3%
Residual Add     vectorized_elementwise_kernel   0.1ms   2%
Embedding        Embedding_cuda                  0.1ms   2%
기타                                            0.9ms  15%

총 시간: ~6ms (H100 기준)

핵심 관찰:

Linear 레이어 = cuBLAS GEMM: LLM 연산의 약 50%가 행렬 곱셈
Attention = FlashAttention: cuDNN이 아닌 커스텀 CUDA 커널
LayerNorm = 커스텀 퓨전 커널: 평균 + 분산 + 정규화가 하나의 커널
활성화 함수 = elementwise 커널: 매우 빠름 (메모리 대역폭 제한)

9. 정리: cuDNN이 만드는 성능 차이의 원천

cuDNN이 raw CUDA 대비 10~100배 빠른 이유를 정리하면:

알고리즘 최적화: Winograd, im2col+GEMM 등 수학적으로 더 효율적인 알고리즘 선택
커널 퓨전: 여러 연산을 하나의 커널로 합쳐 HBM 왕복 횟수 최소화
메모리 레이아웃 최적화: NHWC + Tensor Core 활성화로 하드웨어 효율 극대화
Auto-Tuner: 실제 하드웨어에서 벤치마킹으로 입력 크기별 최적 구현 선택
FlashAttention 같은 혁신: O(N^2) → O(N) 메모리 접근으로 어텐션 연산 혁신

PyTorch 한 줄 뒤에는 이런 수십 년의 최적화 연구가 숨어 있다. 이 내부 동작을 이해하는 것이 LLM 서빙 최적화, 커스텀 커널 작성, 하드웨어 선택의 출발점이다.

cuDNN Internals: Why Deep Learning Operations Fly on GPU

Introduction: What Happens Without cuDNN
1. Why cuDNN Exists: Raw CUDA vs Optimized Libraries
2. Four Convolution Algorithms: The Core of cuDNN
3. cuDNN Auto-Tuner: What benchmark Mode Actually Does
4. Batch Normalization and Kernel Fusion: Memory Traffic Reduction
5. Attention and FlashAttention: Going Beyond cuDNN
6. Memory Layout: NCHW vs NHWC
7. TensorRT: Beyond cuDNN for Inference
8. Tracing cuDNN Calls in a Real LLM
9. Summary: The Sources of cuDNN's Performance Advantage

Introduction: What Happens Without cuDNN

When you call torch.nn.Conv2d in PyTorch, do you know what actually happens under the hood? It's not simply "run a convolution on the GPU." cuDNN automatically selects an algorithm, adjusts memory layout, and executes hardware-specific optimized kernels.

Implementing convolution directly with raw CUDA kernels runs 10-100x slower than cuDNN. Understanding exactly why is the goal of this post.

1. Why cuDNN Exists: Raw CUDA vs Optimized Libraries

CUDA is a general-purpose parallel programming model. It can parallelize any computation, but it doesn't include specialized optimizations for the specific operations that dominate deep learning — convolutions, batch normalization, attention.

cuDNN (CUDA Deep Neural Network library) is a hand-tuned kernel library for these operations. The same computation, dramatically different performance:

Naive CUDA convolution:     ~5 TFLOPS achieved (0.5% of H100 theoretical)
cuDNN optimized convolution: ~900 TFLOPS achieved (90% of H100 theoretical)
Difference: ~180x

Why such a gap?

Algorithm selection: Automatically chooses Direct, im2col+GEMM, Winograd, or FFT based on input size
Memory layout optimization: Uses NHWC layout to activate Tensor Cores
Kernel fusion: Combines multiple operations into a single kernel to reduce memory round trips
Warp-level optimization: Fully exploits hardware characteristics — memory coalescing, register reuse

PyTorch and TensorFlow use cuDNN as their foundational library:

import torch

# This code calls cuDNN internally
conv = torch.nn.Conv2d(64, 128, kernel_size=3, padding=1).cuda()
x = torch.randn(32, 64, 56, 56, device='cuda')
y = conv(x)  # → calls cudnnConvolutionForward()

# Verify cuDNN is in use
print(torch.backends.cudnn.enabled)    # True
print(torch.backends.cudnn.version())  # e.g., 8906

2. Four Convolution Algorithms: The Core of cuDNN

Convolution is the most compute-intensive operation in deep learning. cuDNN selects from four algorithms based on the situation.

Algorithm A: Direct Convolution (Naive)

The most straightforward implementation. Computes each output element independently:

For each output pixel (i, j):
  for c_out in range(C_out):
    for c_in in range(C_in):
      for kh in range(K):
        for kw in range(K):
          output[c_out, i, j] += input[c_in, i+kh, j+kw] * filter[c_out, c_in, kh, kw]

Total ops: N × C_out × C_in × K^2 × H_out × W_out

Pros: Simple. Cons: Inefficient memory access patterns. Very slow for large inputs.

Algorithm B: im2col + GEMM (cuDNN Default)

The key trick that transforms convolution into matrix multiplication, enabling cuBLAS's highly optimized GEMM to take over.

The im2col transformation:

Input feature map [3x3 image, 3x3 kernel]:

Original input:        im2col result:
┌─────────┐            ┌────────────────────────────┐
│ 1  2  3 │  im2col    │ 1  2  4  5  (patch 0, pos 0,0) │
│ 4  5  6 │ ─────────→ │ 2  3  5  6  (patch 1, pos 0,1) │
│ 7  8  9 │            │ 4  5  7  8  (patch 2, pos 1,0) │
└─────────┘            │ 5  6  8  9  (patch 3, pos 1,1) │
                       └────────────────────────────┘
Each row = all pixel values in one receptive field

Then: output = filter_matrix × im2col_matrix
      (standard GEMM → Tensor Core fully utilized!)

Downside of im2col: requires additional memory to store the rearranged input (K^2 times the original input size).

# Python illustration of im2col behavior
import numpy as np

def im2col(input, kernel_h, kernel_w, stride=1, pad=0):
    N, C, H, W = input.shape
    out_h = (H + 2*pad - kernel_h) // stride + 1
    out_w = (W + 2*pad - kernel_w) // stride + 1

    # Apply padding
    img = np.pad(input, [(0,0),(0,0),(pad,pad),(pad,pad)], mode='constant')

    # Build im2col matrix: (C*kernel_h*kernel_w) x (N*out_h*out_w)
    col = np.zeros((N, C, kernel_h, kernel_w, out_h, out_w))
    for j in range(kernel_h):
        jj = j + stride * np.arange(out_h)
        for i in range(kernel_w):
            ii = i + stride * np.arange(out_w)
            col[:, :, j, i, :, :] = img[:, :, jj[:, None], ii[None, :]]

    col = col.transpose(0, 4, 5, 1, 2, 3).reshape(N*out_h*out_w, -1)
    return col
# cuDNN's actual im2col is far more optimized than this

Algorithm C: Winograd Algorithm (Champion for Small Kernels)

The algorithm cuDNN defaults to for 3x3 convolutions. It applies Winograd's minimal filtering algorithm (1980) to deep learning.

Core idea: Use linear algebra transformations to drastically reduce the number of multiplications.

Standard 3x3 convolution (2x2 output):
- Input patch: 4x4 = 16 elements
- Kernel: 3x3 = 9 elements
- Output: 2x2 = 4 elements
- Multiplications required (naive): 4 x 9 = 36

After Winograd F(2x2, 3x3) transformation:
- Transformed input: 4x4 = 16 elements (linear transform, no multiplications)
- Element-wise multiplication: 16 operations (4x4)
- Inverse transform: 4x4 → 2x2 (linear transform, no multiplications)
- Multiplications required: 16
- Reduction: 36 → 16 = 2.25x fewer multiplications

Expressed mathematically:

Y = A^T [(G × g × G^T) element-wise-multiply (B^T × d × B)] A

Where:
d = input tile (4x4)
g = 3x3 kernel
B, G, A = fixed transform matrices (precomputed constants)
element-wise-multiply = Hadamard product

Key insight: G × g × G^T is computed once per kernel
             (can be precomputed at inference time)
             B^T × d × B is computed per input tile
             Element-wise product gives 16 multiplications
             instead of 36

cuDNN automatically selects Winograd for 3x3, stride=1 convolutions. Since ResNet, VGG, and most CNNs predominantly use 3x3 convolutions, this matters enormously in practice.

Algorithm D: FFT-Based Convolution (Large Kernels)

For large kernels like 7x7 or 11x11, convolution in the frequency domain is efficient.

Spatial domain convolution:    O(N × K^2)   (N = output size, K = kernel size)
Frequency domain convolution:  O(N × log N) (FFT + element-wise multiply)

The larger K is, the more FFT wins:
K=3:  9 ops vs log(N) ≈ 7  → marginal difference
K=11: 121 ops vs log(N) ≈ 7 → FFT uses 17x fewer multiplications

3. cuDNN Auto-Tuner: What `benchmark` Mode Actually Does

# What does this single line actually do?
torch.backends.cudnn.benchmark = True

benchmark=False (default):

cuDNN selects an algorithm heuristically based on input size
Fast from the first run, but may not be the optimal choice

benchmark=True:

On the first forward pass, calls cudnnFindConvolutionForwardAlgorithm()
Actually runs and benchmarks ALL available algorithms for the current input size
Selects the fastest and caches it
All subsequent runs with the same input size use the cached algorithm

# See the benchmark effect directly
import torch
import time

torch.backends.cudnn.benchmark = False
model = resnet50().cuda()
x = torch.randn(32, 3, 224, 224).cuda()

t0 = time.time()
for _ in range(10): y = model(x)
torch.cuda.synchronize()
print(f"benchmark=False: {(time.time()-t0)*1000:.0f}ms")

torch.backends.cudnn.benchmark = True
# First run includes benchmarking (SLOW!)
t0 = time.time()
y = model(x)  # This single call takes several seconds
torch.cuda.synchronize()
print(f"First run (includes benchmarking): {(time.time()-t0)*1000:.0f}ms")

# Subsequent runs use cached optimal algorithm
t0 = time.time()
for _ in range(10): y = model(x)
torch.cuda.synchronize()
print(f"benchmark=True (after warmup): {(time.time()-t0)*1000:.0f}ms")
# Typically 20-40% faster

Caution: If input sizes vary every batch, benchmark=True can actually be slower — it triggers a new benchmarking pass for each new size. Only use it when training/inference input sizes are fixed.

4. Batch Normalization and Kernel Fusion: Memory Traffic Reduction

Batch Normalization looks simple on paper:

y = gamma * (x - mean(x)) / sqrt(var(x) + eps) + beta

But implementing this as separate kernels is catastrophically inefficient:

Naive implementation memory round trips:
1. Load x from HBM
2. Compute mean → store to HBM
3. Load x + load mean (2 HBM reads)
4. Compute variance → store to HBM
5. Load x + load mean + load var (3 HBM reads)
6. Compute normalization → store to HBM
7. ReLU: load from HBM → compute → store to HBM

Total: ~10 HBM accesses

cuDNN fuses BN + ReLU into a single kernel:

Fused BN+ReLU kernel:
1. Load x into registers/shared memory (1 HBM read)
2. Compute mean, variance in registers (zero HBM access)
3. Normalize in registers (zero HBM access)
4. ReLU in registers (zero HBM access)
5. Store result to HBM (1 HBM write)

Total: 2 HBM accesses (5x reduction!)

# Using fused BN+ReLU in PyTorch
import torch.nn as nn

# Naive: Conv → BN → ReLU (3 separate kernel launches)
class NaiveBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Conv2d(c, c, 3, padding=1)
        self.bn = nn.BatchNorm2d(c)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)    # Separate cuDNN kernel
        x = self.relu(x)  # Separate elementwise kernel
        return x

# With torch.compile(), the compiler automatically fuses these
# through cuDNN Graph API and nvFuser
model = torch.compile(model)  # Auto-fuses conv+bn+relu into one kernel

Modern cuDNN (v8+) supports more complex fusions through its Graph API, and torch.compile() leverages this automatically.

5. Attention and FlashAttention: Going Beyond cuDNN

Standard Transformer attention implemented straightforwardly:

# Standard Attention — memory inefficient
def standard_attention(Q, K, V, scale):
    # S = Q × K^T: creates N×N attention score matrix
    S = torch.matmul(Q, K.transpose(-2, -1)) * scale  # O(N^2) memory!
    # Softmax
    P = torch.softmax(S, dim=-1)
    # Multiply by V
    O = torch.matmul(P, V)
    return O

# Problem: S requires O(N^2) memory
# N=8192 (document length), FP16:
# S size = 8192 * 8192 * 2 bytes = 128MB per head!

FlashAttention solves this with a completely new CUDA kernel that bypasses cuDNN entirely.

FlashAttention's core idea: Never materialize the N×N attention matrix in HBM

Standard attention memory flow:
Q, K, V → read from HBM
S = QK^T → write to HBM (N^2 size!)
softmax(S) → write to HBM
P × V → write to HBM
Total HBM accesses: O(N^2)

FlashAttention memory flow:
Load Q, K, V in tiles → into shared memory
Compute partial attention sums in shared memory
  (never creates full N^2 matrix)
Write only the final output to HBM
Total HBM accesses: O(N)  ← from N^2 down to N!

The tiling trick in detail:

FlashAttention tiling:

Partition Q into blocks Q_1, Q_2, ... Q_Tc
Partition K, V into blocks K_1, V_1, ... K_Tr, V_Tr

for i in range(Tc):
  Load Q_i into SRAM
  O_i = 0, l_i = 0, m_i = -inf  (softmax statistics init)

  for j in range(Tr):
    Load K_j, V_j into SRAM

    # Compute entirely in SRAM (zero HBM access)
    S_ij = Q_i × K_j^T  (tile-sized attention scores)
    m_ij = max(m_i, rowmax(S_ij))  (numerical stability)
    P_ij = exp(S_ij - m_ij)

    # Online softmax update (numerically stable!)
    O_i = diag(exp(m_i - m_ij)) × O_i + P_ij × V_j
    l_i = exp(m_i - m_ij) × l_i + rowsum(P_ij)
    m_i = m_ij

  # Final normalization, then write to HBM (1 write)
  O_i = diag(l_i)^(-1) × O_i → store to HBM

Results:

Memory usage: O(N^2) → O(N) (saves hundreds of MB at N=8192)
Speed: 2-4x faster (thanks to reduced HBM traffic)

6. Memory Layout: NCHW vs NHWC

The memory layout of deep learning tensors has a decisive impact on performance.

NCHW layout (batch × channel × height × width):
Batch=1, channels=3 (RGB), 4x4 image:

Memory arrangement:
[R00 R01 R02 R03 | R10 R11 ... | R30 R31 R32 R33 |
 G00 G01 G02 G03 | G10 G11 ... | G30 G31 G32 G33 |
 B00 B01 B02 B03 | B10 B11 ... | B30 B31 B32 B33]

→ Pixels within the same channel are contiguous

NHWC layout (batch × height × width × channel):
Memory arrangement:
[R00 G00 B00 | R01 G01 B01 | R02 G02 B02 | R03 G03 B03 |
 R10 G10 B10 | R11 G11 B11 | ...
 R30 G30 B30 | R31 G31 B31 | R32 G32 B32 | R33 G33 B33]

→ All channels at the same spatial position are contiguous

Why Tensor Cores prefer NHWC:

Tensor Cores process 16x16 matrix tiles. Building a 16x16 tile for convolution under NCHW results in discontinuous memory accesses across channels. Under NHWC, channels are contiguous, making tile loading a sequential memory read.

# Using NHWC (channels_last) in PyTorch
x_nchw = torch.randn(32, 64, 56, 56, device='cuda')

# Convert to NHWC (channels_last format)
x_nhwc = x_nchw.to(memory_format=torch.channels_last)

# Convert model to channels_last format
model = model.to(memory_format=torch.channels_last)

# Forward pass will auto-select NHWC kernels in cuDNN
output = model(x_nhwc)

# Benchmark comparison
# Measured on H100:
# NCHW: ~12ms/batch
# NHWC: ~9ms/batch  (~25% faster)

7. TensorRT: Beyond cuDNN for Inference

TensorRT is an inference optimization engine built on top of cuDNN. It runs trained models at maximum performance in deployment environments.

TensorRT's optimization pipeline:

Original model (ONNX/PyTorch)
      ↓
  Graph analysis
      ↓
  Layer fusion
  ┌─────────────────────────────────────────────┐
  │ Conv → BN → ReLU → Conv → BN → ReLU        │
  │           ↓ after fusion                    │
  │     single optimized kernel                 │
  └─────────────────────────────────────────────┘
      ↓
  Precision selection (FP32 → FP16 → INT8)
      ↓
  Kernel auto-selection (benchmarked per input size)
      ↓
  Optimized execution engine

Performance across precision levels (ResNet-50, batch=32, H100):

Precision	Latency	Throughput	Accuracy Loss
FP32	4.2ms	7,600 img/s	baseline
FP16	1.8ms	17,800 img/s	negligible
INT8	0.9ms	35,500 img/s	~0.1% Top-1

INT8 quantization with calibration:

import tensorrt as trt

# INT8 calibrator setup
class MyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, data_loader, cache_file):
        super().__init__()
        self.data_loader = iter(data_loader)
        self.cache_file = cache_file
        self.batch_allocation = None

    def get_batch_size(self):
        return 32

    def get_batch(self, names):
        try:
            batch = next(self.data_loader)[0].numpy()
            # Calibrate activation ranges using representative dataset
            # TensorRT maps FP32 range of each layer to INT8
            if self.batch_allocation is None:
                self.batch_allocation = cuda.mem_alloc(batch.nbytes)
            cuda.memcpy_htod(self.batch_allocation, batch)
            return [int(self.batch_allocation)]
        except StopIteration:
            return None

# Builder configuration
builder = trt.Builder(logger)
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)  # Enable INT8
config.int8_calibrator = MyCalibrator(calibration_loader, 'cache.bin')

8. Tracing cuDNN Calls in a Real LLM

Let's trace what kernels actually get called when running GPT-2 or LLaMA:

# Trace LLM operations using PyTorch Profiler
import torch
from torch.profiler import profile, ProfilerActivity

model = GPT2Model.from_pretrained('gpt2').cuda()
input_ids = torch.randint(0, 50257, (1, 512)).cuda()

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True
) as prof:
    output = model(input_ids)

# Print top CUDA kernels
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

What actually runs:

GPT-2 Forward Pass Internal CUDA Kernels (512 tokens, batch=1):

Operation       Kernel                         Time   Share
──────────────────────────────────────────────────────────
Linear (QKV)    cublasSgemm (Tensor Core)      2.1ms  35%
Attention       flash_attn_fwd_kernel          1.4ms  23%
Linear (output) cublasSgemm (Tensor Core)      0.9ms  15%
LayerNorm       layer_norm_kernel              0.3ms   5%
GELU            vectorized_elementwise_kernel  0.2ms   3%
Residual Add    vectorized_elementwise_kernel  0.1ms   2%
Embedding       Embedding_cuda                 0.1ms   2%
Other                                         0.9ms  15%

Total: ~6ms (on H100)

Key observations:

Linear layers = cuBLAS GEMM: About 50% of LLM computation is matrix multiplication
Attention = FlashAttention: A custom CUDA kernel, NOT cuDNN
LayerNorm = custom fused kernel: Mean + variance + normalization in one kernel
Activation functions = elementwise kernels: Very fast (memory bandwidth bound)

9. Summary: The Sources of cuDNN's Performance Advantage

Why cuDNN is 10-100x faster than raw CUDA:

Algorithm optimization: Mathematically more efficient algorithms — Winograd, im2col+GEMM — chosen per situation
Kernel fusion: Multiple operations combined into one kernel to minimize HBM round trips
Memory layout optimization: NHWC + Tensor Core activation for maximum hardware efficiency
Auto-Tuner: Benchmarks actual hardware to select the optimal implementation per input size
FlashAttention-style innovations: O(N^2) → O(N) memory access transforms attention computation

Behind a single line of PyTorch lies decades of optimization research. Understanding these internals is the starting point for LLM serving optimization, writing custom CUDA kernels, and making informed hardware decisions.