cuDNN Internals: Why Deep Learning Operations Fly on GPU

Introduction: What Happens Without cuDNN

When you call torch.nn.Conv2d in PyTorch, do you know what actually happens under the hood? It's not simply "run a convolution on the GPU." cuDNN automatically selects an algorithm, adjusts memory layout, and executes hardware-specific optimized kernels.

Implementing convolution directly with raw CUDA kernels runs 10-100x slower than cuDNN. Understanding exactly why is the goal of this post.


1. Why cuDNN Exists: Raw CUDA vs Optimized Libraries

CUDA is a general-purpose parallel programming model. It can parallelize any computation, but it doesn't include specialized optimizations for the specific operations that dominate deep learning — convolutions, batch normalization, attention.

cuDNN (CUDA Deep Neural Network library) is a hand-tuned kernel library for these operations. The same computation, dramatically different performance:

Naive CUDA convolution:     ~5 TFLOPS achieved (0.5% of H100 theoretical)
cuDNN optimized convolution: ~900 TFLOPS achieved (90% of H100 theoretical)
Difference: ~180x

Why such a gap?

  1. Algorithm selection: Automatically chooses Direct, im2col+GEMM, Winograd, or FFT based on input size
  2. Memory layout optimization: Uses NHWC layout to activate Tensor Cores
  3. Kernel fusion: Combines multiple operations into a single kernel to reduce memory round trips
  4. Warp-level optimization: Fully exploits hardware characteristics — memory coalescing, register reuse

PyTorch and TensorFlow use cuDNN as their foundational library:

import torch

# This code calls cuDNN internally
conv = torch.nn.Conv2d(64, 128, kernel_size=3, padding=1).cuda()
x = torch.randn(32, 64, 56, 56, device='cuda')
y = conv(x)  # → calls cudnnConvolutionForward()

# Verify cuDNN is in use
print(torch.backends.cudnn.enabled)    # True
print(torch.backends.cudnn.version())  # e.g., 8906

2. Four Convolution Algorithms: The Core of cuDNN

Convolution is the most compute-intensive operation in deep learning. cuDNN selects from four algorithms based on the situation.

Algorithm A: Direct Convolution (Naive)

The most straightforward implementation. Computes each output element independently:

For each output pixel (i, j):
  for c_out in range(C_out):
    for c_in in range(C_in):
      for kh in range(K):
        for kw in range(K):
          output[c_out, i, j] += input[c_in, i+kh, j+kw] * filter[c_out, c_in, kh, kw]

Total ops: N × C_out × C_in × K^2 × H_out × W_out

Pros: Simple. Cons: Inefficient memory access patterns. Very slow for large inputs.
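The pseudocode above can be made concrete in a few lines of NumPy. This is a toy sketch (single image, stride 1, no padding; names are illustrative), nothing like a real CUDA kernel, but it computes exactly what direct convolution computes:

```python
import numpy as np

def direct_conv2d(x, w):
    """Direct convolution: x is (C_in, H, W), w is (C_out, C_in, K, K)."""
    C_in, H, W = x.shape
    C_out, _, K, _ = w.shape
    H_out, W_out = H - K + 1, W - K + 1
    y = np.zeros((C_out, H_out, W_out))
    for co in range(C_out):
        for i in range(H_out):
            for j in range(W_out):
                # One output element = dot product over one receptive field
                y[co, i, j] = np.sum(x[:, i:i+K, j:j+K] * w[co])
    return y

x = np.arange(1., 10).reshape(1, 3, 3)   # 3x3 single-channel input, values 1..9
w = np.ones((1, 1, 2, 2))                # 2x2 box filter
print(direct_conv2d(x, w)[0])            # 2x2 output: [[12, 16], [24, 28]]
```

The triple-nested Python loop mirrors the memory-access problem: each output element re-reads an overlapping input window, with no reuse across iterations.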

Algorithm B: im2col + GEMM (cuDNN Default)

The key trick that transforms convolution into matrix multiplication, enabling cuBLAS's highly optimized GEMM to take over.

The im2col transformation:

Input feature map [3x3 image, 2x2 kernel]:

Original input:        im2col result:
┌─────────┐            ┌─────────────┐
│ 1  2  3 │  im2col    │ 1  2  4  5 │  (patch 0, pos 0,0)
│ 4  5  6 │ ─────────→ │ 2  3  5  6 │  (patch 1, pos 0,1)
│ 7  8  9 │            │ 4  5  7  8 │  (patch 2, pos 1,0)
└─────────┘            │ 5  6  8  9 │  (patch 3, pos 1,1)
                       └─────────────┘
Each row = all pixel values in one receptive field

Then: output = filter_matrix × im2col_matrix
      (standard GEMM → Tensor Cores fully utilized!)

Downside of im2col: requires additional memory to store the rearranged input (K^2 times the original input size).

# Python illustration of im2col behavior
import numpy as np

def im2col(input, kernel_h, kernel_w, stride=1, pad=0):
    N, C, H, W = input.shape
    out_h = (H + 2*pad - kernel_h) // stride + 1
    out_w = (W + 2*pad - kernel_w) // stride + 1

    # Apply padding
    img = np.pad(input, [(0,0),(0,0),(pad,pad),(pad,pad)], mode='constant')

    # Build im2col matrix: (N*out_h*out_w) x (C*kernel_h*kernel_w)
    col = np.zeros((N, C, kernel_h, kernel_w, out_h, out_w))
    for j in range(kernel_h):
        jj = j + stride * np.arange(out_h)
        for i in range(kernel_w):
            ii = i + stride * np.arange(out_w)
            col[:, :, j, i, :, :] = img[:, :, jj[:, None], ii[None, :]]

    col = col.transpose(0, 4, 5, 1, 2, 3).reshape(N*out_h*out_w, -1)
    return col
# cuDNN's actual im2col is far more optimized than this
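A tiny self-contained check makes the trick tangible: for the 3x3 input from the diagram and a hypothetical 2x2 filter, multiplying the im2col rows by the flattened filter reproduces direct convolution exactly:

```python
import numpy as np

# 3x3 input from the diagram, plus a hypothetical 2x2 filter
x = np.array([[1., 2, 3], [4, 5, 6], [7, 8, 9]])
w = np.array([[1., 0], [0, 1]])

# im2col: one row per 2x2 receptive field (4 patches for a 2x2 output)
col = np.array([x[i:i+2, j:j+2].ravel() for i in range(2) for j in range(2)])

# GEMM: (4 patches x 4 values) @ (4 filter taps) -> flattened 2x2 output
y_gemm = (col @ w.ravel()).reshape(2, 2)

# Direct convolution for comparison
y_direct = np.array([[np.sum(x[i:i+2, j:j+2] * w) for j in range(2)]
                     for i in range(2)])
assert np.allclose(y_gemm, y_direct)
```

The rows of `col` are exactly the four rows shown in the diagram; the convolution has become one matrix-vector product (a matrix-matrix product once there are multiple output channels).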

Algorithm C: Winograd Algorithm (Champion for Small Kernels)

The algorithm cuDNN defaults to for 3x3 convolutions. It applies Winograd's minimal filtering algorithm (1980) to deep learning.

Core idea: Use linear algebra transformations to drastically reduce the number of multiplications.

Standard 3x3 convolution (2x2 output):
- Input patch: 4x4 = 16 elements
- Kernel: 3x3 = 9 elements
- Output: 2x2 = 4 elements
- Multiplications required (naive): 4 x 9 = 36

After Winograd F(2x2, 3x3) transformation:
- Transformed input: 4x4 = 16 elements (linear transform, no multiplications)
- Element-wise multiplication: 16 operations (4x4)
- Inverse transform: 4x4 → 2x2 (linear transform, no multiplications)
- Multiplications required: 16
- Reduction: 36 → 16 = 2.25x fewer multiplications

Expressed mathematically:

Y = A^T [(G × g × G^T) element-wise-multiply (B^T × d × B)] A

Where:
d = input tile (4x4)
g = 3x3 kernel
B, G, A = fixed transform matrices (precomputed constants)
element-wise-multiply = Hadamard product

Key insight: G × g × G^T is computed once per kernel
             (can be precomputed at inference time)
             B^T × d × B is computed per input tile
             Element-wise product gives 16 multiplications
             instead of 36

cuDNN automatically selects Winograd for 3x3, stride=1 convolutions. Since ResNet, VGG, and most CNNs predominantly use 3x3 convolutions, this matters enormously in practice.
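The 1-D analogue F(2,3) has the same structure with small fixed matrices and can be verified in a few lines; the 2-D F(2x2,3x3) case used by cuDNN is obtained by nesting this transform along both axes. A NumPy sketch (the matrices are the standard F(2,3) constants from Winograd's construction):

```python
import numpy as np

# F(2,3): 2 outputs of a 3-tap filter from a 4-element tile,
# using 4 element-wise multiplications instead of 6
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)   # input transform
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])                # filter transform
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)    # inverse (output) transform

d = np.array([1., 2, 3, 4])   # input tile
g = np.array([1., 1, 1])      # filter

# Y = A^T [(G g) ⊙ (B^T d)] — only the Hadamard product costs multiplications
y_winograd = AT @ ((G @ g) * (BT @ d))

# Reference: sliding correlation, as convolution layers compute it
y_direct = np.array([d[0:3] @ g, d[1:4] @ g])
assert np.allclose(y_winograd, y_direct)   # both give [6, 9]
```

Note that `G @ g` depends only on the filter, matching the key insight above: it can be precomputed once per kernel at inference time.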

Algorithm D: FFT-Based Convolution (Large Kernels)

For large kernels like 7x7 or 11x11, convolution in the frequency domain is efficient.

Spatial domain convolution:    O(N × K^2)   (N = output size, K = kernel size)
Frequency domain convolution:  O(N × log N) (FFT + element-wise multiply)

The larger K is, the more FFT wins:
K=3:  9 ops vs log(N) ≈ 7   → marginal difference
K=11: 121 ops vs log(N) ≈ 7 → FFT uses ~17x fewer multiplications
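The equivalence behind this is the convolution theorem: convolution in the spatial domain equals element-wise multiplication of spectra in the frequency domain. A 1-D NumPy sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(64)    # signal
k = rng.random(11)    # large kernel (K=11)

# Zero-pad both to the full output length N+K-1, multiply spectra,
# and inverse-transform — this IS convolution, computed in O(N log N)
n = len(x) + len(k) - 1
y_fft = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)

# Direct O(N*K) convolution for comparison
y_direct = np.convolve(x, k)
assert np.allclose(y_fft, y_direct)
```

The padding to `N+K-1` is what turns the FFT's circular convolution into the linear convolution that a conv layer needs.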

3. cuDNN Auto-Tuner: What benchmark Mode Actually Does

# What does this single line actually do?
torch.backends.cudnn.benchmark = True

benchmark=False (default):

  • cuDNN selects an algorithm heuristically based on input size
  • Fast from the first run, but may not be the optimal choice

benchmark=True:

  • On the first forward pass, calls cudnnFindConvolutionForwardAlgorithm()
  • Actually runs and benchmarks ALL available algorithms for the current input size
  • Selects the fastest and caches it
  • All subsequent runs with the same input size use the cached algorithm
# See the benchmark effect directly
import torch
import time
from torchvision.models import resnet50

torch.backends.cudnn.benchmark = False
model = resnet50().cuda()
x = torch.randn(32, 3, 224, 224).cuda()

torch.cuda.synchronize()
t0 = time.time()
for _ in range(10): y = model(x)
torch.cuda.synchronize()
print(f"benchmark=False: {(time.time()-t0)*1000:.0f}ms")

torch.backends.cudnn.benchmark = True
# First run includes benchmarking (SLOW!)
torch.cuda.synchronize()
t0 = time.time()
y = model(x)  # This single call can take several seconds
torch.cuda.synchronize()
print(f"First run (includes benchmarking): {(time.time()-t0)*1000:.0f}ms")

# Subsequent runs use the cached optimal algorithm
torch.cuda.synchronize()
t0 = time.time()
for _ in range(10): y = model(x)
torch.cuda.synchronize()
print(f"benchmark=True (after warmup): {(time.time()-t0)*1000:.0f}ms")
# Typically 20-40% faster

Caution: If input sizes vary every batch, benchmark=True can actually be slower — it triggers a new benchmarking pass for each new size. Only use it when training/inference input sizes are fixed.


4. Batch Normalization and Kernel Fusion: Memory Traffic Reduction

Batch Normalization looks simple on paper:

y = gamma * (x - mean(x)) / sqrt(var(x) + eps) + beta

But implementing this as separate kernels is catastrophically inefficient:

Naive implementation memory round trips:
1. Load x from HBM
2. Compute mean → store to HBM
3. Load x + load mean (2 HBM reads)
4. Compute variance → store to HBM
5. Load x + load mean + load var (3 HBM reads)
6. Compute normalization → store to HBM
7. ReLU: load from HBM → compute → store to HBM

Total: ~10 HBM accesses

cuDNN fuses BN + ReLU into a single kernel:

Fused BN+ReLU kernel:
1. Load x into registers/shared memory (1 HBM read)
2. Compute mean, variance in registers (zero HBM access)
3. Normalize in registers (zero HBM access)
4. ReLU in registers (zero HBM access)
5. Store result to HBM (1 HBM write)

Total: 2 HBM accesses (5x reduction!)
# Using fused BN+ReLU in PyTorch
import torch.nn as nn

# Naive: Conv → BN → ReLU (3 separate kernel launches)
class NaiveBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Conv2d(c, c, 3, padding=1)
        self.bn = nn.BatchNorm2d(c)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)    # Separate cuDNN kernel
        x = self.relu(x)  # Separate elementwise kernel
        return x

# With torch.compile(), the compiler can fuse these patterns
# automatically (via TorchInductor and the cuDNN Graph API)
model = torch.compile(NaiveBlock(64).cuda())  # fuses conv+bn+relu

Modern cuDNN (v8+) supports more complex fusions through its Graph API, and torch.compile() leverages this automatically.


5. Attention and FlashAttention: Going Beyond cuDNN

Standard Transformer attention implemented straightforwardly:

# Standard Attention — memory inefficient
def standard_attention(Q, K, V, scale):
    # S = Q × K^T: creates N×N attention score matrix
    S = torch.matmul(Q, K.transpose(-2, -1)) * scale  # O(N^2) memory!
    # Softmax
    P = torch.softmax(S, dim=-1)
    # Multiply by V
    O = torch.matmul(P, V)
    return O

# Problem: S requires O(N^2) memory
# N=8192 (document length), FP16:
# S size = 8192 * 8192 * 2 bytes = 128MB per head!

FlashAttention solves this with a completely new CUDA kernel that bypasses cuDNN entirely.

FlashAttention's core idea: Never materialize the N×N attention matrix in HBM

Standard attention memory flow:
Q, K, V → read from HBM
S = QK^T → write to HBM (N^2 size!)
softmax(S) → write to HBM
P × V → write to HBM
Total HBM accesses: O(N^2)

FlashAttention memory flow:
Load Q, K, V in tiles → into shared memory
Compute partial attention sums in shared memory
  (never creates full N^2 matrix)
Write only the final output to HBM
Total HBM accesses: O(N) → from N^2 down to N!

The tiling trick in detail:

FlashAttention tiling:

Partition Q into blocks Q_1, Q_2, ... Q_Tc
Partition K, V into blocks K_1, V_1, ... K_Tr, V_Tr

for i in range(Tc):
  Load Q_i into SRAM
  O_i = 0, l_i = 0, m_i = -inf  (softmax statistics init)

  for j in range(Tr):
    Load K_j, V_j into SRAM

    # Compute entirely in SRAM (zero HBM access)
    S_ij = Q_i × K_j^T  (tile-sized attention scores)
    m_ij = max(m_i, rowmax(S_ij))  (numerical stability)
    P_ij = exp(S_ij - m_ij)

    # Online softmax update (numerically stable!)
    O_i = diag(exp(m_i - m_ij)) × O_i + P_ij × V_j
    l_i = exp(m_i - m_ij) × l_i + rowsum(P_ij)
    m_i = m_ij

  # Final normalization, then write to HBM (1 write)
  O_i = diag(l_i)^(-1) × O_i → store to HBM
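The online-softmax update above can be checked numerically. This NumPy sketch (single head, illustrative tile size; `l` and `m` follow the pseudocode's notation) never forms the full N×N score matrix, yet matches standard attention exactly:

```python
import numpy as np

def standard_attention(Q, K, V):
    S = Q @ K.T                                   # full N x N score matrix
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, tile=4):
    # FlashAttention-style accumulation: visit K/V in tiles, carrying a
    # running max (m), normalizer (l), and unnormalized output (O)
    N, d = Q.shape
    O = np.zeros((N, d))
    l = np.zeros(N)
    m = np.full(N, -np.inf)
    for j in range(0, N, tile):
        Kj, Vj = K[j:j+tile], V[j:j+tile]
        S = Q @ Kj.T                              # only an (N, tile) score tile
        m_new = np.maximum(m, S.max(axis=1))
        P = np.exp(S - m_new[:, None])
        scale = np.exp(m - m_new)                 # rescale old accumulators
        O = scale[:, None] * O + P @ Vj
        l = scale * l + P.sum(axis=1)
        m = m_new
    return O / l[:, None]                        # final normalization

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
assert np.allclose(tiled_attention(Q, K, V), standard_attention(Q, K, V))
```

On the GPU, the per-tile work lives in SRAM and only `O` is written back; the rescaling by `exp(m_i - m_ij)` is what makes deferring normalization numerically safe.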

Results:

  • Memory usage: O(N^2) → O(N) (saves hundreds of MB at N=8192)
  • Speed: 2-4x faster (thanks to reduced HBM traffic)

6. Memory Layout: NCHW vs NHWC

The memory layout of deep learning tensors has a decisive impact on performance.

NCHW layout (batch × channel × height × width):
Batch=1, channels=3 (RGB), 4x4 image:

Memory arrangement:
[R00 R01 R02 R03 | R10 R11 ... | R30 R31 R32 R33 |
 G00 G01 G02 G03 | G10 G11 ... | G30 G31 G32 G33 |
 B00 B01 B02 B03 | B10 B11 ... | B30 B31 B32 B33]

Pixels within the same channel are contiguous

NHWC layout (batch × height × width × channel):
Memory arrangement:
[R00 G00 B00 | R01 G01 B01 | R02 G02 B02 | R03 G03 B03 |
 R10 G10 B10 | R11 G11 B11 | ...
 R30 G30 B30 | R31 G31 B31 | R32 G32 B32 | R33 G33 B33]

All channels at the same spatial position are contiguous

Why Tensor Cores prefer NHWC:

Tensor Cores process 16x16 matrix tiles. Building a 16x16 tile for convolution under NCHW results in discontinuous memory accesses across channels. Under NHWC, channels are contiguous, making tile loading a sequential memory read.
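The layout difference shows up directly in array strides. A NumPy sketch (strides are in bytes):

```python
import numpy as np

# One 4x4 RGB image in both layouts; strides reveal which axis is
# contiguous in memory
nchw = np.zeros((1, 3, 4, 4))
nhwc = np.ascontiguousarray(nchw.transpose(0, 2, 3, 1))

# NCHW: stepping one channel jumps over a whole 4x4 plane (16 elements)
assert nchw.strides[1] == 16 * nchw.itemsize
# NHWC: the three channel values of one pixel sit side by side
assert nhwc.strides[-1] == nhwc.itemsize
```

This is the same mechanism behind PyTorch's `channels_last`: the tensor's logical shape stays NCHW, but the strides are rearranged so channels are the fastest-varying axis.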

# Using NHWC (channels_last) in PyTorch
x_nchw = torch.randn(32, 64, 56, 56, device='cuda')

# Convert to NHWC (channels_last format)
x_nhwc = x_nchw.to(memory_format=torch.channels_last)

# Convert model to channels_last format
model = model.to(memory_format=torch.channels_last)

# Forward pass will auto-select NHWC kernels in cuDNN
output = model(x_nhwc)

# Benchmark comparison
# Measured on H100:
# NCHW: ~12ms/batch
# NHWC: ~9ms/batch  (~25% faster)

7. TensorRT: Beyond cuDNN for Inference

TensorRT is an inference optimization engine built on top of cuDNN. It runs trained models at maximum performance in deployment environments.

TensorRT's optimization pipeline:

Original model (ONNX/PyTorch)
  ↓ Graph analysis
  ↓ Layer fusion
  ┌─────────────────────────────────────────────┐
  │ Conv → BN → ReLU → Conv → BN → ReLU         │
  │           ↓ after fusion                    │
  │     single optimized kernel per group       │
  └─────────────────────────────────────────────┘
  ↓ Precision selection (FP32 → FP16 → INT8)
  ↓ Kernel auto-selection (benchmarked per input size)
  ↓ Optimized execution engine

Performance across precision levels (ResNet-50, batch=32, H100):

Precision   Latency   Throughput      Accuracy Loss
───────────────────────────────────────────────────
FP32        4.2ms     7,600 img/s     baseline
FP16        1.8ms     17,800 img/s    negligible
INT8        0.9ms     35,500 img/s    ~0.1% Top-1
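Entropy calibration's job is choosing the clipping range per tensor; the quantize/dequantize step itself is simple arithmetic. A NumPy sketch using the naive max-abs rule (TensorRT's calibrators choose tighter ranges than this):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000).astype(np.float32)   # stand-in activations

# Symmetric quantization: map [-max|x|, +max|x|] onto the int8 range [-127, 127]
scale = np.abs(x).max() / 127.0
q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
x_hat = q.astype(np.float32) * scale               # dequantized approximation

# With no clipping, the error is at most half a quantization step
assert np.abs(x - x_hat).max() <= 0.5 * scale + 1e-6
```

The payoff of a smarter calibrated range is a smaller `scale` for the bulk of the distribution, trading a little clipping of outliers for finer resolution everywhere else.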

INT8 quantization with calibration:

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # initializes the CUDA context

logger = trt.Logger(trt.Logger.WARNING)

# INT8 calibrator setup
class MyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, data_loader, cache_file):
        super().__init__()
        self.data_loader = iter(data_loader)
        self.cache_file = cache_file
        self.batch_allocation = None

    def get_batch_size(self):
        return 32

    def get_batch(self, names):
        try:
            batch = next(self.data_loader)[0].numpy()
            # Calibrate activation ranges using representative dataset
            # TensorRT maps FP32 range of each layer to INT8
            if self.batch_allocation is None:
                self.batch_allocation = cuda.mem_alloc(batch.nbytes)
            cuda.memcpy_htod(self.batch_allocation, batch)
            return [int(self.batch_allocation)]
        except StopIteration:
            return None

    def read_calibration_cache(self):
        # Reuse a previous calibration run if the cache file exists
        try:
            with open(self.cache_file, 'rb') as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, 'wb') as f:
            f.write(cache)

# Builder configuration
builder = trt.Builder(logger)
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)  # Enable INT8
config.int8_calibrator = MyCalibrator(calibration_loader, 'cache.bin')

8. Tracing cuDNN Calls in a Real LLM

Let's trace what kernels actually get called when running GPT-2 or LLaMA:

# Trace LLM operations using PyTorch Profiler
import torch
from torch.profiler import profile, ProfilerActivity
from transformers import GPT2Model

model = GPT2Model.from_pretrained('gpt2').cuda()
input_ids = torch.randint(0, 50257, (1, 512)).cuda()

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True
) as prof:
    output = model(input_ids)

# Print top CUDA kernels
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

What actually runs:

GPT-2 Forward Pass Internal CUDA Kernels (512 tokens, batch=1):

Operation       Kernel                         Time   Share
──────────────────────────────────────────────────────────
Linear (QKV)    cublasSgemm (Tensor Core)      2.1ms  35%
Attention       flash_attn_fwd_kernel          1.4ms  23%
Linear (output) cublasSgemm (Tensor Core)      0.9ms  15%
LayerNorm       layer_norm_kernel              0.3ms   5%
GELU            vectorized_elementwise_kernel  0.2ms   3%
Residual Add    vectorized_elementwise_kernel  0.1ms   2%
Embedding       Embedding_cuda                 0.1ms   2%
Other                                         0.9ms  15%

Total: ~6ms (on H100)

Key observations:

  • Linear layers = cuBLAS GEMM: About 50% of LLM computation is matrix multiplication
  • Attention = FlashAttention: A custom CUDA kernel, NOT cuDNN
  • LayerNorm = custom fused kernel: Mean + variance + normalization in one kernel
  • Activation functions = elementwise kernels: Very fast (memory bandwidth bound)

9. Summary: The Sources of cuDNN's Performance Advantage

Why cuDNN is 10-100x faster than raw CUDA:

  1. Algorithm optimization: Mathematically more efficient algorithms — Winograd, im2col+GEMM — chosen per situation
  2. Kernel fusion: Multiple operations combined into one kernel to minimize HBM round trips
  3. Memory layout optimization: NHWC + Tensor Core activation for maximum hardware efficiency
  4. Auto-Tuner: Benchmarks actual hardware to select the optimal implementation per input size
  5. FlashAttention-style innovations: O(N^2) → O(N) memory access transforms attention computation

Behind a single line of PyTorch lies decades of optimization research. Understanding these internals is the starting point for LLM serving optimization, writing custom CUDA kernels, and making informed hardware decisions.