CNN Architecture Complete Guide: From LeNet to EfficientNet and Vision Transformers
- Author: Youngju Kim (@fjvbn20031)
Convolutional Neural Networks (CNNs) are the backbone of the computer vision revolution. From LeNet in 1998 to Vision Transformers in 2020 and ConvNeXt in 2022, CNN architectures have evolved at a remarkable pace. This guide walks through the structural innovations behind each major CNN architecture and shows how to implement them in PyTorch.
1. CNN Fundamentals
Understanding Convolution Intuitively
Convolution is an operation that extracts local patterns from an image. A small filter (kernel) slides over the image to produce a feature map.
Input image (5x5) Kernel (3x3) Output feature map (3x3)
1 1 1 0 0 1 0 1 4 3 4
0 1 1 1 0 * 0 1 0 = 2 4 3
0 0 1 1 1 1 0 1 2 3 4
0 0 1 1 0
0 1 1 0 0
At each position, the output is the sum of element-wise products between the kernel and the image patch.
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
# Visualize convolution
def visualize_convolution():
image = torch.tensor([[
[1., 1., 1., 0., 0.],
[0., 1., 1., 1., 0.],
[0., 0., 1., 1., 1.],
[0., 0., 1., 1., 0.],
[0., 1., 1., 0., 0.]
]]).unsqueeze(0) # (1, 1, 5, 5)
# Edge detection kernel
edge_kernel = torch.tensor([[
[[-1., -1., -1.],
[-1., 8., -1.],
[-1., -1., -1.]]
]]) # (1, 1, 3, 3)
output = F.conv2d(image, edge_kernel, padding=1)
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].imshow(image[0, 0].numpy(), cmap='gray')
axes[0].set_title('Input Image')
axes[1].imshow(edge_kernel[0, 0].numpy(), cmap='RdYlBu')
axes[1].set_title('Edge Detection Kernel')
axes[2].imshow(output[0, 0].detach().numpy(), cmap='gray')
axes[2].set_title('Output Feature Map')
plt.tight_layout()
plt.show()
Kernel, Stride, and Padding
import torch
import torch.nn as nn
# Basic Conv2d parameters
conv = nn.Conv2d(
in_channels=3, # number of input channels (RGB=3)
out_channels=64, # number of output channels (number of filters)
kernel_size=3, # kernel size (3x3)
stride=1, # stride
padding=1, # padding (same padding)
bias=True
)
# Output size formula
# H_out = floor((H_in + 2*padding - kernel_size) / stride + 1)
def calc_output_size(input_size, kernel_size, stride, padding):
return (input_size + 2 * padding - kernel_size) // stride + 1
print(calc_output_size(224, 3, 1, 1)) # 224 (same padding)
print(calc_output_size(224, 3, 2, 1)) # 112 (stride 2, halves size)
print(calc_output_size(224, 7, 2, 3)) # 112 (ResNet stem: 7x7 conv, stride 2)
# Parameter count
# Conv2d: (kernel_h * kernel_w * in_channels + 1) * out_channels
params = (3 * 3 * 3 + 1) * 64
print(f"Conv(3->64, 3x3) parameters: {params:,}") # 1,792
Pooling (Max, Average, Global)
import torch
import torch.nn as nn
x = torch.randn(1, 64, 28, 28)
# Max Pooling
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
out_max = max_pool(x) # (1, 64, 14, 14)
# Average Pooling
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
out_avg = avg_pool(x) # (1, 64, 14, 14)
# Global Average Pooling (GAP) - collapses spatial dimensions to 1x1
gap = nn.AdaptiveAvgPool2d(1)
out_gap = gap(x) # (1, 64, 1, 1)
out_gap_flat = out_gap.flatten(1) # (1, 64)
# Adaptive Pooling - specify output size
adaptive = nn.AdaptiveAvgPool2d((7, 7))
out_adaptive = adaptive(x) # (1, 64, 7, 7) regardless of input size
print(f"Input: {x.shape}")
print(f"MaxPool: {out_max.shape}")
print(f"GAP: {out_gap_flat.shape}")
Receptive Field Calculation
def calculate_receptive_field(layers):
"""
Calculate receptive field for each layer.
layers: list of (kernel_size, stride, dilation)
"""
rf = 1
jump = 1
for k, s, d in layers:
effective_k = d * (k - 1) + 1
rf = rf + (effective_k - 1) * jump
jump = jump * s
return rf
# VGG-style (3x3 convolutions only)
vgg_layers = [
(3, 1, 1), # conv1
(3, 1, 1), # conv2
(2, 2, 1), # pool
(3, 1, 1), # conv3
(3, 1, 1), # conv4
(2, 2, 1), # pool
]
rf = calculate_receptive_field(vgg_layers)
print(f"Receptive field after 6 VGG layers: {rf}x{rf} pixels")
# Note: two 3x3 convs = same receptive field as one 5x5
# But parameters are 2*(9*C^2) vs 25*C^2 — two 3x3s use 28% fewer params
2. CNN Architecture History
LeNet-5 (1998, LeCun) — The First Practical CNN
LeNet-5, developed by Yann LeCun in 1998, was the first practical CNN, designed for handwritten digit recognition (MNIST).
Architecture: Input(32x32) -> C1(conv, 6@28x28) -> S2(pool, 6@14x14) -> C3(conv, 16@10x10) -> S4(pool, 16@5x5) -> C5(conv, 120@1x1) -> F6(fc, 84) -> Output(10)
import torch
import torch.nn as nn
class LeNet5(nn.Module):
"""LeNet-5 (with ReLU added to original)"""
def __init__(self, num_classes=10):
super(LeNet5, self).__init__()
self.features = nn.Sequential(
# C1: 1@32x32 -> 6@28x28
nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=0),
nn.Tanh(),
# S2: 6@28x28 -> 6@14x14
nn.AvgPool2d(kernel_size=2, stride=2),
# C3: 6@14x14 -> 16@10x10
nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0),
nn.Tanh(),
# S4: 16@10x10 -> 16@5x5
nn.AvgPool2d(kernel_size=2, stride=2),
# C5: 16@5x5 -> 120@1x1
nn.Conv2d(16, 120, kernel_size=5, stride=1, padding=0),
nn.Tanh(),
)
self.classifier = nn.Sequential(
nn.Linear(120, 84),
nn.Tanh(),
nn.Linear(84, num_classes)
)
def forward(self, x):
x = self.features(x)
x = x.flatten(1)
x = self.classifier(x)
return x
model = LeNet5(num_classes=10)
x = torch.randn(4, 1, 32, 32)
out = model(x)
print(f"LeNet-5 output: {out.shape}") # (4, 10)
total_params = sum(p.numel() for p in model.parameters())
print(f"LeNet-5 total parameters: {total_params:,}") # ~60,000
AlexNet (2012, Krizhevsky) — The Deep Learning Renaissance
AlexNet won the 2012 ImageNet competition with a top-5 error rate of 15.3%, obliterating the previous best of 26.2% and launching the deep learning era.
Key innovations:
- ReLU activation (6x faster training than Tanh)
- Dropout (0.5) to prevent overfitting
- Data augmentation (crops, flips)
- Local Response Normalization (LRN)
- Dual-GPU training
import torch
import torch.nn as nn
class AlexNet(nn.Module):
"""AlexNet implementation"""
def __init__(self, num_classes=1000):
super(AlexNet, self).__init__()
self.features = nn.Sequential(
# Layer 1: 3@224x224 -> 96@55x55
nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),
nn.ReLU(inplace=True),
nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
nn.MaxPool2d(kernel_size=3, stride=2), # 96@27x27
# Layer 2: 96@27x27 -> 256@27x27
nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2),
nn.ReLU(inplace=True),
nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
nn.MaxPool2d(kernel_size=3, stride=2), # 256@13x13
# Layer 3
nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1),
nn.ReLU(inplace=True),
# Layer 4
nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),
nn.ReLU(inplace=True),
# Layer 5: 384@13x13 -> 256@13x13
nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),
nn.ReLU(inplace=True),
nn.MaxPool2d(kernel_size=3, stride=2), # 256@6x6
)
self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
self.classifier = nn.Sequential(
nn.Dropout(p=0.5),
nn.Linear(256 * 6 * 6, 4096),
nn.ReLU(inplace=True),
nn.Dropout(p=0.5),
nn.Linear(4096, 4096),
nn.ReLU(inplace=True),
nn.Linear(4096, num_classes)
)
def forward(self, x):
x = self.features(x)
x = self.avgpool(x)
x = x.flatten(1)
x = self.classifier(x)
return x
model = AlexNet(num_classes=1000)
x = torch.randn(4, 3, 224, 224)
out = model(x)
print(f"AlexNet output: {out.shape}") # (4, 1000)
total_params = sum(p.numel() for p in model.parameters())
print(f"AlexNet parameters: {total_params:,}") # ~61M
VGGNet (2014, Simonyan) — The Power of Depth
VGGNet from Oxford's Visual Geometry Group uses exclusively 3x3 kernels throughout, allowing dramatically increased depth.
Why 3x3?
- Two 3x3 convolutions = same receptive field as one 5x5 (saves 28% of parameters)
- Three 3x3 convolutions = same receptive field as one 7x7 (saves 45% of parameters)
- More non-linear transformations increase representational capacity
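These savings are easy to verify with quick arithmetic; the sketch below counts weights for C input and C output channels, ignoring biases:

```python
# Weight count of a k x k conv with C input and C output channels: k*k*C*C
C = 64
one_5x5 = 5 * 5 * C * C
two_3x3 = 2 * (3 * 3 * C * C)
one_7x7 = 7 * 7 * C * C
three_3x3 = 3 * (3 * 3 * C * C)

print(f"two 3x3 vs one 5x5:   {1 - two_3x3 / one_5x5:.0%} fewer weights")    # 28%
print(f"three 3x3 vs one 7x7: {1 - three_3x3 / one_7x7:.0%} fewer weights")  # 45%
```

The ratios (18/25 and 27/49) are independent of C, which is why the savings hold at every stage of the network.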
import torch
import torch.nn as nn
from typing import List, Union
class VGG(nn.Module):
"""General VGG implementation"""
def __init__(self, features: nn.Module, num_classes: int = 1000, dropout: float = 0.5):
super(VGG, self).__init__()
self.features = features
self.avgpool = nn.AdaptiveAvgPool2d((7, 7))
self.classifier = nn.Sequential(
nn.Linear(512 * 7 * 7, 4096),
nn.ReLU(inplace=True),
nn.Dropout(p=dropout),
nn.Linear(4096, 4096),
nn.ReLU(inplace=True),
nn.Dropout(p=dropout),
nn.Linear(4096, num_classes)
)
self._initialize_weights()
def forward(self, x):
x = self.features(x)
x = self.avgpool(x)
x = x.flatten(1)
x = self.classifier(x)
return x
def _initialize_weights(self):
for m in self.modules():
if isinstance(m, nn.Conv2d):
nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
if m.bias is not None:
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.BatchNorm2d):
nn.init.constant_(m.weight, 1)
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.Linear):
nn.init.normal_(m.weight, 0, 0.01)
nn.init.constant_(m.bias, 0)
def make_layers(cfg: List[Union[str, int]], batch_norm: bool = False) -> nn.Sequential:
layers: List[nn.Module] = []
in_channels = 3
for v in cfg:
if v == 'M':
layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
else:
v = int(v)
conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
if batch_norm:
layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
else:
layers += [conv2d, nn.ReLU(inplace=True)]
in_channels = v
return nn.Sequential(*layers)
cfgs = {
'vgg16': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
'vgg19': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}
def vgg16(num_classes=1000):
return VGG(make_layers(cfgs['vgg16'], batch_norm=True), num_classes=num_classes)
model_vgg16 = vgg16()
x = torch.randn(2, 3, 224, 224)
out = model_vgg16(x)
print(f"VGG-16 output: {out.shape}")
params = sum(p.numel() for p in model_vgg16.parameters())
print(f"VGG-16 parameters: {params:,}") # ~138M
GoogLeNet/Inception (2014, Szegedy) — Multi-scale Parallel Processing
The Inception module's key idea is to process different kernel sizes (1x1, 3x3, 5x5) in parallel, capturing features at multiple scales simultaneously.
import torch
import torch.nn as nn
class InceptionModule(nn.Module):
"""Basic Inception module"""
def __init__(self, in_channels, n1x1, n3x3_reduce, n3x3,
n5x5_reduce, n5x5, pool_proj):
super(InceptionModule, self).__init__()
# 1x1 branch
self.branch1 = nn.Sequential(
nn.Conv2d(in_channels, n1x1, kernel_size=1),
nn.BatchNorm2d(n1x1),
nn.ReLU(inplace=True)
)
# 1x1 bottleneck + 3x3
self.branch2 = nn.Sequential(
nn.Conv2d(in_channels, n3x3_reduce, kernel_size=1),
nn.BatchNorm2d(n3x3_reduce),
nn.ReLU(inplace=True),
nn.Conv2d(n3x3_reduce, n3x3, kernel_size=3, padding=1),
nn.BatchNorm2d(n3x3),
nn.ReLU(inplace=True)
)
# 1x1 bottleneck + 5x5
self.branch3 = nn.Sequential(
nn.Conv2d(in_channels, n5x5_reduce, kernel_size=1),
nn.BatchNorm2d(n5x5_reduce),
nn.ReLU(inplace=True),
nn.Conv2d(n5x5_reduce, n5x5, kernel_size=5, padding=2),
nn.BatchNorm2d(n5x5),
nn.ReLU(inplace=True)
)
# MaxPool + 1x1
self.branch4 = nn.Sequential(
nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
nn.Conv2d(in_channels, pool_proj, kernel_size=1),
nn.BatchNorm2d(pool_proj),
nn.ReLU(inplace=True)
)
def forward(self, x):
b1 = self.branch1(x)
b2 = self.branch2(x)
b3 = self.branch3(x)
b4 = self.branch4(x)
return torch.cat([b1, b2, b3, b4], dim=1)
module = InceptionModule(192, 64, 96, 128, 16, 32, 32)
x = torch.randn(2, 192, 28, 28)
out = module(x)
print(f"Inception output: {out.shape}") # (2, 256, 28, 28)
ResNet (2015, He) — Solving the Vanishing Gradient with Residual Connections
ResNet, introduced by Kaiming He et al. in 2015, uses skip connections that let gradients flow unimpeded through very deep networks, enabling training of networks with 152 layers.
Core idea: H(x) = F(x) + x
Instead of learning H(x) directly, each layer learns the residual F(x) = H(x) - x. When the optimal mapping is close to the identity, driving F(x) toward zero is much easier.
import torch
import torch.nn as nn
from typing import Optional, Type, List
class BasicBlock(nn.Module):
"""Basic block for ResNet-18/34"""
expansion = 1
def __init__(self, in_channels, out_channels, stride=1, downsample=None):
super(BasicBlock, self).__init__()
self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
stride=stride, padding=1, bias=False)
self.bn1 = nn.BatchNorm2d(out_channels)
self.relu = nn.ReLU(inplace=True)
self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
stride=1, padding=1, bias=False)
self.bn2 = nn.BatchNorm2d(out_channels)
self.downsample = downsample
def forward(self, x):
identity = x
out = self.relu(self.bn1(self.conv1(x)))
out = self.bn2(self.conv2(out))
if self.downsample is not None:
identity = self.downsample(x)
out += identity # The residual connection
out = self.relu(out)
return out
class Bottleneck(nn.Module):
"""Bottleneck block for ResNet-50/101/152"""
expansion = 4
def __init__(self, in_channels, out_channels, stride=1, downsample=None):
super(Bottleneck, self).__init__()
# 1x1 (reduce channels)
self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
self.bn1 = nn.BatchNorm2d(out_channels)
# 3x3 (spatial processing)
self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
stride=stride, padding=1, bias=False)
self.bn2 = nn.BatchNorm2d(out_channels)
# 1x1 (expand channels: out_channels * 4)
self.conv3 = nn.Conv2d(out_channels, out_channels * self.expansion,
kernel_size=1, bias=False)
self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)
self.relu = nn.ReLU(inplace=True)
self.downsample = downsample
def forward(self, x):
identity = x
out = self.relu(self.bn1(self.conv1(x)))
out = self.relu(self.bn2(self.conv2(out)))
out = self.bn3(self.conv3(out))
if self.downsample is not None:
identity = self.downsample(x)
out += identity
out = self.relu(out)
return out
class ResNet(nn.Module):
"""Complete ResNet implementation"""
def __init__(self, block, layers, num_classes=1000):
super(ResNet, self).__init__()
self.in_channels = 64
# Stem
self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
self.bn1 = nn.BatchNorm2d(64)
self.relu = nn.ReLU(inplace=True)
self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
# 4 stages
self.layer1 = self._make_layer(block, 64, layers[0])
self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
# Classifier
self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
self.fc = nn.Linear(512 * block.expansion, num_classes)
self._initialize_weights()
def _make_layer(self, block, out_channels, blocks, stride=1):
downsample = None
if stride != 1 or self.in_channels != out_channels * block.expansion:
downsample = nn.Sequential(
nn.Conv2d(self.in_channels, out_channels * block.expansion,
kernel_size=1, stride=stride, bias=False),
nn.BatchNorm2d(out_channels * block.expansion)
)
layers = [block(self.in_channels, out_channels, stride, downsample)]
self.in_channels = out_channels * block.expansion
for _ in range(1, blocks):
layers.append(block(self.in_channels, out_channels))
return nn.Sequential(*layers)
def _initialize_weights(self):
for m in self.modules():
if isinstance(m, nn.Conv2d):
nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
elif isinstance(m, nn.BatchNorm2d):
nn.init.constant_(m.weight, 1)
nn.init.constant_(m.bias, 0)
def forward(self, x):
x = self.maxpool(self.relu(self.bn1(self.conv1(x))))
x = self.layer1(x)
x = self.layer2(x)
x = self.layer3(x)
x = self.layer4(x)
x = self.avgpool(x)
x = x.flatten(1)
x = self.fc(x)
return x
def resnet18(num_classes=1000):
return ResNet(BasicBlock, [2, 2, 2, 2], num_classes)
def resnet34(num_classes=1000):
return ResNet(BasicBlock, [3, 4, 6, 3], num_classes)
def resnet50(num_classes=1000):
return ResNet(Bottleneck, [3, 4, 6, 3], num_classes)
def resnet101(num_classes=1000):
return ResNet(Bottleneck, [3, 4, 23, 3], num_classes)
def resnet152(num_classes=1000):
return ResNet(Bottleneck, [3, 8, 36, 3], num_classes)
# Test
for name, model_fn in [('ResNet-18', resnet18), ('ResNet-50', resnet50)]:
model = model_fn()
x = torch.randn(2, 3, 224, 224)
out = model(x)
params = sum(p.numel() for p in model.parameters())
print(f"{name}: output={out.shape}, params={params:,}")
DenseNet (2017, Huang) — Dense Connectivity
DenseNet connects each layer to every previous layer. With L layers, ResNet has L connections but DenseNet has L(L+1)/2 connections.
import torch
import torch.nn as nn
import torch.nn.functional as F
class DenseLayer(nn.Module):
"""A single DenseNet layer"""
def __init__(self, in_channels, growth_rate, bn_size=4, drop_rate=0.0):
super(DenseLayer, self).__init__()
# Bottleneck: 1x1 conv to limit channels
self.norm1 = nn.BatchNorm2d(in_channels)
self.relu1 = nn.ReLU(inplace=True)
self.conv1 = nn.Conv2d(in_channels, bn_size * growth_rate, kernel_size=1, bias=False)
# 3x3 conv
self.norm2 = nn.BatchNorm2d(bn_size * growth_rate)
self.relu2 = nn.ReLU(inplace=True)
self.conv2 = nn.Conv2d(bn_size * growth_rate, growth_rate,
kernel_size=3, padding=1, bias=False)
self.drop_rate = drop_rate
def forward(self, x):
if isinstance(x, torch.Tensor):
prev_features = [x]
else:
prev_features = x
# Concat all previous feature maps
concat_input = torch.cat(prev_features, dim=1)
out = self.conv1(self.relu1(self.norm1(concat_input)))
out = self.conv2(self.relu2(self.norm2(out)))
if self.drop_rate > 0:
out = F.dropout(out, p=self.drop_rate, training=self.training)
return out
class DenseBlock(nn.Module):
"""Dense Block composed of multiple DenseLayers"""
def __init__(self, num_layers, in_channels, growth_rate, bn_size=4, drop_rate=0.0):
super(DenseBlock, self).__init__()
self.layers = nn.ModuleList()
for i in range(num_layers):
layer = DenseLayer(
in_channels + i * growth_rate,
growth_rate, bn_size, drop_rate
)
self.layers.append(layer)
def forward(self, x):
features = [x]
for layer in self.layers:
new_feat = layer(features)
features.append(new_feat)
return torch.cat(features, dim=1)
class TransitionLayer(nn.Module):
"""Transition layer between Dense Blocks (compression + downsampling)"""
def __init__(self, in_channels, out_channels):
super(TransitionLayer, self).__init__()
self.norm = nn.BatchNorm2d(in_channels)
self.relu = nn.ReLU(inplace=True)
self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
self.pool = nn.AvgPool2d(kernel_size=2, stride=2)
def forward(self, x):
return self.pool(self.conv(self.relu(self.norm(x))))
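Putting the pieces together, channel counts follow simple bookkeeping: each dense block adds num_layers * growth_rate channels, and each transition layer halves them. A standalone sketch assuming the standard DenseNet-121 configuration (growth rate 32, blocks of 6/12/24/16 layers, 0.5 compression):

```python
# Channel bookkeeping for a DenseNet-121-style network (no modules needed).
growth_rate = 32
channels = 64  # channels after the stem
for i, num_layers in enumerate((6, 12, 24, 16)):
    channels += num_layers * growth_rate        # each DenseLayer appends growth_rate maps
    print(f"block {i + 1} ({num_layers} layers): {channels} channels")
    if i < 3:
        channels //= 2                          # TransitionLayer halves the channels
print(f"final feature channels: {channels}")    # 1024, matching DenseNet-121
```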
MobileNet (2017, Howard) — Lightweight for Edge Devices
MobileNet introduced Depthwise Separable Convolutions, drastically reducing computation while maintaining accuracy — ideal for mobile and edge deployment.
import torch
import torch.nn as nn
class DepthwiseSeparableConv(nn.Module):
"""Depthwise Separable Convolution"""
def __init__(self, in_channels, out_channels, stride=1):
super(DepthwiseSeparableConv, self).__init__()
# Depthwise: process each input channel independently
self.depthwise = nn.Sequential(
nn.Conv2d(in_channels, in_channels, kernel_size=3,
stride=stride, padding=1, groups=in_channels, bias=False),
nn.BatchNorm2d(in_channels),
nn.ReLU6(inplace=True)
)
# Pointwise: 1x1 conv to combine channels
self.pointwise = nn.Sequential(
nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
nn.BatchNorm2d(out_channels),
nn.ReLU6(inplace=True)
)
def forward(self, x):
x = self.depthwise(x)
x = self.pointwise(x)
return x
class InvertedResidual(nn.Module):
"""MobileNetV2 inverted residual block"""
def __init__(self, in_channels, out_channels, stride, expand_ratio):
super(InvertedResidual, self).__init__()
self.stride = stride
hidden_dim = int(in_channels * expand_ratio)
self.use_res_connect = (stride == 1 and in_channels == out_channels)
layers = []
if expand_ratio != 1:
layers += [
nn.Conv2d(in_channels, hidden_dim, 1, bias=False),
nn.BatchNorm2d(hidden_dim),
nn.ReLU6(inplace=True)
]
layers += [
nn.Conv2d(hidden_dim, hidden_dim, 3, stride=stride,
padding=1, groups=hidden_dim, bias=False),
nn.BatchNorm2d(hidden_dim),
nn.ReLU6(inplace=True),
nn.Conv2d(hidden_dim, out_channels, 1, bias=False),
nn.BatchNorm2d(out_channels)
]
self.conv = nn.Sequential(*layers)
def forward(self, x):
if self.use_res_connect:
return x + self.conv(x)
else:
return self.conv(x)
# Parameter savings
standard_conv_params = 3 * 3 * 512 * 512 # standard convolution
dw_sep_params = (3 * 3 * 512) + (512 * 512) # depthwise separable
print(f"Standard conv: {standard_conv_params:,}")
print(f"Depthwise Separable: {dw_sep_params:,}")
print(f"Savings: {(1 - dw_sep_params/standard_conv_params):.1%}")
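The same comparison can be made with real modules; `groups=in_channels` is what turns a standard `nn.Conv2d` into a depthwise convolution. A small shape-and-parameter check (the channel sizes here are arbitrary):

```python
import torch
import torch.nn as nn

C_in, C_out = 32, 64
depthwise = nn.Conv2d(C_in, C_in, kernel_size=3, padding=1, groups=C_in, bias=False)
pointwise = nn.Conv2d(C_in, C_out, kernel_size=1, bias=False)
standard = nn.Conv2d(C_in, C_out, kernel_size=3, padding=1, bias=False)

x = torch.randn(1, C_in, 56, 56)
assert pointwise(depthwise(x)).shape == standard(x).shape  # identical output shape

dw_params = sum(p.numel() for p in depthwise.parameters()) + \
            sum(p.numel() for p in pointwise.parameters())
std_params = sum(p.numel() for p in standard.parameters())
print(f"depthwise separable: {dw_params:,}")   # 2,336
print(f"standard conv:       {std_params:,}")  # 18,432 (~8x more)
```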
EfficientNet (2019, Tan) — Compound Scaling
EfficientNet proposes scaling width, depth, and resolution together using a compound coefficient, achieving the best accuracy-efficiency tradeoff at the time.
# EfficientNet scaling coefficients
efficientnet_params = {
'b0': (1.0, 1.0, 224, 0.2),
'b1': (1.0, 1.1, 240, 0.2),
'b2': (1.1, 1.2, 260, 0.3),
'b3': (1.2, 1.4, 300, 0.3),
'b4': (1.4, 1.8, 380, 0.4),
'b5': (1.6, 2.2, 456, 0.4),
'b6': (1.8, 2.6, 528, 0.5),
'b7': (2.0, 3.1, 600, 0.5),
}
# (width_coeff, depth_coeff, resolution, dropout_rate)
print("EfficientNet scaling parameters:")
for version, (w, d, r, drop) in efficientnet_params.items():
print(f" B{version[1]}: width={w:.1f}, depth={d:.1f}, res={r}, dropout={drop}")
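In the paper, these coefficients scale the B0 baseline: channel counts are multiplied by the width coefficient and rounded to a multiple of 8, and per-stage repeat counts are multiplied by the depth coefficient and rounded up. The helpers below sketch that logic (the names follow the reference implementation, but treat the details as an approximation):

```python
import math

def round_filters(filters, width_coeff, divisor=8):
    """Scale a channel count by the width coefficient, rounding to a multiple of divisor."""
    filters *= width_coeff
    new_filters = max(divisor, int(filters + divisor / 2) // divisor * divisor)
    if new_filters < 0.9 * filters:  # avoid rounding down by more than 10%
        new_filters += divisor
    return int(new_filters)

def round_repeats(repeats, depth_coeff):
    """Scale a stage's repeat count by the depth coefficient, rounding up."""
    return int(math.ceil(depth_coeff * repeats))

# B0 -> B4 (width 1.4, depth 1.8 from the table above)
print(round_filters(32, 1.4))  # 48
print(round_repeats(3, 1.8))   # 6
```

Applying B4's coefficients turns a 32-channel, 3-repeat stage into a 48-channel, 6-repeat one; resolution and dropout are simply set from the table.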
ConvNeXt (2022, Liu) — A ConvNet for the 2020s
ConvNeXt modernizes the CNN design space by importing ideas from Vision Transformers — large kernels, LayerNorm, GELU, and inverted bottlenecks — achieving Transformer-competitive performance.
import torch
import torch.nn as nn
class ConvNeXtBlock(nn.Module):
"""ConvNeXt block"""
def __init__(self, dim, layer_scale_init_value=1e-6):
super(ConvNeXtBlock, self).__init__()
# Depthwise Conv with large kernel (7x7)
self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
# LayerNorm
self.norm = nn.LayerNorm(dim, eps=1e-6)
# Inverted Bottleneck (4x channel expansion)
self.pwconv1 = nn.Linear(dim, 4 * dim)
self.act = nn.GELU()
self.pwconv2 = nn.Linear(4 * dim, dim)
# Layer Scale
self.gamma = nn.Parameter(
layer_scale_init_value * torch.ones(dim),
requires_grad=True
) if layer_scale_init_value > 0 else None
def forward(self, x):
identity = x
x = self.dwconv(x)
# (N, C, H, W) -> (N, H, W, C) for LayerNorm
x = x.permute(0, 2, 3, 1)
x = self.norm(x)
x = self.pwconv1(x)
x = self.act(x)
x = self.pwconv2(x)
if self.gamma is not None:
x = self.gamma * x
# (N, H, W, C) -> (N, C, H, W)
x = x.permute(0, 3, 1, 2)
return identity + x
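One subtle point in the block above is the permute around LayerNorm: nn.LayerNorm normalizes over the last dimension, so the tensor must be in channels-last layout first. A minimal standalone check of that pattern:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 96, 14, 14)    # (N, C, H, W)
norm = nn.LayerNorm(96, eps=1e-6)

y = norm(x.permute(0, 2, 3, 1))   # to (N, H, W, C): normalize over the channel dim
y = y.permute(0, 3, 1, 2)         # back to (N, C, H, W)

print(y.shape)                    # torch.Size([2, 96, 14, 14])
print(y.mean(dim=1).abs().max())  # ~0: each spatial position is normalized across C
```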
3. Vision Transformer (ViT)
ViT splits images into patches and applies a Transformer, treating each patch as a token — a fundamentally different paradigm from traditional CNNs.
import torch
import torch.nn as nn
class PatchEmbedding(nn.Module):
"""Convert image to patch embeddings"""
def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
super(PatchEmbedding, self).__init__()
self.num_patches = (image_size // patch_size) ** 2
# Single convolution performs patch extraction and embedding
self.projection = nn.Conv2d(
in_channels, embed_dim,
kernel_size=patch_size, stride=patch_size
)
def forward(self, x):
x = self.projection(x) # (B, embed_dim, H/p, W/p)
x = x.flatten(2) # (B, embed_dim, num_patches)
x = x.transpose(1, 2) # (B, num_patches, embed_dim)
return x
class MultiHeadSelfAttention(nn.Module):
"""Multi-head self-attention"""
def __init__(self, embed_dim, num_heads, dropout=0.0):
super(MultiHeadSelfAttention, self).__init__()
self.num_heads = num_heads
self.head_dim = embed_dim // num_heads
self.scale = self.head_dim ** -0.5
self.qkv = nn.Linear(embed_dim, embed_dim * 3)
self.proj = nn.Linear(embed_dim, embed_dim)
self.dropout = nn.Dropout(dropout)
def forward(self, x):
B, N, C = x.shape
qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
qkv = qkv.permute(2, 0, 3, 1, 4)
q, k, v = qkv.unbind(0)
attn = (q @ k.transpose(-2, -1)) * self.scale
attn = attn.softmax(dim=-1)
attn = self.dropout(attn)
x = (attn @ v).transpose(1, 2).reshape(B, N, C)
x = self.proj(x)
return x
class TransformerBlock(nn.Module):
"""Transformer block"""
def __init__(self, embed_dim, num_heads, mlp_ratio=4.0, dropout=0.0):
super(TransformerBlock, self).__init__()
self.norm1 = nn.LayerNorm(embed_dim)
self.attn = MultiHeadSelfAttention(embed_dim, num_heads, dropout)
self.norm2 = nn.LayerNorm(embed_dim)
mlp_hidden = int(embed_dim * mlp_ratio)
self.mlp = nn.Sequential(
nn.Linear(embed_dim, mlp_hidden),
nn.GELU(),
nn.Dropout(dropout),
nn.Linear(mlp_hidden, embed_dim),
nn.Dropout(dropout)
)
def forward(self, x):
x = x + self.attn(self.norm1(x)) # residual
x = x + self.mlp(self.norm2(x)) # residual
return x
class VisionTransformer(nn.Module):
"""Vision Transformer (ViT)"""
def __init__(self, image_size=224, patch_size=16, in_channels=3,
num_classes=1000, embed_dim=768, depth=12, num_heads=12,
mlp_ratio=4.0, dropout=0.0):
super(VisionTransformer, self).__init__()
self.patch_embed = PatchEmbedding(image_size, patch_size, in_channels, embed_dim)
num_patches = self.patch_embed.num_patches
# CLS token + positional embedding
self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
self.pos_embedding = nn.Parameter(
torch.zeros(1, num_patches + 1, embed_dim)
)
self.pos_dropout = nn.Dropout(dropout)
# Transformer blocks
self.blocks = nn.Sequential(*[
TransformerBlock(embed_dim, num_heads, mlp_ratio, dropout)
for _ in range(depth)
])
self.norm = nn.LayerNorm(embed_dim)
self.head = nn.Linear(embed_dim, num_classes)
self._init_weights()
def _init_weights(self):
nn.init.trunc_normal_(self.pos_embedding, std=0.02)
nn.init.trunc_normal_(self.cls_token, std=0.02)
for m in self.modules():
if isinstance(m, nn.Linear):
nn.init.trunc_normal_(m.weight, std=0.02)
if m.bias is not None:
nn.init.zeros_(m.bias)
def forward(self, x):
B = x.shape[0]
x = self.patch_embed(x) # (B, num_patches, embed_dim)
cls_tokens = self.cls_token.expand(B, -1, -1)
x = torch.cat([cls_tokens, x], dim=1) # prepend CLS token
x = x + self.pos_embedding
x = self.pos_dropout(x)
x = self.blocks(x)
x = self.norm(x)
cls_output = x[:, 0]
return self.head(cls_output)
def vit_base(num_classes=1000):
return VisionTransformer(
image_size=224, patch_size=16, embed_dim=768, depth=12,
num_heads=12, num_classes=num_classes
)
model = vit_base()
x = torch.randn(2, 3, 224, 224)
out = model(x)
params = sum(p.numel() for p in model.parameters())
print(f"ViT-Base output: {out.shape}, parameters: {params:,}")
4. Object Detection: YOLO
YOLO (You Only Look Once) frames detection as a single forward pass: at every grid cell, the network regresses bounding-box offsets, an objectness score, and class probabilities for each anchor.
import torch
import torch.nn as nn
class YOLOHead(nn.Module):
"""Simplified YOLO detection head"""
def __init__(self, in_channels, num_anchors, num_classes):
super(YOLOHead, self).__init__()
self.num_anchors = num_anchors
self.num_classes = num_classes
# Predict: (x, y, w, h, objectness, num_classes) * num_anchors
out_channels = num_anchors * (5 + num_classes)
self.head = nn.Sequential(
nn.Conv2d(in_channels, in_channels * 2, kernel_size=3, padding=1),
nn.BatchNorm2d(in_channels * 2),
nn.LeakyReLU(0.1),
nn.Conv2d(in_channels * 2, out_channels, kernel_size=1)
)
def forward(self, x):
out = self.head(x)
B, C, H, W = out.shape
out = out.reshape(B, self.num_anchors, 5 + self.num_classes, H, W)
out = out.permute(0, 1, 3, 4, 2).contiguous()
return out
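The reshape at the end is what gives the head its interpretable layout. A standalone check of the bookkeeping, with hypothetical sizes (3 anchors, 20 classes, a 13x13 grid):

```python
import torch

num_anchors, num_classes = 3, 20
B, H, W = 2, 13, 13

raw = torch.randn(B, num_anchors * (5 + num_classes), H, W)  # conv output
out = raw.reshape(B, num_anchors, 5 + num_classes, H, W)
out = out.permute(0, 1, 3, 4, 2).contiguous()                # (B, A, H, W, 5+C)

# Last dimension splits into box coords, objectness, and class scores
box, objectness, class_scores = out[..., :4], out[..., 4], out[..., 5:]
print(out.shape)           # torch.Size([2, 3, 13, 13, 25])
print(class_scores.shape)  # torch.Size([2, 3, 13, 13, 20])
```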
5. Image Segmentation: U-Net
U-Net pairs a contracting encoder with a symmetric expanding decoder, concatenating encoder features into the decoder via skip connections so the spatial detail lost during downsampling can be recovered.
import torch
import torch.nn as nn
import torch.nn.functional as F
class DoubleConv(nn.Module):
"""U-Net double convolution block"""
def __init__(self, in_channels, out_channels):
super(DoubleConv, self).__init__()
self.double_conv = nn.Sequential(
nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
nn.BatchNorm2d(out_channels),
nn.ReLU(inplace=True),
nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
nn.BatchNorm2d(out_channels),
nn.ReLU(inplace=True)
)
def forward(self, x):
return self.double_conv(x)
class UNet(nn.Module):
"""U-Net for medical image segmentation"""
def __init__(self, in_channels=1, num_classes=2, features=[64, 128, 256, 512]):
super(UNet, self).__init__()
self.encoders = nn.ModuleList()
self.decoders = nn.ModuleList()
self.pool = nn.MaxPool2d(2, 2)
# Encoder path
for feature in features:
self.encoders.append(DoubleConv(in_channels, feature))
in_channels = feature
# Bottleneck
self.bottleneck = DoubleConv(features[-1], features[-1] * 2)
# Decoder path
for feature in reversed(features):
self.decoders.append(
nn.ConvTranspose2d(feature * 2, feature, kernel_size=2, stride=2)
)
self.decoders.append(DoubleConv(feature * 2, feature))
self.final_conv = nn.Conv2d(features[0], num_classes, kernel_size=1)
def forward(self, x):
skip_connections = []
# Encoder
for encoder in self.encoders:
x = encoder(x)
skip_connections.append(x)
x = self.pool(x)
x = self.bottleneck(x)
skip_connections = skip_connections[::-1]
# Decoder
for i in range(0, len(self.decoders), 2):
x = self.decoders[i](x)
skip = skip_connections[i // 2]
if x.shape != skip.shape:
x = F.interpolate(x, size=skip.shape[2:])
x = torch.cat([skip, x], dim=1) # Skip connection
x = self.decoders[i + 1](x)
return self.final_conv(x)
model = UNet(in_channels=1, num_classes=2)
x = torch.randn(4, 1, 572, 572)
out = model(x)
print(f"U-Net output: {out.shape}") # (4, 2, 572, 572)
6. Transfer Learning in Practice
Using torchvision.models
import torch
import torch.nn as nn
import torchvision.models as models
import torch.optim as optim
from tqdm import tqdm
# Load pretrained models
model_resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model_efficientnet = models.efficientnet_b4(weights=models.EfficientNet_B4_Weights.DEFAULT)
model_vit = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
def feature_extraction(num_classes, freeze=True):
"""Feature extraction: freeze backbone, train only classifier"""
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
if freeze:
for param in model.parameters():
param.requires_grad = False
# Replace classifier
in_features = model.fc.in_features
model.fc = nn.Sequential(
nn.Dropout(0.5),
nn.Linear(in_features, 256),
nn.ReLU(),
nn.Linear(256, num_classes)
)
for param in model.fc.parameters():
param.requires_grad = True
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,} ({trainable/total:.1%})")
return model
def fine_tuning(num_classes, unfreeze_layers=2):
"""Fine-tuning: unfreeze last few layers"""
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for param in model.parameters():
param.requires_grad = False
layers = [model.layer4, model.avgpool, model.fc]
for layer in layers[-unfreeze_layers:]:
for param in layer.parameters():
param.requires_grad = True
model.fc = nn.Linear(model.fc.in_features, num_classes)
return model
def train_model(model, train_loader, val_loader, epochs=10,
learning_rate=1e-3, device='cuda'):
model = model.to(device)
criterion = nn.CrossEntropyLoss()
backbone_params = [p for n, p in model.named_parameters()
if 'fc' not in n and p.requires_grad]
head_params = [p for n, p in model.named_parameters()
if 'fc' in n and p.requires_grad]
optimizer = optim.AdamW([
{'params': backbone_params, 'lr': learning_rate * 0.1},
{'params': head_params, 'lr': learning_rate}
], weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
best_val_acc = 0.0
for epoch in range(epochs):
model.train()
train_correct, train_total = 0, 0
for images, labels in tqdm(train_loader, desc=f'Epoch {epoch+1}/{epochs}'):
images, labels = images.to(device), labels.to(device)
optimizer.zero_grad()
outputs = model(images)
loss = criterion(outputs, labels)
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
train_correct += (outputs.argmax(1) == labels).sum().item()
train_total += images.size(0)
model.eval()
val_correct, val_total = 0, 0
with torch.no_grad():
for images, labels in val_loader:
images, labels = images.to(device), labels.to(device)
outputs = model(images)
val_correct += (outputs.argmax(1) == labels).sum().item()
val_total += images.size(0)
scheduler.step()
val_acc = val_correct / val_total
print(f"Epoch {epoch+1}: Train={train_correct/train_total:.4f}, Val={val_acc:.4f}")
if val_acc > best_val_acc:
best_val_acc = val_acc
torch.save(model.state_dict(), 'best_model.pt')
print(f"Best validation accuracy: {best_val_acc:.4f}")
return model
# Data augmentation
from torchvision import transforms
def get_transforms(image_size=224):
train_transforms = transforms.Compose([
transforms.RandomResizedCrop(image_size),
transforms.RandomHorizontalFlip(),
transforms.RandomRotation(15),
transforms.ColorJitter(brightness=0.2, contrast=0.2,
saturation=0.2, hue=0.1),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
])
val_transforms = transforms.Compose([
transforms.Resize(int(image_size * 1.14)),
transforms.CenterCrop(image_size),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
])
return train_transforms, val_transforms
Architecture Performance Comparison
| Model | Year | Top-1 Accuracy | Parameters | FLOPs |
|---|---|---|---|---|
| LeNet-5 | 1998 | ~99% (MNIST) | 60K | - |
| AlexNet | 2012 | 56.5% | 61M | 724M |
| VGG-16 | 2014 | 71.6% | 138M | 15.5G |
| GoogLeNet | 2014 | 68.7% | 6.8M | 1.5G |
| ResNet-50 | 2015 | 75.3% | 25M | 4.1G |
| DenseNet-121 | 2017 | 74.4% | 8M | 2.9G |
| MobileNetV2 | 2018 | 71.8% | 3.4M | 300M |
| EfficientNet-B0 | 2019 | 77.1% | 5.3M | 390M |
| ConvNeXt-T | 2022 | 82.1% | 28M | 4.5G |
| ViT-B/16 | 2020 | 81.8% | 86M | 17.6G |
Conclusion
CNN architectures have undergone remarkable evolution:
- LeNet (1998): First practical CNN, establishing the foundational structure
- AlexNet (2012): Deep learning renaissance, introduced ReLU and Dropout
- VGGNet (2014): The power of 3x3 convolutions, proving depth matters
- ResNet (2015): Residual connections solved the vanishing gradient problem
- DenseNet (2017): Dense connections maximized feature reuse
- MobileNet (2017): Depthwise separable convolutions enabled mobile deployment
- EfficientNet (2019): Compound scaling achieved state-of-the-art efficiency
- ConvNeXt (2022): Modernized CNN design with Transformer-inspired principles
- ViT (2020): Treating images as sequences opened a new paradigm
In practice, start from torchvision's pretrained models and apply transfer learning to quickly adapt to your target task.
References
- PyTorch Vision Models
- ResNet paper: He et al., "Deep Residual Learning for Image Recognition" (arXiv:1512.03385)
- EfficientNet paper: Tan and Le, "EfficientNet: Rethinking Model Scaling" (arXiv:1905.11946)
- ViT paper: Dosovitskiy et al., "An Image is Worth 16x16 Words" (arXiv:2010.11929)
- ConvNeXt paper: Liu et al., "A ConvNet for the 2020s" (arXiv:2201.03545)