CNN Architecture Complete Guide: From LeNet to EfficientNet and Vision Transformers
- Author: Youngju Kim (@fjvbn20031)
Convolutional Neural Networks (CNNs) are the backbone of the computer vision revolution. From LeNet in 1998 to Vision Transformers in 2020 and ConvNeXt in 2022, CNN architectures have evolved at a remarkable pace. This guide walks through the structural innovations behind each major CNN architecture and shows how to implement them in PyTorch.
1. CNN Fundamentals
Understanding Convolution Intuitively
Convolution is an operation that extracts local patterns from an image. A small filter (kernel) slides over the image to produce a feature map.
Input image (5x5) Kernel (3x3) Output feature map (3x3)
1 1 1 0 0 1 0 1 4 3 4
0 1 1 1 0 * 0 1 0 = 2 4 3
0 0 1 1 1 1 0 1 2 3 4
0 0 1 1 0
0 1 1 0 0
At each position, the output is the sum of element-wise products between the kernel and the image patch.
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
# Visualize convolution
def visualize_convolution():
image = torch.tensor([[
[1., 1., 1., 0., 0.],
[0., 1., 1., 1., 0.],
[0., 0., 1., 1., 1.],
[0., 0., 1., 1., 0.],
[0., 1., 1., 0., 0.]
]]).unsqueeze(0) # (1, 1, 5, 5)
# Edge detection kernel
edge_kernel = torch.tensor([[
[[-1., -1., -1.],
[-1., 8., -1.],
[-1., -1., -1.]]
]]) # (1, 1, 3, 3)
output = F.conv2d(image, edge_kernel, padding=1)
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].imshow(image[0, 0].numpy(), cmap='gray')
axes[0].set_title('Input Image')
axes[1].imshow(edge_kernel[0, 0].numpy(), cmap='RdYlBu')
axes[1].set_title('Edge Detection Kernel')
axes[2].imshow(output[0, 0].detach().numpy(), cmap='gray')
axes[2].set_title('Output Feature Map')
plt.tight_layout()
plt.show()
Kernel, Stride, and Padding
import torch
import torch.nn as nn
# Basic Conv2d parameters
conv = nn.Conv2d(
in_channels=3, # number of input channels (RGB=3)
out_channels=64, # number of output channels (number of filters)
kernel_size=3, # kernel size (3x3)
stride=1, # stride
padding=1, # padding (same padding)
bias=True
)
# Output size formula
# H_out = floor((H_in + 2*padding - kernel_size) / stride + 1)
def calc_output_size(input_size, kernel_size, stride, padding):
return (input_size + 2 * padding - kernel_size) // stride + 1
print(calc_output_size(224, 3, 1, 1)) # 224 (same padding)
print(calc_output_size(224, 3, 2, 1)) # 112 (stride 2, halves size)
print(calc_output_size(224, 7, 2, 3)) # 112 (ResNet stem: 7x7 conv, stride 2)
# Parameter count
# Conv2d: (kernel_h * kernel_w * in_channels + 1) * out_channels
params = (3 * 3 * 3 + 1) * 64
print(f"Conv(3->64, 3x3) parameters: {params:,}") # 1,792
Pooling (Max, Average, Global)
import torch
import torch.nn as nn
x = torch.randn(1, 64, 28, 28)
# Max Pooling
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
out_max = max_pool(x) # (1, 64, 14, 14)
# Average Pooling
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
out_avg = avg_pool(x) # (1, 64, 14, 14)
# Global Average Pooling (GAP) - collapses spatial dimensions to 1x1
gap = nn.AdaptiveAvgPool2d(1)
out_gap = gap(x) # (1, 64, 1, 1)
out_gap_flat = out_gap.flatten(1) # (1, 64)
# Adaptive Pooling - specify output size
adaptive = nn.AdaptiveAvgPool2d((7, 7))
out_adaptive = adaptive(x) # (1, 64, 7, 7) regardless of input size
print(f"Input: {x.shape}")
print(f"MaxPool: {out_max.shape}")
print(f"GAP: {out_gap_flat.shape}")
Receptive Field Calculation
def calculate_receptive_field(layers):
"""
Calculate receptive field for each layer.
layers: list of (kernel_size, stride, dilation)
"""
rf = 1
jump = 1
for k, s, d in layers:
effective_k = d * (k - 1) + 1
rf = rf + (effective_k - 1) * jump
jump = jump * s
return rf
# VGG-style (3x3 convolutions only)
vgg_layers = [
(3, 1, 1), # conv1
(3, 1, 1), # conv2
(2, 2, 1), # pool
(3, 1, 1), # conv3
(3, 1, 1), # conv4
(2, 2, 1), # pool
]
rf = calculate_receptive_field(vgg_layers)
print(f"Receptive field after 6 VGG layers: {rf}x{rf} pixels")
# Note: two 3x3 convs = same receptive field as one 5x5
# But parameters are 2*(9*C^2) vs 25*C^2 — two 3x3s use 28% fewer params
2. CNN Architecture History
LeNet-5 (1998, LeCun) — The First Practical CNN
LeNet-5, developed by Yann LeCun in 1998, was the first practical CNN, designed for handwritten digit recognition (MNIST).
Architecture: Input(32x32) -> C1(conv, 6@28x28) -> S2(pool, 6@14x14) -> C3(conv, 16@10x10) -> S4(pool, 16@5x5) -> C5(conv, 120@1x1) -> F6(fc, 84) -> Output(10)
import torch
import torch.nn as nn
class LeNet5(nn.Module):
"""LeNet-5 (with ReLU added to original)"""
def __init__(self, num_classes=10):
super(LeNet5, self).__init__()
self.features = nn.Sequential(
# C1: 1@32x32 -> 6@28x28
nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=0),
nn.Tanh(),
# S2: 6@28x28 -> 6@14x14
nn.AvgPool2d(kernel_size=2, stride=2),
# C3: 6@14x14 -> 16@10x10
nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0),
nn.Tanh(),
# S4: 16@10x10 -> 16@5x5
nn.AvgPool2d(kernel_size=2, stride=2),
# C5: 16@5x5 -> 120@1x1
nn.Conv2d(16, 120, kernel_size=5, stride=1, padding=0),
nn.Tanh(),
)
self.classifier = nn.Sequential(
nn.Linear(120, 84),
nn.Tanh(),
nn.Linear(84, num_classes)
)
def forward(self, x):
x = self.features(x)
x = x.flatten(1)
x = self.classifier(x)
return x
model = LeNet5(num_classes=10)
x = torch.randn(4, 1, 32, 32)
out = model(x)
print(f"LeNet-5 output: {out.shape}") # (4, 10)
total_params = sum(p.numel() for p in model.parameters())
print(f"LeNet-5 total parameters: {total_params:,}") # ~60,000
AlexNet (2012, Krizhevsky) — The Deep Learning Renaissance
AlexNet won the 2012 ImageNet competition with a top-5 error rate of 15.3%, obliterating the previous best of 26.2% and launching the deep learning era.
Key innovations:
- ReLU activation (6x faster training than Tanh)
- Dropout (0.5) to prevent overfitting
- Data augmentation (crops, flips)
- Local Response Normalization (LRN)
- Dual-GPU training
import torch
import torch.nn as nn
class AlexNet(nn.Module):
"""AlexNet implementation"""
def __init__(self, num_classes=1000):
super(AlexNet, self).__init__()
self.features = nn.Sequential(
# Layer 1: 3@224x224 -> 96@55x55
nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),
nn.ReLU(inplace=True),
nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
nn.MaxPool2d(kernel_size=3, stride=2), # 96@27x27
# Layer 2: 96@27x27 -> 256@27x27
nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2),
nn.ReLU(inplace=True),
nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
nn.MaxPool2d(kernel_size=3, stride=2), # 256@13x13
# Layer 3
nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1),
nn.ReLU(inplace=True),
# Layer 4
nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),
nn.ReLU(inplace=True),
# Layer 5: 384@13x13 -> 256@13x13
nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),
nn.ReLU(inplace=True),
nn.MaxPool2d(kernel_size=3, stride=2), # 256@6x6
)
self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
self.classifier = nn.Sequential(
nn.Dropout(p=0.5),
nn.Linear(256 * 6 * 6, 4096),
nn.ReLU(inplace=True),
nn.Dropout(p=0.5),
nn.Linear(4096, 4096),
nn.ReLU(inplace=True),
nn.Linear(4096, num_classes)
)
def forward(self, x):
x = self.features(x)
x = self.avgpool(x)
x = x.flatten(1)
x = self.classifier(x)
return x
model = AlexNet(num_classes=1000)
x = torch.randn(4, 3, 224, 224)
out = model(x)
print(f"AlexNet output: {out.shape}") # (4, 1000)
total_params = sum(p.numel() for p in model.parameters())
print(f"AlexNet parameters: {total_params:,}") # ~61M
VGGNet (2014, Simonyan) — The Power of Depth
VGGNet from Oxford's Visual Geometry Group uses exclusively 3x3 kernels throughout, allowing dramatically increased depth.
Why 3x3?
- Two 3x3 convolutions = same receptive field as one 5x5 (saves 28% of parameters)
- Three 3x3 convolutions = same receptive field as one 7x7 (saves 45% of parameters)
- More non-linear transformations increase representational capacity
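These savings are easy to verify with quick arithmetic; the sketch below counts weights for C input and C output channels, ignoring biases:

```python
# Weight count of a k x k conv with C input and C output channels: k*k*C*C
C = 64
one_5x5 = 5 * 5 * C * C
two_3x3 = 2 * (3 * 3 * C * C)
one_7x7 = 7 * 7 * C * C
three_3x3 = 3 * (3 * 3 * C * C)

print(f"two 3x3 vs one 5x5:   {1 - two_3x3 / one_5x5:.0%} fewer weights")    # 28%
print(f"three 3x3 vs one 7x7: {1 - three_3x3 / one_7x7:.0%} fewer weights")  # 45%
```

The ratios (18/25 and 27/49) are independent of C, which is why the savings hold at every stage of the network.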
import torch
import torch.nn as nn
from typing import List, Union
class VGG(nn.Module):
"""General VGG implementation"""
def __init__(self, features: nn.Module, num_classes: int = 1000, dropout: float = 0.5):
super(VGG, self).__init__()
self.features = features
self.avgpool = nn.AdaptiveAvgPool2d((7, 7))
self.classifier = nn.Sequential(
nn.Linear(512 * 7 * 7, 4096),
nn.ReLU(inplace=True),
nn.Dropout(p=dropout),
nn.Linear(4096, 4096),
nn.ReLU(inplace=True),
nn.Dropout(p=dropout),
nn.Linear(4096, num_classes)
)
self._initialize_weights()
def forward(self, x):
x = self.features(x)
x = self.avgpool(x)
x = x.flatten(1)
x = self.classifier(x)
return x
def _initialize_weights(self):
for m in self.modules():
if isinstance(m, nn.Conv2d):
nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
if m.bias is not None:
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.BatchNorm2d):
nn.init.constant_(m.weight, 1)
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.Linear):
nn.init.normal_(m.weight, 0, 0.01)
nn.init.constant_(m.bias, 0)
def make_layers(cfg: List[Union[str, int]], batch_norm: bool = False) -> nn.Sequential:
layers: List[nn.Module] = []
in_channels = 3
for v in cfg:
if v == 'M':
layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
else:
v = int(v)
conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
if batch_norm:
layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
else:
layers += [conv2d, nn.ReLU(inplace=True)]
in_channels = v
return nn.Sequential(*layers)
cfgs = {
'vgg16': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
'vgg19': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}
def vgg16(num_classes=1000):
return VGG(make_layers(cfgs['vgg16'], batch_norm=True), num_classes=num_classes)
model_vgg16 = vgg16()
x = torch.randn(2, 3, 224, 224)
out = model_vgg16(x)
print(f"VGG-16 output: {out.shape}")
params = sum(p.numel() for p in model_vgg16.parameters())
print(f"VGG-16 parameters: {params:,}") # ~138M
GoogLeNet/Inception (2014, Szegedy) — Multi-scale Parallel Processing
The Inception module's key idea is to process different kernel sizes (1x1, 3x3, 5x5) in parallel, capturing features at multiple scales simultaneously.
import torch
import torch.nn as nn
class InceptionModule(nn.Module):
"""Basic Inception module"""
def __init__(self, in_channels, n1x1, n3x3_reduce, n3x3,
n5x5_reduce, n5x5, pool_proj):
super(InceptionModule, self).__init__()
# 1x1 branch
self.branch1 = nn.Sequential(
nn.Conv2d(in_channels, n1x1, kernel_size=1),
nn.BatchNorm2d(n1x1),
nn.ReLU(inplace=True)
)
# 1x1 bottleneck + 3x3
self.branch2 = nn.Sequential(
nn.Conv2d(in_channels, n3x3_reduce, kernel_size=1),
nn.BatchNorm2d(n3x3_reduce),
nn.ReLU(inplace=True),
nn.Conv2d(n3x3_reduce, n3x3, kernel_size=3, padding=1),
nn.BatchNorm2d(n3x3),
nn.ReLU(inplace=True)
)
# 1x1 bottleneck + 5x5
self.branch3 = nn.Sequential(
nn.Conv2d(in_channels, n5x5_reduce, kernel_size=1),
nn.BatchNorm2d(n5x5_reduce),
nn.ReLU(inplace=True),
nn.Conv2d(n5x5_reduce, n5x5, kernel_size=5, padding=2),
nn.BatchNorm2d(n5x5),
nn.ReLU(inplace=True)
)
# MaxPool + 1x1
self.branch4 = nn.Sequential(
nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
nn.Conv2d(in_channels, pool_proj, kernel_size=1),
nn.BatchNorm2d(pool_proj),
nn.ReLU(inplace=True)
)
def forward(self, x):
b1 = self.branch1(x)
b2 = self.branch2(x)
b3 = self.branch3(x)
b4 = self.branch4(x)
return torch.cat([b1, b2, b3, b4], dim=1)
module = InceptionModule(192, 64, 96, 128, 16, 32, 32)
x = torch.randn(2, 192, 28, 28)
out = module(x)
print(f"Inception output: {out.shape}") # (2, 256, 28, 28)
ResNet (2015, He) — Solving the Vanishing Gradient with Residual Connections
ResNet, introduced by Kaiming He et al. in 2015, uses skip connections that let gradients flow unimpeded through very deep networks, enabling training of networks with 152 layers.
Core idea: H(x) = F(x) + x
Instead of learning H(x) directly, each layer learns the residual F(x) = H(x) - x. When the optimal mapping is close to the identity, driving F(x) toward zero is much easier.
import torch
import torch.nn as nn
from typing import Optional, Type, List
class BasicBlock(nn.Module):
"""Basic block for ResNet-18/34"""
expansion = 1
def __init__(self, in_channels, out_channels, stride=1, downsample=None):
super(BasicBlock, self).__init__()
self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
stride=stride, padding=1, bias=False)
self.bn1 = nn.BatchNorm2d(out_channels)
self.relu = nn.ReLU(inplace=True)
self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
stride=1, padding=1, bias=False)
self.bn2 = nn.BatchNorm2d(out_channels)
self.downsample = downsample
def forward(self, x):
identity = x
out = self.relu(self.bn1(self.conv1(x)))
out = self.bn2(self.conv2(out))
if self.downsample is not None:
identity = self.downsample(x)
out += identity # The residual connection
out = self.relu(out)
return out
class Bottleneck(nn.Module):
"""Bottleneck block for ResNet-50/101/152"""
expansion = 4
def __init__(self, in_channels, out_channels, stride=1, downsample=None):
super(Bottleneck, self).__init__()
# 1x1 (reduce channels)
self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
self.bn1 = nn.BatchNorm2d(out_channels)
# 3x3 (spatial processing)
self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
stride=stride, padding=1, bias=False)
self.bn2 = nn.BatchNorm2d(out_channels)
# 1x1 (expand channels: out_channels * 4)
self.conv3 = nn.Conv2d(out_channels, out_channels * self.expansion,
kernel_size=1, bias=False)
self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)
self.relu = nn.ReLU(inplace=True)
self.downsample = downsample
def forward(self, x):
identity = x
out = self.relu(self.bn1(self.conv1(x)))
out = self.relu(self.bn2(self.conv2(out)))
out = self.bn3(self.conv3(out))
if self.downsample is not None:
identity = self.downsample(x)
out += identity
out = self.relu(out)
return out
class ResNet(nn.Module):
"""Complete ResNet implementation"""
def __init__(self, block, layers, num_classes=1000):
super(ResNet, self).__init__()
self.in_channels = 64
# Stem
self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
self.bn1 = nn.BatchNorm2d(64)
self.relu = nn.ReLU(inplace=True)
self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
# 4 stages
self.layer1 = self._make_layer(block, 64, layers[0])
self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
# Classifier
self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
self.fc = nn.Linear(512 * block.expansion, num_classes)
self._initialize_weights()
def _make_layer(self, block, out_channels, blocks, stride=1):
downsample = None
if stride != 1 or self.in_channels != out_channels * block.expansion:
downsample = nn.Sequential(
nn.Conv2d(self.in_channels, out_channels * block.expansion,
kernel_size=1, stride=stride, bias=False),
nn.BatchNorm2d(out_channels * block.expansion)
)
layers = [block(self.in_channels, out_channels, stride, downsample)]
self.in_channels = out_channels * block.expansion
for _ in range(1, blocks):
layers.append(block(self.in_channels, out_channels))
return nn.Sequential(*layers)
def _initialize_weights(self):
for m in self.modules():
if isinstance(m, nn.Conv2d):
nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
elif isinstance(m, nn.BatchNorm2d):
nn.init.constant_(m.weight, 1)
nn.init.constant_(m.bias, 0)
def forward(self, x):
x = self.maxpool(self.relu(self.bn1(self.conv1(x))))
x = self.layer1(x)
x = self.layer2(x)
x = self.layer3(x)
x = self.layer4(x)
x = self.avgpool(x)
x = x.flatten(1)
x = self.fc(x)
return x
def resnet18(num_classes=1000):
return ResNet(BasicBlock, [2, 2, 2, 2], num_classes)
def resnet34(num_classes=1000):
return ResNet(BasicBlock, [3, 4, 6, 3], num_classes)
def resnet50(num_classes=1000):
return ResNet(Bottleneck, [3, 4, 6, 3], num_classes)
def resnet101(num_classes=1000):
return ResNet(Bottleneck, [3, 4, 23, 3], num_classes)
def resnet152(num_classes=1000):
return ResNet(Bottleneck, [3, 8, 36, 3], num_classes)
# Test
for name, model_fn in [('ResNet-18', resnet18), ('ResNet-50', resnet50)]:
model = model_fn()
x = torch.randn(2, 3, 224, 224)
out = model(x)
params = sum(p.numel() for p in model.parameters())
print(f"{name}: output={out.shape}, params={params:,}")
DenseNet (2017, Huang) — Dense Connectivity
DenseNet connects each layer to every previous layer. With L layers, ResNet has L connections but DenseNet has L(L+1)/2 connections.
import torch
import torch.nn as nn
import torch.nn.functional as F
class DenseLayer(nn.Module):
"""A single DenseNet layer"""
def __init__(self, in_channels, growth_rate, bn_size=4, drop_rate=0.0):
super(DenseLayer, self).__init__()
# Bottleneck: 1x1 conv to limit channels
self.norm1 = nn.BatchNorm2d(in_channels)
self.relu1 = nn.ReLU(inplace=True)
self.conv1 = nn.Conv2d(in_channels, bn_size * growth_rate, kernel_size=1, bias=False)
# 3x3 conv
self.norm2 = nn.BatchNorm2d(bn_size * growth_rate)
self.relu2 = nn.ReLU(inplace=True)
self.conv2 = nn.Conv2d(bn_size * growth_rate, growth_rate,
kernel_size=3, padding=1, bias=False)
self.drop_rate = drop_rate
def forward(self, x):
if isinstance(x, torch.Tensor):
prev_features = [x]
else:
prev_features = x
# Concat all previous feature maps
concat_input = torch.cat(prev_features, dim=1)
out = self.conv1(self.relu1(self.norm1(concat_input)))
out = self.conv2(self.relu2(self.norm2(out)))
if self.drop_rate > 0:
out = F.dropout(out, p=self.drop_rate, training=self.training)
return out
class DenseBlock(nn.Module):
"""Dense Block composed of multiple DenseLayers"""
def __init__(self, num_layers, in_channels, growth_rate, bn_size=4, drop_rate=0.0):
super(DenseBlock, self).__init__()
self.layers = nn.ModuleList()
for i in range(num_layers):
layer = DenseLayer(
in_channels + i * growth_rate,
growth_rate, bn_size, drop_rate
)
self.layers.append(layer)
def forward(self, x):
features = [x]
for layer in self.layers:
new_feat = layer(features)
features.append(new_feat)
return torch.cat(features, dim=1)
class TransitionLayer(nn.Module):
"""Transition layer between Dense Blocks (compression + downsampling)"""
def __init__(self, in_channels, out_channels):
super(TransitionLayer, self).__init__()
self.norm = nn.BatchNorm2d(in_channels)
self.relu = nn.ReLU(inplace=True)
self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
self.pool = nn.AvgPool2d(kernel_size=2, stride=2)
def forward(self, x):
return self.pool(self.conv(self.relu(self.norm(x))))
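Putting the pieces together, channel counts follow simple bookkeeping: each dense block adds num_layers * growth_rate channels, and each transition layer halves them. A standalone sketch assuming the standard DenseNet-121 configuration (growth rate 32, blocks of 6/12/24/16 layers, 0.5 compression):

```python
# Channel bookkeeping for a DenseNet-121-style network (no modules needed).
growth_rate = 32
channels = 64  # channels after the stem
for i, num_layers in enumerate((6, 12, 24, 16)):
    channels += num_layers * growth_rate        # each DenseLayer appends growth_rate maps
    print(f"block {i + 1} ({num_layers} layers): {channels} channels")
    if i < 3:
        channels //= 2                          # TransitionLayer halves the channels
print(f"final feature channels: {channels}")    # 1024, matching DenseNet-121
```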
MobileNet (2017, Howard) — Lightweight for Edge Devices
MobileNet introduced Depthwise Separable Convolutions, drastically reducing computation while maintaining accuracy — ideal for mobile and edge deployment.
import torch
import torch.nn as nn
class DepthwiseSeparableConv(nn.Module):
"""Depthwise Separable Convolution"""
def __init__(self, in_channels, out_channels, stride=1):
super(DepthwiseSeparableConv, self).__init__()
# Depthwise: process each input channel independently
self.depthwise = nn.Sequential(
nn.Conv2d(in_channels, in_channels, kernel_size=3,
stride=stride, padding=1, groups=in_channels, bias=False),
nn.BatchNorm2d(in_channels),
nn.ReLU6(inplace=True)
)
# Pointwise: 1x1 conv to combine channels
self.pointwise = nn.Sequential(
nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
nn.BatchNorm2d(out_channels),
nn.ReLU6(inplace=True)
)
def forward(self, x):
x = self.depthwise(x)
x = self.pointwise(x)
return x
class InvertedResidual(nn.Module):
"""MobileNetV2 inverted residual block"""
def __init__(self, in_channels, out_channels, stride, expand_ratio):
super(InvertedResidual, self).__init__()
self.stride = stride
hidden_dim = int(in_channels * expand_ratio)
self.use_res_connect = (stride == 1 and in_channels == out_channels)
layers = []
if expand_ratio != 1:
layers += [
nn.Conv2d(in_channels, hidden_dim, 1, bias=False),
nn.BatchNorm2d(hidden_dim),
nn.ReLU6(inplace=True)
]
layers += [
nn.Conv2d(hidden_dim, hidden_dim, 3, stride=stride,
padding=1, groups=hidden_dim, bias=False),
nn.BatchNorm2d(hidden_dim),
nn.ReLU6(inplace=True),
nn.Conv2d(hidden_dim, out_channels, 1, bias=False),
nn.BatchNorm2d(out_channels)
]
self.conv = nn.Sequential(*layers)
def forward(self, x):
if self.use_res_connect:
return x + self.conv(x)
else:
return self.conv(x)
# Parameter savings
standard_conv_params = 3 * 3 * 512 * 512 # standard convolution
dw_sep_params = (3 * 3 * 512) + (512 * 512) # depthwise separable
print(f"Standard conv: {standard_conv_params:,}")
print(f"Depthwise Separable: {dw_sep_params:,}")
print(f"Savings: {(1 - dw_sep_params/standard_conv_params):.1%}")
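The same comparison can be made with real modules; `groups=in_channels` is what turns a standard `nn.Conv2d` into a depthwise convolution. A small shape-and-parameter check (the channel sizes here are arbitrary):

```python
import torch
import torch.nn as nn

C_in, C_out = 32, 64
depthwise = nn.Conv2d(C_in, C_in, kernel_size=3, padding=1, groups=C_in, bias=False)
pointwise = nn.Conv2d(C_in, C_out, kernel_size=1, bias=False)
standard = nn.Conv2d(C_in, C_out, kernel_size=3, padding=1, bias=False)

x = torch.randn(1, C_in, 56, 56)
assert pointwise(depthwise(x)).shape == standard(x).shape  # identical output shape

dw_params = sum(p.numel() for p in depthwise.parameters()) + \
            sum(p.numel() for p in pointwise.parameters())
std_params = sum(p.numel() for p in standard.parameters())
print(f"depthwise separable: {dw_params:,}")   # 2,336
print(f"standard conv:       {std_params:,}")  # 18,432 (~8x more)
```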
EfficientNet (2019, Tan) — Compound Scaling
EfficientNet proposes scaling width, depth, and resolution together using a compound coefficient, achieving the best accuracy-efficiency tradeoff at the time.
# EfficientNet scaling coefficients
efficientnet_params = {
'b0': (1.0, 1.0, 224, 0.2),
'b1': (1.0, 1.1, 240, 0.2),
'b2': (1.1, 1.2, 260, 0.3),
'b3': (1.2, 1.4, 300, 0.3),
'b4': (1.4, 1.8, 380, 0.4),
'b5': (1.6, 2.2, 456, 0.4),
'b6': (1.8, 2.6, 528, 0.5),
'b7': (2.0, 3.1, 600, 0.5),
}
# (width_coeff, depth_coeff, resolution, dropout_rate)
print("EfficientNet scaling parameters:")
for version, (w, d, r, drop) in efficientnet_params.items():
print(f" B{version[1]}: width={w:.1f}, depth={d:.1f}, res={r}, dropout={drop}")
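In the paper, these coefficients scale the B0 baseline: channel counts are multiplied by the width coefficient and rounded to a multiple of 8, and per-stage repeat counts are multiplied by the depth coefficient and rounded up. The helpers below sketch that logic (the names follow the reference implementation, but treat the details as an approximation):

```python
import math

def round_filters(filters, width_coeff, divisor=8):
    """Scale a channel count by the width coefficient, rounding to a multiple of divisor."""
    filters *= width_coeff
    new_filters = max(divisor, int(filters + divisor / 2) // divisor * divisor)
    if new_filters < 0.9 * filters:  # avoid rounding down by more than 10%
        new_filters += divisor
    return int(new_filters)

def round_repeats(repeats, depth_coeff):
    """Scale a stage's repeat count by the depth coefficient, rounding up."""
    return int(math.ceil(depth_coeff * repeats))

# B0 -> B4 (width 1.4, depth 1.8 from the table above)
print(round_filters(32, 1.4))  # 48
print(round_repeats(3, 1.8))   # 6
```

Applying B4's coefficients turns a 32-channel, 3-repeat stage into a 48-channel, 6-repeat one; resolution and dropout are simply set from the table.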
ConvNeXt (2022, Liu) — A ConvNet for the 2020s
ConvNeXt modernizes the CNN design space by importing ideas from Vision Transformers — large kernels, LayerNorm, GELU, and inverted bottlenecks — achieving Transformer-competitive performance.
import torch
import torch.nn as nn
class ConvNeXtBlock(nn.Module):
"""ConvNeXt block"""
def __init__(self, dim, layer_scale_init_value=1e-6):
super(ConvNeXtBlock, self).__init__()
# Depthwise Conv with large kernel (7x7)
self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
# LayerNorm
self.norm = nn.LayerNorm(dim, eps=1e-6)
# Inverted Bottleneck (4x channel expansion)
self.pwconv1 = nn.Linear(dim, 4 * dim)
self.act = nn.GELU()
self.pwconv2 = nn.Linear(4 * dim, dim)
# Layer Scale
self.gamma = nn.Parameter(
layer_scale_init_value * torch.ones(dim),
requires_grad=True
) if layer_scale_init_value > 0 else None
def forward(self, x):
identity = x
x = self.dwconv(x)
# (N, C, H, W) -> (N, H, W, C) for LayerNorm
x = x.permute(0, 2, 3, 1)
x = self.norm(x)
x = self.pwconv1(x)
x = self.act(x)
x = self.pwconv2(x)
if self.gamma is not None:
x = self.gamma * x
# (N, H, W, C) -> (N, C, H, W)
x = x.permute(0, 3, 1, 2)
return identity + x
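One subtle point in the block above is the permute around LayerNorm: nn.LayerNorm normalizes over the last dimension, so the tensor must be in channels-last layout first. A minimal standalone check of that pattern:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 96, 14, 14)    # (N, C, H, W)
norm = nn.LayerNorm(96, eps=1e-6)

y = norm(x.permute(0, 2, 3, 1))   # to (N, H, W, C): normalize over the channel dim
y = y.permute(0, 3, 1, 2)         # back to (N, C, H, W)

print(y.shape)                    # torch.Size([2, 96, 14, 14])
print(y.mean(dim=1).abs().max())  # ~0: each spatial position is normalized across C
```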
3. Vision Transformer (ViT)
ViT splits images into patches and applies a Transformer, treating each patch as a token — a fundamentally different paradigm from traditional CNNs.
import torch
import torch.nn as nn
class PatchEmbedding(nn.Module):
"""Convert image to patch embeddings"""
def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
super(PatchEmbedding, self).__init__()
self.num_patches = (image_size // patch_size) ** 2
# Single convolution performs patch extraction and embedding
self.projection = nn.Conv2d(
in_channels, embed_dim,
kernel_size=patch_size, stride=patch_size
)
def forward(self, x):
x = self.projection(x) # (B, embed_dim, H/p, W/p)
x = x.flatten(2) # (B, embed_dim, num_patches)
x = x.transpose(1, 2) # (B, num_patches, embed_dim)
return x
class MultiHeadSelfAttention(nn.Module):
"""Multi-head self-attention"""
def __init__(self, embed_dim, num_heads, dropout=0.0):
super(MultiHeadSelfAttention, self).__init__()
self.num_heads = num_heads
self.head_dim = embed_dim // num_heads
self.scale = self.head_dim ** -0.5
self.qkv = nn.Linear(embed_dim, embed_dim * 3)
self.proj = nn.Linear(embed_dim, embed_dim)
self.dropout = nn.Dropout(dropout)
def forward(self, x):
B, N, C = x.shape
qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
qkv = qkv.permute(2, 0, 3, 1, 4)
q, k, v = qkv.unbind(0)
attn = (q @ k.transpose(-2, -1)) * self.scale
attn = attn.softmax(dim=-1)
attn = self.dropout(attn)
x = (attn @ v).transpose(1, 2).reshape(B, N, C)
x = self.proj(x)
return x
class TransformerBlock(nn.Module):
"""Transformer block"""
def __init__(self, embed_dim, num_heads, mlp_ratio=4.0, dropout=0.0):
super(TransformerBlock, self).__init__()
self.norm1 = nn.LayerNorm(embed_dim)
self.attn = MultiHeadSelfAttention(embed_dim, num_heads, dropout)
self.norm2 = nn.LayerNorm(embed_dim)
mlp_hidden = int(embed_dim * mlp_ratio)
self.mlp = nn.Sequential(
nn.Linear(embed_dim, mlp_hidden),
nn.GELU(),
nn.Dropout(dropout),
nn.Linear(mlp_hidden, embed_dim),
nn.Dropout(dropout)
)
def forward(self, x):
x = x + self.attn(self.norm1(x)) # residual
x = x + self.mlp(self.norm2(x)) # residual
return x
class VisionTransformer(nn.Module):
"""Vision Transformer (ViT)"""
def __init__(self, image_size=224, patch_size=16, in_channels=3,
num_classes=1000, embed_dim=768, depth=12, num_heads=12,
mlp_ratio=4.0, dropout=0.0):
super(VisionTransformer, self).__init__()
self.patch_embed = PatchEmbedding(image_size, patch_size, in_channels, embed_dim)
num_patches = self.patch_embed.num_patches
# CLS token + positional embedding
self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
self.pos_embedding = nn.Parameter(
torch.zeros(1, num_patches + 1, embed_dim)
)
self.pos_dropout = nn.Dropout(dropout)
# Transformer blocks
self.blocks = nn.Sequential(*[
TransformerBlock(embed_dim, num_heads, mlp_ratio, dropout)
for _ in range(depth)
])
self.norm = nn.LayerNorm(embed_dim)
self.head = nn.Linear(embed_dim, num_classes)
self._init_weights()
def _init_weights(self):
nn.init.trunc_normal_(self.pos_embedding, std=0.02)
nn.init.trunc_normal_(self.cls_token, std=0.02)
for m in self.modules():
if isinstance(m, nn.Linear):
nn.init.trunc_normal_(m.weight, std=0.02)
if m.bias is not None:
nn.init.zeros_(m.bias)
def forward(self, x):
B = x.shape[0]
x = self.patch_embed(x) # (B, num_patches, embed_dim)
cls_tokens = self.cls_token.expand(B, -1, -1)
x = torch.cat([cls_tokens, x], dim=1) # prepend CLS token
x = x + self.pos_embedding
x = self.pos_dropout(x)
x = self.blocks(x)
x = self.norm(x)
cls_output = x[:, 0]
return self.head(cls_output)
def vit_base(num_classes=1000):
return VisionTransformer(
image_size=224, patch_size=16, embed_dim=768, depth=12,
num_heads=12, num_classes=num_classes
)
model = vit_base()
x = torch.randn(2, 3, 224, 224)
out = model(x)
params = sum(p.numel() for p in model.parameters())
print(f"ViT-Base output: {out.shape}, parameters: {params:,}")
4. Object Detection: YOLO
YOLO (You Only Look Once) frames detection as a single forward pass: at every grid cell, the network regresses bounding-box offsets, an objectness score, and class probabilities for each anchor.
import torch
import torch.nn as nn
class YOLOHead(nn.Module):
"""Simplified YOLO detection head"""
def __init__(self, in_channels, num_anchors, num_classes):
super(YOLOHead, self).__init__()
self.num_anchors = num_anchors
self.num_classes = num_classes
# Predict: (x, y, w, h, objectness, num_classes) * num_anchors
out_channels = num_anchors * (5 + num_classes)
self.head = nn.Sequential(
nn.Conv2d(in_channels, in_channels * 2, kernel_size=3, padding=1),
nn.BatchNorm2d(in_channels * 2),
nn.LeakyReLU(0.1),
nn.Conv2d(in_channels * 2, out_channels, kernel_size=1)
)
def forward(self, x):
out = self.head(x)
B, C, H, W = out.shape
out = out.reshape(B, self.num_anchors, 5 + self.num_classes, H, W)
out = out.permute(0, 1, 3, 4, 2).contiguous()
return out
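The reshape at the end is what gives the head its interpretable layout. A standalone check of the bookkeeping, with hypothetical sizes (3 anchors, 20 classes, a 13x13 grid):

```python
import torch

num_anchors, num_classes = 3, 20
B, H, W = 2, 13, 13

raw = torch.randn(B, num_anchors * (5 + num_classes), H, W)  # conv output
out = raw.reshape(B, num_anchors, 5 + num_classes, H, W)
out = out.permute(0, 1, 3, 4, 2).contiguous()                # (B, A, H, W, 5+C)

# Last dimension splits into box coords, objectness, and class scores
box, objectness, class_scores = out[..., :4], out[..., 4], out[..., 5:]
print(out.shape)           # torch.Size([2, 3, 13, 13, 25])
print(class_scores.shape)  # torch.Size([2, 3, 13, 13, 20])
```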
5. Image Segmentation: U-Net
U-Net pairs a contracting encoder with a symmetric expanding decoder, concatenating encoder features into the decoder via skip connections so the spatial detail lost during downsampling can be recovered.
import torch
import torch.nn as nn
import torch.nn.functional as F
class DoubleConv(nn.Module):
"""U-Net double convolution block"""
def __init__(self, in_channels, out_channels):
super(DoubleConv, self).__init__()
self.double_conv = nn.Sequential(
nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
nn.BatchNorm2d(out_channels),
nn.ReLU(inplace=True),
nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
nn.BatchNorm2d(out_channels),
nn.ReLU(inplace=True)
)
def forward(self, x):
return self.double_conv(x)
class UNet(nn.Module):
"""U-Net for medical image segmentation"""
def __init__(self, in_channels=1, num_classes=2, features=[64, 128, 256, 512]):
super(UNet, self).__init__()
self.encoders = nn.ModuleList()
self.decoders = nn.ModuleList()
self.pool = nn.MaxPool2d(2, 2)
# Encoder path
for feature in features:
self.encoders.append(DoubleConv(in_channels, feature))
in_channels = feature
# Bottleneck
self.bottleneck = DoubleConv(features[-1], features[-1] * 2)
# Decoder path
for feature in reversed(features):
self.decoders.append(
nn.ConvTranspose2d(feature * 2, feature, kernel_size=2, stride=2)
)
self.decoders.append(DoubleConv(feature * 2, feature))
self.final_conv = nn.Conv2d(features[0], num_classes, kernel_size=1)
def forward(self, x):
skip_connections = []
# Encoder
for encoder in self.encoders:
x = encoder(x)
skip_connections.append(x)
x = self.pool(x)
x = self.bottleneck(x)
skip_connections = skip_connections[::-1]
# Decoder
for i in range(0, len(self.decoders), 2):
x = self.decoders[i](x)
skip = skip_connections[i // 2]
if x.shape != skip.shape:
x = F.interpolate(x, size=skip.shape[2:])
x = torch.cat([skip, x], dim=1) # Skip connection
x = self.decoders[i + 1](x)
return self.final_conv(x)
model = UNet(in_channels=1, num_classes=2)
x = torch.randn(4, 1, 572, 572)
out = model(x)
print(f"U-Net output: {out.shape}") # (4, 2, 572, 572)
6. Transfer Learning in Practice
Using torchvision.models
import torch
import torch.nn as nn
import torchvision.models as models
import torch.optim as optim
from tqdm import tqdm
# Load pretrained models
model_resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model_efficientnet = models.efficientnet_b4(weights=models.EfficientNet_B4_Weights.DEFAULT)
model_vit = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
def feature_extraction(num_classes, freeze=True):
"""Feature extraction: freeze backbone, train only classifier"""
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
if freeze:
for param in model.parameters():
param.requires_grad = False
# Replace classifier
in_features = model.fc.in_features
model.fc = nn.Sequential(
nn.Dropout(0.5),
nn.Linear(in_features, 256),
nn.ReLU(),
nn.Linear(256, num_classes)
)
for param in model.fc.parameters():
param.requires_grad = True
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,} ({trainable/total:.1%})")
return model
def fine_tuning(num_classes, unfreeze_layers=2):
"""Fine-tuning: unfreeze last few layers"""
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for param in model.parameters():
param.requires_grad = False
layers = [model.layer4, model.avgpool, model.fc]
for layer in layers[-unfreeze_layers:]:
for param in layer.parameters():
param.requires_grad = True
model.fc = nn.Linear(model.fc.in_features, num_classes)
return model
def train_model(model, train_loader, val_loader, epochs=10,
learning_rate=1e-3, device='cuda'):
model = model.to(device)
criterion = nn.CrossEntropyLoss()
backbone_params = [p for n, p in model.named_parameters()
if 'fc' not in n and p.requires_grad]
head_params = [p for n, p in model.named_parameters()
if 'fc' in n and p.requires_grad]
optimizer = optim.AdamW([
{'params': backbone_params, 'lr': learning_rate * 0.1},
{'params': head_params, 'lr': learning_rate}
], weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
best_val_acc = 0.0
for epoch in range(epochs):
model.train()
train_correct, train_total = 0, 0
for images, labels in tqdm(train_loader, desc=f'Epoch {epoch+1}/{epochs}'):
images, labels = images.to(device), labels.to(device)
optimizer.zero_grad()
outputs = model(images)
loss = criterion(outputs, labels)
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
train_correct += (outputs.argmax(1) == labels).sum().item()
train_total += images.size(0)
model.eval()
val_correct, val_total = 0, 0
with torch.no_grad():
for images, labels in val_loader:
images, labels = images.to(device), labels.to(device)
outputs = model(images)
val_correct += (outputs.argmax(1) == labels).sum().item()
val_total += images.size(0)
scheduler.step()
val_acc = val_correct / val_total
print(f"Epoch {epoch+1}: Train={train_correct/train_total:.4f}, Val={val_acc:.4f}")
if val_acc > best_val_acc:
best_val_acc = val_acc
torch.save(model.state_dict(), 'best_model.pt')
print(f"Best validation accuracy: {best_val_acc:.4f}")
return model
# Data augmentation
from torchvision import transforms
def get_transforms(image_size=224):
train_transforms = transforms.Compose([
transforms.RandomResizedCrop(image_size),
transforms.RandomHorizontalFlip(),
transforms.RandomRotation(15),
transforms.ColorJitter(brightness=0.2, contrast=0.2,
saturation=0.2, hue=0.1),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
])
val_transforms = transforms.Compose([
transforms.Resize(int(image_size * 1.14)),
transforms.CenterCrop(image_size),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
])
return train_transforms, val_transforms
Architecture Performance Comparison
| Model | Year | Top-1 Accuracy | Parameters | FLOPs |
|---|---|---|---|---|
| LeNet-5 | 1998 | ~99% (MNIST) | 60K | - |
| AlexNet | 2012 | 56.5% | 61M | 724M |
| VGG-16 | 2014 | 71.6% | 138M | 15.5G |
| GoogLeNet | 2014 | 68.7% | 6.8M | 1.5G |
| ResNet-50 | 2015 | 75.3% | 25M | 4.1G |
| DenseNet-121 | 2017 | 74.4% | 8M | 2.9G |
| MobileNetV2 | 2018 | 71.8% | 3.4M | 300M |
| EfficientNet-B0 | 2019 | 77.1% | 5.3M | 390M |
| ConvNeXt-T | 2022 | 82.1% | 28M | 4.5G |
| ViT-B/16 | 2020 | 81.8% | 86M | 17.6G |
Conclusion
CNN architectures have undergone remarkable evolution:
- LeNet (1998): First practical CNN, establishing the foundational structure
- AlexNet (2012): Deep learning renaissance, introduced ReLU and Dropout
- VGGNet (2014): The power of 3x3 convolutions, proving depth matters
- ResNet (2015): Residual connections solved the vanishing gradient problem
- DenseNet (2017): Dense connections maximized feature reuse
- MobileNet (2017): Depthwise separable convolutions enabled mobile deployment
- EfficientNet (2019): Compound scaling achieved state-of-the-art efficiency
- ConvNeXt (2022): Modernized CNN design with Transformer-inspired principles
- ViT (2020): Treating images as sequences opened a new paradigm
In practice, start from torchvision's pretrained models and apply transfer learning to quickly adapt to your target task.
References
- PyTorch Vision Models
- ResNet paper: He et al., "Deep Residual Learning for Image Recognition" (arXiv:1512.03385)
- EfficientNet paper: Tan and Le, "EfficientNet: Rethinking Model Scaling" (arXiv:1905.11946)
- ViT paper: Dosovitskiy et al., "An Image is Worth 16x16 Words" (arXiv:2010.11929)
- ConvNeXt paper: Liu et al., "A ConvNet for the 2020s" (arXiv:2201.03545)