CNN Architecture Complete Guide

Convolutional Neural Networks (CNNs) are the backbone of the computer vision revolution. From LeNet in 1998 to Vision Transformers in 2020 and ConvNeXt in 2022, vision architectures have evolved at a remarkable pace. This guide walks through the structural innovations behind each major architecture and shows how to implement them in PyTorch.

1. CNN Fundamentals

Understanding Convolution Intuitively

Convolution is an operation that extracts local patterns from an image. A small filter (kernel) slides over the image to produce a feature map.

Input image (5x5)      Kernel (3x3)      Output feature map (3x3)

1 1 1 0 0              1 0 1             4 3 4
0 1 1 1 0       *      0 1 0      =      2 4 3
0 0 1 1 1              1 0 1             2 3 4
0 0 1 1 0
0 1 1 0 0

At each position, the output is the sum of element-wise products between the kernel and the image patch.
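The sliding-window sum can be checked by hand. The sketch below is a minimal pure-Python version (stride 1, no padding) that reproduces the 3x3 feature map shown above. The helper name `conv2d_valid` is illustrative and not a library function.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and sum element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

Note that the kernel is not flipped here. Deep learning frameworks, including PyTorch's `F.conv2d`, compute this cross-correlation and call it convolution.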

# Imports shared by the code in this guide
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F

# Visualize convolution
def visualize_convolution():
    image = torch.tensor([[
        [1., 1., 1., 0., 0.],
        [0., 1., 1., 1., 0.],
        [0., 0., 1., 1., 1.],
        [0., 0., 1., 1., 0.],
        [0., 1., 1., 0., 0.]
    ]]).unsqueeze(0)  # (1, 1, 5, 5)

    # Edge detection kernel
    edge_kernel = torch.tensor([[
        [[-1., -1., -1.],
         [-1., 8., -1.],
         [-1., -1., -1.]]
    ]])  # (1, 1, 3, 3)

    output = F.conv2d(image, edge_kernel, padding=1)

    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    axes[0].imshow(image[0, 0].numpy(), cmap='gray')
    axes[0].set_title('Input Image')
    axes[1].imshow(edge_kernel[0, 0].numpy(), cmap='RdYlBu')
    axes[1].set_title('Edge Detection Kernel')
    axes[2].imshow(output[0, 0].detach().numpy(), cmap='gray')
    axes[2].set_title('Output Feature Map')
    plt.tight_layout()
    plt.show()

Kernel, Stride, and Padding

# Basic Conv2d parameters
conv = nn.Conv2d(
    in_channels=3,    # number of input channels (RGB=3)
    out_channels=64,  # number of output channels (number of filters)
    kernel_size=3,    # kernel size (3x3)
    stride=1,         # stride
    padding=1,        # padding (same padding)
    bias=True
)

# Output size formula:
# H_out = floor((H_in + 2*padding - kernel_size) / stride + 1)
def calc_output_size(input_size, kernel_size, stride, padding):
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(calc_output_size(224, 3, 1, 1))  # 224 (same padding)
print(calc_output_size(224, 3, 2, 1))  # 112 (stride 2, halves size)
print(calc_output_size(224, 7, 2, 3))  # 112 (7x7 stride-2 stem, as in ResNet; AlexNet's first layer is 11x11 with stride 4)

# Parameter count
# Conv2d: (kernel_h * kernel_w * in_channels + 1) * out_channels
params = (3 * 3 * 3 + 1) * 64
print(f"Conv(3->64, 3x3) parameters: {params:,}")  # 1,792

Pooling (Max, Average, Global)

x = torch.randn(1, 64, 28, 28)

# Max Pooling
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
out_max = max_pool(x)  # (1, 64, 14, 14)

# Average Pooling
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
out_avg = avg_pool(x)  # (1, 64, 14, 14)

# Global Average Pooling (GAP) - collapses spatial dimensions to 1x1
gap = nn.AdaptiveAvgPool2d(1)
out_gap = gap(x)  # (1, 64, 1, 1)
out_gap_flat = out_gap.flatten(1)  # (1, 64)

# Adaptive Pooling - specify output size
adaptive = nn.AdaptiveAvgPool2d((7, 7))
out_adaptive = adaptive(x)  # (1, 64, 7, 7) regardless of input size

print(f"Input: {x.shape}")
print(f"MaxPool: {out_max.shape}")
print(f"GAP: {out_gap_flat.shape}")

Receptive Field Calculation

def calculate_receptive_field(layers):
    """
    Calculate the receptive field after a stack of layers.

    layers: list of (kernel_size, stride, dilation)
    """
    rf = 1
    jump = 1
    for k, s, d in layers:
        effective_k = d * (k - 1) + 1
        rf = rf + (effective_k - 1) * jump
        jump = jump * s
    return rf

# VGG-style (3x3 convolutions only)
vgg_layers = [
    (3, 1, 1),  # conv1
    (3, 1, 1),  # conv2
    (2, 2, 1),  # pool
    (3, 1, 1),  # conv3
    (3, 1, 1),  # conv4
    (2, 2, 1),  # pool
]

rf = calculate_receptive_field(vgg_layers)
print(f"Receptive field after 6 VGG layers: {rf}x{rf} pixels")

# Note: two stacked 3x3 convs have the same receptive field as one 5x5,
# but use 2 * 9 * C^2 = 18C^2 weights instead of 25C^2 (28% fewer).

2. CNN Architecture History

LeNet-5 (1998, LeCun) — The First Practical CNN

LeNet-5, developed by Yann LeCun in 1998, was the first practical CNN, designed for handwritten digit recognition (MNIST).

**Architecture**: Input(32x32) -> C1(conv, 6@28x28) -> S2(pool, 6@14x14) -> C3(conv, 16@10x10) -> S4(pool, 16@5x5) -> C5(conv, 120@1x1) -> F6(fc, 84) -> Output(10)

class LeNet5(nn.Module):
    """LeNet-5 variant with Tanh activations and average pooling"""

    def __init__(self, num_classes=10):
        super(LeNet5, self).__init__()
        self.features = nn.Sequential(
            # C1: 1@32x32 -> 6@28x28
            nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=0),
            nn.Tanh(),
            # S2: 6@28x28 -> 6@14x14
            nn.AvgPool2d(kernel_size=2, stride=2),
            # C3: 6@14x14 -> 16@10x10
            nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0),
            nn.Tanh(),
            # S4: 16@10x10 -> 16@5x5
            nn.AvgPool2d(kernel_size=2, stride=2),
            # C5: 16@5x5 -> 120@1x1
            nn.Conv2d(16, 120, kernel_size=5, stride=1, padding=0),
            nn.Tanh(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1)
        x = self.classifier(x)
        return x

model = LeNet5(num_classes=10)
x = torch.randn(4, 1, 32, 32)
out = model(x)
print(f"LeNet-5 output: {out.shape}")  # (4, 10)

total_params = sum(p.numel() for p in model.parameters())
print(f"LeNet-5 total parameters: {total_params:,}")  # 61,706

AlexNet (2012, Krizhevsky) — The Deep Learning Renaissance

AlexNet won the 2012 ImageNet competition (ILSVRC) with a top-5 error rate of 15.3%, far ahead of the runner-up's 26.2%, and launched the deep learning era.

**Key innovations**:

- ReLU activation (the paper reports reaching 25% training error on CIFAR-10 about 6x faster than an equivalent Tanh network)

- Dropout (0.5) to prevent overfitting

- Data augmentation (crops, flips)

- Local Response Normalization (LRN)

- Dual-GPU training

class AlexNet(nn.Module):
    """AlexNet implementation"""

    def __init__(self, num_classes=1000):
        super(AlexNet, self).__init__()
        self.features = nn.Sequential(
            # Layer 1: 3@224x224 -> 96@55x55
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),  # 96@27x27
            # Layer 2: 96@27x27 -> 256@27x27
            nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),  # 256@13x13
            # Layer 3
            nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            # Layer 4
            nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            # Layer 5: 384@13x13 -> 256@13x13
            nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),  # 256@6x6
        )
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = x.flatten(1)
        x = self.classifier(x)
        return x

model = AlexNet(num_classes=1000)
x = torch.randn(4, 3, 224, 224)
out = model(x)
print(f"AlexNet output: {out.shape}")  # (4, 1000)

total_params = sum(p.numel() for p in model.parameters())
print(f"AlexNet parameters: {total_params:,}")  # ~62M with these channel widths

VGGNet (2014, Simonyan) — The Power of Depth

VGGNet from Oxford's Visual Geometry Group uses exclusively 3x3 kernels throughout, allowing dramatically increased depth.

**Why 3x3?**

- Two 3x3 convolutions = same receptive field as one 5x5 (saves 28% of parameters)

- Three 3x3 convolutions = same receptive field as one 7x7 (saves 45% of parameters)

- More non-linear transformations increase representational capacity
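The 28% and 45% figures follow from simple arithmetic. The sketch below assumes C input and C output channels in every layer and ignores biases, so the C^2 factor cancels. The function names are illustrative.

```python
def stacked_receptive_field(num_layers, kernel_size=3):
    """Receptive field of `num_layers` stride-1 convolutions stacked back to back."""
    return 1 + num_layers * (kernel_size - 1)


def weight_saving(num_layers, big_kernel, small_kernel=3):
    """Fraction of weights saved by replacing one big kernel with a stack of small ones.

    Assumes the same channel count C everywhere and no biases,
    so the C^2 factor cancels out of the ratio.
    """
    stacked = num_layers * small_kernel ** 2
    single = big_kernel ** 2
    return 1 - stacked / single


for n, big in [(2, 5), (3, 7)]:
    # The stack must cover the same window as the single large kernel
    assert stacked_receptive_field(n) == big
    print(f"{n} x 3x3 vs one {big}x{big}: saves {weight_saving(n, big):.0%}")
# 2 x 3x3 vs one 5x5: saves 28%
# 3 x 3x3 vs one 7x7: saves 45%
```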

from typing import List, Union

class VGG(nn.Module):
    """General VGG implementation"""

    def __init__(self, features: nn.Module, num_classes: int = 1000, dropout: float = 0.5):
        super(VGG, self).__init__()
        self.features = features
        self.avgpool = nn.AdaptiveAvgPool2d((7, 7))
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=dropout),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=dropout),
            nn.Linear(4096, num_classes)
        )
        self._initialize_weights()

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = x.flatten(1)
        x = self.classifier(x)
        return x

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)

def make_layers(cfg: List[Union[str, int]], batch_norm: bool = False) -> nn.Sequential:
    layers: List[nn.Module] = []
    in_channels = 3
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            v = int(v)
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
            if batch_norm:
                layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
            else:
                layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v
    return nn.Sequential(*layers)

cfgs = {
    'vgg16': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
    'vgg19': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}

def vgg16(num_classes=1000):
    return VGG(make_layers(cfgs['vgg16'], batch_norm=True), num_classes=num_classes)

model_vgg16 = vgg16()
x = torch.randn(2, 3, 224, 224)
out = model_vgg16(x)
print(f"VGG-16 output: {out.shape}")

params = sum(p.numel() for p in model_vgg16.parameters())
print(f"VGG-16 parameters: {params:,}")  # ~138M

GoogLeNet/Inception (2014, Szegedy) — Multi-scale Parallel Processing

The Inception module's key idea is to process different kernel sizes (1x1, 3x3, 5x5) in parallel, capturing features at multiple scales simultaneously.
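Two pieces of arithmetic make the design concrete: the branch outputs are concatenated along the channel axis, and a 1x1 reduction in front of the 5x5 branch keeps that branch cheap. The sketch below uses the branch widths 192, 64, 96, 128, 16, 32, 32 that this section's module is instantiated with, and ignores biases and BatchNorm. The helper `conv_weights` is illustrative.

```python
def conv_weights(in_ch, out_ch, k):
    """Weight count of a k x k convolution, ignoring biases."""
    return k * k * in_ch * out_ch


in_ch = 192
n1x1, n3x3_reduce, n3x3, n5x5_reduce, n5x5, pool_proj = 64, 96, 128, 16, 32, 32

# Concatenating the four branches along the channel axis
out_channels = n1x1 + n3x3 + n5x5 + pool_proj
print(f"Output channels: {out_channels}")  # 256

# 5x5 branch: direct 5x5 conv vs. 1x1 reduction followed by 5x5
direct_5x5 = conv_weights(in_ch, n5x5, 5)
reduced_5x5 = conv_weights(in_ch, n5x5_reduce, 1) + conv_weights(n5x5_reduce, n5x5, 5)
print(f"Direct 5x5:      {direct_5x5:,}")   # 153,600
print(f"1x1 reduce + 5x5: {reduced_5x5:,}")  # 15,872
```

With these widths the reduced branch needs roughly a tenth of the weights of the direct 5x5 convolution.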

class InceptionModule(nn.Module):
    """Basic Inception module"""

    def __init__(self, in_channels, n1x1, n3x3_reduce, n3x3,
                 n5x5_reduce, n5x5, pool_proj):
        super(InceptionModule, self).__init__()

        # 1x1 branch
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_channels, n1x1, kernel_size=1),
            nn.BatchNorm2d(n1x1),
            nn.ReLU(inplace=True)
        )

        # 1x1 bottleneck + 3x3
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, n3x3_reduce, kernel_size=1),
            nn.BatchNorm2d(n3x3_reduce),
            nn.ReLU(inplace=True),
            nn.Conv2d(n3x3_reduce, n3x3, kernel_size=3, padding=1),
            nn.BatchNorm2d(n3x3),
            nn.ReLU(inplace=True)
        )

        # 1x1 bottleneck + 5x5
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, n5x5_reduce, kernel_size=1),
            nn.BatchNorm2d(n5x5_reduce),
            nn.ReLU(inplace=True),
            nn.Conv2d(n5x5_reduce, n5x5, kernel_size=5, padding=2),
            nn.BatchNorm2d(n5x5),
            nn.ReLU(inplace=True)
        )

        # MaxPool + 1x1
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, pool_proj, kernel_size=1),
            nn.BatchNorm2d(pool_proj),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        b1 = self.branch1(x)
        b2 = self.branch2(x)
        b3 = self.branch3(x)
        b4 = self.branch4(x)
        return torch.cat([b1, b2, b3, b4], dim=1)

module = InceptionModule(192, 64, 96, 128, 16, 32, 32)
x = torch.randn(2, 192, 28, 28)
out = module(x)
print(f"Inception output: {out.shape}")  # (2, 256, 28, 28)

ResNet (2015, He) — Solving the Vanishing Gradient with Residual Connections

ResNet, introduced by Kaiming He and colleagues in 2015, uses skip connections that let gradients flow directly through very deep networks, which made it practical to train models as deep as 152 layers.

**Core idea**: H(x) = F(x) + x

Instead of learning H(x) directly, each layer learns the residual F(x) = H(x) - x. When the optimal mapping is close to the identity, driving F(x) toward zero is much easier.
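A toy scalar model shows why the identity is easy to represent. In the sketch below each "layer" is just a multiplication by one weight, which is a deliberate simplification and not how a real ResNet block is built. The function names are illustrative.

```python
def plain_stack(x, weight, depth):
    """Plain network sketch: each layer computes H(x) = F(x) = weight * x."""
    for _ in range(depth):
        x = weight * x
    return x


def residual_stack(x, weight, depth):
    """Residual network sketch: each layer computes H(x) = F(x) + x with F(x) = weight * x."""
    for _ in range(depth):
        x = weight * x + x
    return x


x, depth = 1.0, 20

# With F driven to zero, the plain stack destroys the signal,
# while the residual stack is exactly the identity mapping.
print(plain_stack(x, 0.0, depth))     # 0.0
print(residual_stack(x, 0.0, depth))  # 1.0
```

In this toy model the derivative of the output with respect to the input is `weight ** depth` for the plain stack and `(1 + weight) ** depth` for the residual stack, so small weights no longer shrink the gradient toward zero.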

from typing import Optional, Type, List

class BasicBlock(nn.Module):
    """Basic block for ResNet-18/34"""
    expansion = 1

    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsample = downsample

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity  # The residual connection
        out = self.relu(out)
        return out

class Bottleneck(nn.Module):
    """Bottleneck block for ResNet-50/101/152"""
    expansion = 4

    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        # 1x1 (reduce channels)
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        # 3x3 (spatial processing)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # 1x1 (expand channels: out_channels * 4)
        self.conv3 = nn.Conv2d(out_channels, out_channels * self.expansion,
                               kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity
        out = self.relu(out)
        return out

class ResNet(nn.Module):
    """Complete ResNet implementation"""

    def __init__(self, block, layers, num_classes=1000):
        super(ResNet, self).__init__()
        self.in_channels = 64

        # Stem
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        # 4 stages
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)

        # Classifier
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        self._initialize_weights()

    def _make_layer(self, block, out_channels, blocks, stride=1):
        downsample = None
        if stride != 1 or self.in_channels != out_channels * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(self.in_channels, out_channels * block.expansion,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels * block.expansion)
            )

        layers = [block(self.in_channels, out_channels, stride, downsample)]
        self.in_channels = out_channels * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.in_channels, out_channels))
        return nn.Sequential(*layers)

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        x = self.maxpool(self.relu(self.bn1(self.conv1(x))))
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.avgpool(x)
        x = x.flatten(1)
        x = self.fc(x)
        return x

def resnet18(num_classes=1000):
    return ResNet(BasicBlock, [2, 2, 2, 2], num_classes)

def resnet34(num_classes=1000):
    return ResNet(BasicBlock, [3, 4, 6, 3], num_classes)

def resnet50(num_classes=1000):
    return ResNet(Bottleneck, [3, 4, 6, 3], num_classes)

def resnet101(num_classes=1000):
    return ResNet(Bottleneck, [3, 4, 23, 3], num_classes)

def resnet152(num_classes=1000):
    return ResNet(Bottleneck, [3, 8, 36, 3], num_classes)

# Test
for name, model_fn in [('ResNet-18', resnet18), ('ResNet-50', resnet50)]:
    model = model_fn()
    x = torch.randn(2, 3, 224, 224)
    out = model(x)
    params = sum(p.numel() for p in model.parameters())
    print(f"{name}: output={out.shape}, params={params:,}")

DenseNet (2017, Huang) — Dense Connectivity

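DenseNet connects each layer to every preceding layer within a dense block. With L layers, a traditional feed-forward CNN has L connections (one between each layer and the next), whereas a dense block has L(L+1)/2 direct connections.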
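The L(L+1)/2 count follows from layer l receiving l inputs: the block input plus the outputs of the l-1 layers before it. The sketch below counts those inputs and also tracks how the input width of each layer grows with the growth rate. The example values (64 input channels, growth rate 32, 6 layers) are chosen for illustration.

```python
def dense_connections(num_layers):
    """Direct connections in a dense block: layer l (1-indexed) receives l inputs."""
    return sum(l for l in range(1, num_layers + 1))


def dense_layer_inputs(in_channels, growth_rate, num_layers):
    """Input channel count seen by each layer after concatenating all earlier feature maps."""
    return [in_channels + i * growth_rate for i in range(num_layers)]


L = 5
print(dense_connections(L), L * (L + 1) // 2)  # 15 15

widths = dense_layer_inputs(in_channels=64, growth_rate=32, num_layers=6)
print(widths)                 # [64, 96, 128, 160, 192, 224]
print(widths[-1] + 32)        # 256 channels leave the block
```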

class DenseLayer(nn.Module):
    """A single DenseNet layer"""

    def __init__(self, in_channels, growth_rate, bn_size=4, drop_rate=0.0):
        super(DenseLayer, self).__init__()
        # Bottleneck: 1x1 conv to limit channels
        self.norm1 = nn.BatchNorm2d(in_channels)
        self.relu1 = nn.ReLU(inplace=True)
        self.conv1 = nn.Conv2d(in_channels, bn_size * growth_rate, kernel_size=1, bias=False)
        # 3x3 conv
        self.norm2 = nn.BatchNorm2d(bn_size * growth_rate)
        self.relu2 = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(bn_size * growth_rate, growth_rate,
                               kernel_size=3, padding=1, bias=False)
        self.drop_rate = drop_rate

    def forward(self, x):
        if isinstance(x, torch.Tensor):
            prev_features = [x]
        else:
            prev_features = x
        # Concat all previous feature maps
        concat_input = torch.cat(prev_features, dim=1)
        out = self.conv1(self.relu1(self.norm1(concat_input)))
        out = self.conv2(self.relu2(self.norm2(out)))
        if self.drop_rate > 0:
            out = F.dropout(out, p=self.drop_rate, training=self.training)
        return out

class DenseBlock(nn.Module):
    """Dense Block composed of multiple DenseLayers"""

    def __init__(self, num_layers, in_channels, growth_rate, bn_size=4, drop_rate=0.0):
        super(DenseBlock, self).__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            layer = DenseLayer(
                in_channels + i * growth_rate,
                growth_rate, bn_size, drop_rate
            )
            self.layers.append(layer)

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            new_feat = layer(features)
            features.append(new_feat)
        return torch.cat(features, dim=1)

class TransitionLayer(nn.Module):
    """Transition layer between Dense Blocks (compression + downsampling)"""

    def __init__(self, in_channels, out_channels):
        super(TransitionLayer, self).__init__()
        self.norm = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        return self.pool(self.conv(self.relu(self.norm(x))))

MobileNet (2017) — Lightweight for Edge Devices

MobileNet introduced Depthwise Separable Convolutions, drastically reducing computation while maintaining accuracy — ideal for mobile and edge deployment.
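The computational saving can be estimated by counting multiply-adds. The sketch below compares a standard 3x3 convolution with its depthwise separable counterpart on a square feature map, ignoring BatchNorm and activations. The layer sizes (512 channels, 14x14 map) and function names are illustrative.

```python
def standard_conv_madds(k, in_ch, out_ch, feat):
    """Multiply-adds of a k x k standard convolution on a feat x feat output map."""
    return k * k * in_ch * out_ch * feat * feat


def depthwise_separable_madds(k, in_ch, out_ch, feat):
    """Multiply-adds of a k x k depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = k * k * in_ch * feat * feat
    pointwise = in_ch * out_ch * feat * feat
    return depthwise + pointwise


k, in_ch, out_ch, feat = 3, 512, 512, 14
ratio = depthwise_separable_madds(k, in_ch, out_ch, feat) / standard_conv_madds(k, in_ch, out_ch, feat)

# The ratio simplifies to 1/out_ch + 1/k^2, independent of the feature map size
print(f"Cost ratio: {ratio:.3f}")                        # 0.113
print(f"Closed form: {1 / out_ch + 1 / k ** 2:.3f}")     # 0.113
```

For 3x3 kernels the 1/k^2 term dominates, so the separable version costs roughly one ninth of the standard convolution.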

class DepthwiseSeparableConv(nn.Module):
    """Depthwise Separable Convolution"""

    def __init__(self, in_channels, out_channels, stride=1):
        super(DepthwiseSeparableConv, self).__init__()
        # Depthwise: process each input channel independently
        self.depthwise = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=3,
                      stride=stride, padding=1, groups=in_channels, bias=False),
            nn.BatchNorm2d(in_channels),
            nn.ReLU6(inplace=True)
        )
        # Pointwise: 1x1 conv to combine channels
        self.pointwise = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU6(inplace=True)
        )

    def forward(self, x):
        x = self.depthwise(x)
        x = self.pointwise(x)
        return x

class InvertedResidual(nn.Module):
    """MobileNetV2 inverted residual block"""

    def __init__(self, in_channels, out_channels, stride, expand_ratio):
        super(InvertedResidual, self).__init__()
        self.stride = stride
        hidden_dim = int(in_channels * expand_ratio)
        self.use_res_connect = (stride == 1 and in_channels == out_channels)

        layers = []
        if expand_ratio != 1:
            layers += [
                nn.Conv2d(in_channels, hidden_dim, 1, bias=False),
                nn.BatchNorm2d(hidden_dim),
                nn.ReLU6(inplace=True)
            ]
        layers += [
            nn.Conv2d(hidden_dim, hidden_dim, 3, stride=stride,
                      padding=1, groups=hidden_dim, bias=False),
            nn.BatchNorm2d(hidden_dim),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden_dim, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels)
        ]
        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        if self.use_res_connect:
            return x + self.conv(x)
        else:
            return self.conv(x)

# Parameter savings
standard_conv_params = 3 * 3 * 512 * 512     # standard convolution
dw_sep_params = (3 * 3 * 512) + (512 * 512)  # depthwise separable
print(f"Standard conv: {standard_conv_params:,}")
print(f"Depthwise Separable: {dw_sep_params:,}")
print(f"Savings: {(1 - dw_sep_params/standard_conv_params):.1%}")

EfficientNet (2019, Tan) — Compound Scaling

EfficientNet proposes scaling width, depth, and resolution together using a compound coefficient, achieving the best accuracy-efficiency tradeoff at the time.
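As a rough illustration of compound scaling, the sketch below raises three base multipliers to a shared exponent phi. The bases 1.2 (depth), 1.1 (width) and 1.15 (resolution) are the values reported in the EfficientNet paper; the function name is illustrative, and released model configurations may use rounded or adjusted coefficients rather than these exact powers.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases from the EfficientNet paper


def compound_scale(phi):
    """Return (depth, width, resolution, flops) multipliers for compound coefficient phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs grow linearly with depth and quadratically with width and resolution
    flops = (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPs x{f:.2f}")
```

The bases are chosen so that ALPHA * BETA^2 * GAMMA^2 is close to 2, which means each unit increase in phi roughly doubles the compute budget.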

# EfficientNet scaling coefficients:
# (width_coeff, depth_coeff, resolution, dropout_rate)
efficientnet_params = {
    'b0': (1.0, 1.0, 224, 0.2),
    'b1': (1.0, 1.1, 240, 0.2),
    'b2': (1.1, 1.2, 260, 0.3),
    'b3': (1.2, 1.4, 300, 0.3),
    'b4': (1.4, 1.8, 380, 0.4),
    'b5': (1.6, 2.2, 456, 0.4),
    'b6': (1.8, 2.6, 528, 0.5),
    'b7': (2.0, 3.1, 600, 0.5),
}

print("EfficientNet scaling parameters:")
for version, (w, d, r, drop) in efficientnet_params.items():
    print(f"  B{version[1]}: width={w:.1f}, depth={d:.1f}, res={r}, dropout={drop}")

ConvNeXt (2022, Liu) — A ConvNet for the 2020s

ConvNeXt modernizes the CNN design space by importing ideas from Vision Transformers — large kernels, LayerNorm, GELU, and inverted bottlenecks — achieving Transformer-competitive performance.

class ConvNeXtBlock(nn.Module):
    """ConvNeXt block"""

    def __init__(self, dim, layer_scale_init_value=1e-6):
        super(ConvNeXtBlock, self).__init__()
        # Depthwise Conv with large kernel (7x7)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        # LayerNorm
        self.norm = nn.LayerNorm(dim, eps=1e-6)
        # Inverted Bottleneck (4x channel expansion)
        self.pwconv1 = nn.Linear(dim, 4 * dim)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)
        # Layer Scale
        self.gamma = nn.Parameter(
            layer_scale_init_value * torch.ones(dim),
            requires_grad=True
        ) if layer_scale_init_value > 0 else None

    def forward(self, x):
        identity = x
        x = self.dwconv(x)
        # (N, C, H, W) -> (N, H, W, C) for LayerNorm
        x = x.permute(0, 2, 3, 1)
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        if self.gamma is not None:
            x = self.gamma * x
        # (N, H, W, C) -> (N, C, H, W)
        x = x.permute(0, 3, 1, 2)
        return identity + x

3. Vision Transformer (ViT)

ViT splits images into patches and applies a Transformer, treating each patch as a token — a fundamentally different paradigm from traditional CNNs.
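To make the patch-as-token idea concrete, the sketch below splits a tiny single-channel image into non-overlapping patches and flattens each one into a vector, then repeats the arithmetic at ViT-Base scale. The helper `patchify` is an illustrative pure-Python stand-in, not part of any library.

```python
def patchify(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches in row-major order."""
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[top + i][left + j]
                     for i in range(patch_size)
                     for j in range(patch_size)]
            patches.append(patch)
    return patches


# 4x4 toy image with pixel values 0..15
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, patch_size=2)
print(len(tokens), tokens[0])  # 4 [0, 1, 4, 5]

# Same arithmetic for a 224x224 RGB image with 16x16 patches
num_patches = (224 // 16) ** 2  # 196 tokens, plus one CLS token in ViT
patch_dim = 16 * 16 * 3         # 768 raw values per patch
print(num_patches, patch_dim)   # 196 768
```

Each flattened patch is then mapped to the embedding dimension by a learned linear projection, which is equivalent to a convolution whose kernel size and stride both equal the patch size.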

class PatchEmbedding(nn.Module):
    """Convert image to patch embeddings"""

    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super(PatchEmbedding, self).__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # Single convolution performs patch extraction and embedding
        self.projection = nn.Conv2d(
            in_channels, embed_dim,
            kernel_size=patch_size, stride=patch_size
        )

    def forward(self, x):
        x = self.projection(x)  # (B, embed_dim, H/p, W/p)
        x = x.flatten(2)        # (B, embed_dim, num_patches)
        x = x.transpose(1, 2)   # (B, num_patches, embed_dim)
        return x

class MultiHeadSelfAttention(nn.Module):
    """Multi-head self-attention"""

    def __init__(self, embed_dim, num_heads, dropout=0.0):
        super(MultiHeadSelfAttention, self).__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(embed_dim, embed_dim * 3)
        self.proj = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)
        q, k, v = qkv.unbind(0)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.dropout(attn)
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        return x

class TransformerBlock(nn.Module):
    """Transformer block"""

    def __init__(self, embed_dim, num_heads, mlp_ratio=4.0, dropout=0.0):
        super(TransformerBlock, self).__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = MultiHeadSelfAttention(embed_dim, num_heads, dropout)
        self.norm2 = nn.LayerNorm(embed_dim)
        mlp_hidden = int(embed_dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_hidden),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_hidden, embed_dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        x = x + self.attn(self.norm1(x))  # residual
        x = x + self.mlp(self.norm2(x))   # residual
        return x

class VisionTransformer(nn.Module):
    """Vision Transformer (ViT)"""

    def __init__(self, image_size=224, patch_size=16, in_channels=3,
                 num_classes=1000, embed_dim=768, depth=12, num_heads=12,
                 mlp_ratio=4.0, dropout=0.0):
        super(VisionTransformer, self).__init__()
        self.patch_embed = PatchEmbedding(image_size, patch_size, in_channels, embed_dim)
        num_patches = self.patch_embed.num_patches

        # CLS token + positional embedding
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embedding = nn.Parameter(
            torch.zeros(1, num_patches + 1, embed_dim)
        )
        self.pos_dropout = nn.Dropout(dropout)

        # Transformer blocks
        self.blocks = nn.Sequential(*[
            TransformerBlock(embed_dim, num_heads, mlp_ratio, dropout)
            for _ in range(depth)
        ])
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

        self._init_weights()

    def _init_weights(self):
        nn.init.trunc_normal_(self.pos_embedding, std=0.02)
        nn.init.trunc_normal_(self.cls_token, std=0.02)
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.trunc_normal_(m.weight, std=0.02)
                if m.bias is not None:
                    nn.init.zeros_(m.bias)

    def forward(self, x):
        B = x.shape[0]
        x = self.patch_embed(x)  # (B, num_patches, embed_dim)
        cls_tokens = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls_tokens, x], dim=1)  # prepend CLS token
        x = x + self.pos_embedding
        x = self.pos_dropout(x)
        x = self.blocks(x)
        x = self.norm(x)
        cls_output = x[:, 0]
        return self.head(cls_output)

def vit_base(num_classes=1000):
    return VisionTransformer(
        image_size=224, patch_size=16, embed_dim=768, depth=12,
        num_heads=12, num_classes=num_classes
    )

model = vit_base()
x = torch.randn(2, 3, 224, 224)
out = model(x)
params = sum(p.numel() for p in model.parameters())
print(f"ViT-Base output: {out.shape}, parameters: {params:,}")

4. Object Detection: YOLO

class YOLOHead(nn.Module):
    """Simplified YOLO detection head"""

    def __init__(self, in_channels, num_anchors, num_classes):
        super(YOLOHead, self).__init__()
        self.num_anchors = num_anchors
        self.num_classes = num_classes
        # Predict: (x, y, w, h, objectness, num_classes) * num_anchors
        out_channels = num_anchors * (5 + num_classes)
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, in_channels * 2, kernel_size=3, padding=1),
            nn.BatchNorm2d(in_channels * 2),
            nn.LeakyReLU(0.1),
            nn.Conv2d(in_channels * 2, out_channels, kernel_size=1)
        )

    def forward(self, x):
        out = self.head(x)
        B, C, H, W = out.shape
        out = out.reshape(B, self.num_anchors, 5 + self.num_classes, H, W)
        out = out.permute(0, 1, 3, 4, 2).contiguous()
        return out
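The shape bookkeeping in this head can be checked with plain arithmetic. The sketch below assumes 3 anchors, 80 classes and a 13x13 feature map purely for illustration; the helper names are hypothetical.

```python
def yolo_head_channels(num_anchors, num_classes):
    """Channels produced by the final 1x1 conv: (x, y, w, h, objectness + classes) per anchor."""
    return num_anchors * (5 + num_classes)


def yolo_output_shape(batch, num_anchors, num_classes, height, width):
    """Shape after the reshape and permute in the head: one prediction vector per anchor per cell."""
    return (batch, num_anchors, height, width, 5 + num_classes)


channels = yolo_head_channels(num_anchors=3, num_classes=80)
shape = yolo_output_shape(batch=2, num_anchors=3, num_classes=80, height=13, width=13)
boxes_per_image = shape[1] * shape[2] * shape[3]

print(channels)         # 255
print(shape)            # (2, 3, 13, 13, 85)
print(boxes_per_image)  # 507
```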

5. Image Segmentation: U-Net

class DoubleConv(nn.Module):
    """U-Net double convolution block"""

    def __init__(self, in_channels, out_channels):
        super(DoubleConv, self).__init__()
        self.double_conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        return self.double_conv(x)

class UNet(nn.Module):
    """U-Net for medical image segmentation"""

    def __init__(self, in_channels=1, num_classes=2, features=(64, 128, 256, 512)):
        super(UNet, self).__init__()
        self.encoders = nn.ModuleList()
        self.decoders = nn.ModuleList()
        self.pool = nn.MaxPool2d(2, 2)

        # Encoder path
        for feature in features:
            self.encoders.append(DoubleConv(in_channels, feature))
            in_channels = feature

        # Bottleneck
        self.bottleneck = DoubleConv(features[-1], features[-1] * 2)

        # Decoder path
        for feature in reversed(features):
            self.decoders.append(
                nn.ConvTranspose2d(feature * 2, feature, kernel_size=2, stride=2)
            )
            self.decoders.append(DoubleConv(feature * 2, feature))

        self.final_conv = nn.Conv2d(features[0], num_classes, kernel_size=1)

    def forward(self, x):
        skip_connections = []

        # Encoder
        for encoder in self.encoders:
            x = encoder(x)
            skip_connections.append(x)
            x = self.pool(x)

        x = self.bottleneck(x)
        skip_connections = skip_connections[::-1]

        # Decoder
        for i in range(0, len(self.decoders), 2):
            x = self.decoders[i](x)
            skip = skip_connections[i // 2]
            if x.shape != skip.shape:
                x = F.interpolate(x, size=skip.shape[2:])
            x = torch.cat([skip, x], dim=1)  # Skip connection
            x = self.decoders[i + 1](x)

        return self.final_conv(x)

model = UNet(in_channels=1, num_classes=2)
x = torch.randn(4, 1, 572, 572)
out = model(x)
print(f"U-Net output: {out.shape}")  # (4, 2, 572, 572)

6. Transfer Learning in Practice

Using torchvision.models

import torch.nn as nn
from torchvision import models
from tqdm import tqdm

# Load pretrained models
model_resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model_efficientnet = models.efficientnet_b4(weights=models.EfficientNet_B4_Weights.DEFAULT)
model_vit = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)

def feature_extraction(num_classes, freeze=True):
    """Feature extraction: freeze backbone, train only classifier"""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

    if freeze:
        for param in model.parameters():
            param.requires_grad = False

    # Replace classifier
    in_features = model.fc.in_features
    model.fc = nn.Sequential(
        nn.Dropout(0.5),
        nn.Linear(in_features, 256),
        nn.ReLU(),
        nn.Linear(256, num_classes)
    )
    for param in model.fc.parameters():
        param.requires_grad = True

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable:,} / {total:,} ({trainable/total:.1%})")
    return model

def fine_tuning(num_classes, unfreeze_layers=2):
    """Fine-tuning: unfreeze the last few layers"""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

    for param in model.parameters():
        param.requires_grad = False

    # Replace the classifier first so the new head is the one that gets trained
    model.fc = nn.Linear(model.fc.in_features, num_classes)

    # Ordered from earlier to later; avgpool is left out because it has no parameters,
    # so with the default of 2 this unfreezes layer4 and the new head
    layers = [model.layer3, model.layer4, model.fc]
    for layer in layers[-unfreeze_layers:]:
        for param in layer.parameters():
            param.requires_grad = True

    return model

def train_model(model, train_loader, val_loader, epochs=10,
                learning_rate=1e-3, device='cuda'):
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()

    # Discriminative learning rates: the pretrained backbone gets a 10x
    # smaller learning rate than the newly initialized head ('fc')
    backbone_params = [p for n, p in model.named_parameters()
                       if 'fc' not in n and p.requires_grad]
    head_params = [p for n, p in model.named_parameters()
                   if 'fc' in n and p.requires_grad]

    optimizer = optim.AdamW([
        {'params': backbone_params, 'lr': learning_rate * 0.1},
        {'params': head_params, 'lr': learning_rate}
    ], weight_decay=1e-4)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

    best_val_acc = 0.0
    for epoch in range(epochs):
        # Training pass
        model.train()
        train_correct, train_total = 0, 0
        for images, labels in tqdm(train_loader, desc=f'Epoch {epoch+1}/{epochs}'):
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            # Clip the global gradient norm to stabilize fine-tuning
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            train_correct += (outputs.argmax(1) == labels).sum().item()
            train_total += images.size(0)

        # Validation pass (no gradients needed)
        model.eval()
        val_correct, val_total = 0, 0
        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device)
                outputs = model(images)
                val_correct += (outputs.argmax(1) == labels).sum().item()
                val_total += images.size(0)

        scheduler.step()
        val_acc = val_correct / val_total
        print(f"Epoch {epoch+1}: Train={train_correct/train_total:.4f}, Val={val_acc:.4f}")

        # Checkpoint the best epoch; the returned model holds last-epoch weights
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), 'best_model.pt')

    print(f"Best validation accuracy: {best_val_acc:.4f}")
    return model
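To see what the two parameter groups and the cosine schedule do to the learning rates over a run, the closed-form cosine annealing curve can be tabulated by hand. This is a minimal sketch in plain Python, not PyTorch's scheduler itself: `cosine_lr` is a hypothetical helper, and it assumes a minimum learning rate of 0 (the default `eta_min` of `CosineAnnealingLR`) with one scheduler step per epoch.

```python
import math

def cosine_lr(base_lr, epoch, t_max, eta_min=0.0):
    """Closed-form cosine annealing: base_lr at epoch 0, eta_min at epoch t_max."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2

learning_rate = 1e-3  # head learning rate, matching the default above
epochs = 10

schedule = []
for epoch in range(epochs + 1):
    head_lr = cosine_lr(learning_rate, epoch, epochs)
    backbone_lr = cosine_lr(learning_rate * 0.1, epoch, epochs)  # 10x smaller
    schedule.append((epoch, backbone_lr, head_lr))

for epoch, backbone_lr, head_lr in schedule:
    print(f"epoch {epoch:2d}: backbone={backbone_lr:.2e}, head={head_lr:.2e}")
```

Both groups follow the same curve, so the backbone stays at one tenth of the head's learning rate throughout, and both decay toward zero by the final epoch.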

Data augmentation

from torchvision import transforms

def get_transforms(image_size=224):
    # Training: random crop, flip, rotation, and color jitter for augmentation
    train_transforms = transforms.Compose([
        transforms.RandomResizedCrop(image_size),
        transforms.RandomHorizontalFlip(),
        transforms.RandomRotation(15),
        transforms.ColorJitter(brightness=0.2, contrast=0.2,
                               saturation=0.2, hue=0.1),
        transforms.ToTensor(),
        # ImageNet channel statistics, matching the pretrained weights
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])

    # Validation: deterministic resize and center crop, no augmentation
    val_transforms = transforms.Compose([
        transforms.Resize(int(image_size * 1.14)),
        transforms.CenterCrop(image_size),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])

    return train_transforms, val_transforms
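The arithmetic behind the normalization and validation resize steps is simple enough to check by hand. The sketch below uses plain Python rather than torchvision; `normalize_pixel` and `val_resize_size` are hypothetical helpers that mirror the per-channel `(x - mean) / std` rule and the `int(image_size * 1.14)` expression used above.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Apply (x - mean) / std per channel to one RGB pixel with values in [0, 1]."""
    return [(x - m) / s for x, m, s in zip(rgb, mean, std)]

def val_resize_size(image_size, ratio=1.14):
    """Shorter-side size used before the center crop in the validation pipeline."""
    return int(image_size * ratio)

white = normalize_pixel([1.0, 1.0, 1.0])     # brightest possible pixel
mean_pixel = normalize_pixel(IMAGENET_MEAN)  # a pixel equal to the dataset mean
resize_to = val_resize_size(224)

print([round(v, 3) for v in white])
print(mean_pixel)
print(resize_to)
```

A pixel equal to the dataset mean maps to zero in every channel, and a 224-pixel crop is taken from an image whose shorter side was resized to 255 pixels.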

Architecture Performance Comparison

| Architecture | Year | Top-1 Accuracy (ImageNet) | Parameters | FLOPs |
| --------------- | ---- | -------------- | ---------- | ----- |
| LeNet-5 | 1998 | ~99% (MNIST) | 60K | - |
| AlexNet | 2012 | 56.5% | 61M | 724M |
| VGG-16 | 2014 | 71.6% | 138M | 15.5G |
| GoogLeNet | 2014 | 68.7% | 6.8M | 1.5G |
| ResNet-50 | 2015 | 75.3% | 25M | 4.1G |
| DenseNet-121 | 2017 | 74.4% | 8M | 2.9G |
| MobileNetV2 | 2018 | 71.8% | 3.4M | 300M |
| EfficientNet-B0 | 2019 | 77.1% | 5.3M | 390M |
| ConvNeXt-T | 2022 | 82.1% | 28M | 4.5G |
| ViT-B/16 | 2020 | 81.8% | 86M | 17.6G |
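The parameter column can be sanity-checked with the per-layer formula `(kernel_h * kernel_w * in_channels + 1) * out_channels`. As a rough sketch, the code below rebuilds the VGG-16 count from the standard 13-conv, 3-FC layout, assuming a 224x224 input, 3x3 kernels with bias, and 1000 classes; the helper names are hypothetical.

```python
# Standard VGG-16 channel plan; 'M' marks a 2x2 max-pool that halves H and W
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']

def conv_params(in_ch, out_ch, k=3):
    """Weights plus one bias per output channel for a k x k convolution."""
    return (k * k * in_ch + 1) * out_ch

def linear_params(in_features, out_features):
    """Weights plus one bias per output unit for a fully connected layer."""
    return (in_features + 1) * out_features

def vgg16_param_count(num_classes=1000, image_size=224):
    total, in_ch, size = 0, 3, image_size
    for v in VGG16_CFG:
        if v == 'M':
            size //= 2              # pooling has no parameters
        else:
            total += conv_params(in_ch, v)
            in_ch = v
    features = in_ch * size * size  # 512 * 7 * 7 = 25088 after five pools
    for out_features in (4096, 4096, num_classes):
        total += linear_params(features, out_features)
        features = out_features
    return total

vgg16_total = vgg16_param_count()
print(f"{vgg16_total:,}")  # 138,357,544
```

The result rounds to the 138M listed for VGG-16, and most of it comes from the first fully connected layer rather than from the convolutions.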

Conclusion

CNN architectures have undergone remarkable evolution:

- **LeNet** (1998): First practical CNN, establishing the foundational structure

- **AlexNet** (2012): Deep learning renaissance, popularized ReLU activations and Dropout

- **VGGNet** (2014): The power of 3x3 convolutions, proving depth matters

- **ResNet** (2015): Residual (skip) connections mitigated vanishing gradients, making networks with over 100 layers trainable

- **DenseNet** (2017): Dense connections maximized feature reuse

- **MobileNet** (2017): Depthwise separable convolutions enabled mobile deployment

- **EfficientNet** (2019): Compound scaling of depth, width, and resolution achieved state-of-the-art efficiency at the time

- **ConvNeXt** (2022): Modernized CNN design with Transformer-inspired principles

- **ViT** (2020): Treating images as sequences opened a new paradigm

In practice, start from torchvision's pretrained models and apply transfer learning to quickly adapt to your target task.

References

- [PyTorch Vision Models](https://pytorch.org/vision/stable/models.html)

- ResNet paper: He et al., "Deep Residual Learning for Image Recognition" (arXiv:1512.03385)

- EfficientNet paper: Tan and Le, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks" (arXiv:1905.11946)

- ViT paper: Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (arXiv:2010.11929)

- ConvNeXt paper: Liu et al., "A ConvNet for the 2020s" (arXiv:2201.03545)
