
ResNet Paper In-Depth Analysis: How Residual Connections Broke the Depth Barrier in Deep Learning


1. Paper Overview

"Deep Residual Learning for Image Recognition" was published in 2015 by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun from Microsoft Research. It received the Best Paper Award at CVPR 2016 and has accumulated over 200,000 citations as of 2025, making it one of the most influential papers in the history of deep learning.

The problem this paper solved is straightforward: "If deeper networks are supposed to perform better, why do they actually perform worse?" The Residual Learning Framework, proposed as the answer to this simple question, successfully trained a 152-layer network that achieved a top-5 error rate of 3.57% on the ImageNet ILSVRC 2015 Classification Task, securing first place. This figure significantly undercuts the human recognition error rate (approximately 5.1%).

ResNet did not merely win the Image Classification track. That same year, it swept first place across all five tracks: ImageNet Classification, ImageNet Detection, ImageNet Localization, COCO Detection, and COCO Segmentation. Furthermore, the Skip Connection (Shortcut Connection) proposed in this paper has since become a core component in virtually every modern deep learning architecture, including Transformers, BERT, GPT, and Diffusion Models.


2. Background: The Depth Dilemma

2.1 The Era of Deep Networks

After AlexNet (8 layers) won the ImageNet Challenge in 2012, the deep learning community followed the intuition that "deeper networks = better performance." This intuition was largely correct.

| Year | Model | Layers | Top-5 Error (%) |
|------|-------|--------|-----------------|
| 2012 | AlexNet | 8 | 16.4 |
| 2013 | ZFNet | 8 | 14.8 |
| 2014 | VGGNet | 19 | 7.3 |
| 2014 | GoogLeNet (Inception v1) | 22 | 6.7 |
| 2015 | ResNet | 152 | 3.57 |

VGGNet (2014) increased depth to 19 layers using only 3x3 convolutions, while GoogLeNet used a parallel structure called the Inception Module to construct a 22-layer network. Both models experimentally demonstrated that "depth is critical for performance."

2.2 Lessons and Limitations of VGGNet

VGGNet established an important principle in architecture design: stacking multiple small filters (3x3) instead of a single large filter (5x5, 7x7) achieves the same receptive field while reducing the number of parameters and inserting more nonlinear activation functions to increase expressiveness.
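The arithmetic behind this principle is easy to check. A minimal sketch (illustrative channel count C = 256, biases ignored) comparing parameter counts of stacked 3x3 convolutions against single large filters with the same receptive field:

```python
def conv_params(k, c_in, c_out):
    """Weights of one k x k convolution: k * k * c_in * c_out (bias ignored)."""
    return k * k * c_in * c_out

C = 256  # illustrative channel count

# Two stacked 3x3 convs cover a 5x5 receptive field; three cover 7x7
two_3x3 = 2 * conv_params(3, C, C)    # 18 * C^2
one_5x5 = conv_params(5, C, C)        # 25 * C^2
three_3x3 = 3 * conv_params(3, C, C)  # 27 * C^2
one_7x7 = conv_params(7, C, C)        # 49 * C^2

print(two_3x3 < one_5x5, three_3x3 < one_7x7)  # True True
```

The stacked form is not only cheaper but also inserts an extra ReLU between the convolutions, which is where the added expressiveness comes from.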

However, 19 layers was effectively VGGNet's limit. The performance gains from VGG-16 to VGG-19 were already diminishing, and going deeper actually degraded performance. The parameter count was also problematic -- VGG-19 required approximately 144 million parameters and 19.6 billion FLOPs of computation.

2.3 GoogLeNet (Inception) Approach

GoogLeNet tackled the depth problem differently from VGGNet. It designed Inception Modules that perform 1x1, 3x3, and 5x5 convolutions in parallel, using 1x1 convolutions to reduce channel counts and save computation. Despite having only 22 layers, it achieved a lower error rate than VGGNet with fewer parameters (approximately 5 million).

However, the complex structure of Inception Modules had scalability limitations. Simply increasing the number of layers was insufficient to push performance further.

2.4 The Fundamental Question

At this point, the community faced a fundamental question:

"Is there a way to freely increase the depth of networks?"

The answer to this question is ResNet.


3. Discovery of the Degradation Problem

3.1 Deeper != Better

One of the most important contributions of the ResNet paper is clearly defining and experimentally demonstrating the degradation problem.

Intuitively, if you add identity mapping layers on top of a shallow network, the resulting deeper network should perform at least as well as the shallow one. The added layers only need to pass the input through unchanged. Therefore, the training error of a deeper network should never be higher than that of its shallower counterpart.

However, reality was different. The paper observed that on both CIFAR-10 and ImageNet, a 56-layer Plain Network (a standard network without shortcuts) had higher training error than a 20-layer model. This is not an overfitting problem. If it were overfitting, the training error would be low while only the validation error would be high. The fact that training error itself is higher means that optimization itself is difficult.

3.2 Difference from Vanishing/Exploding Gradients

The degradation problem is a different phenomenon from vanishing or exploding gradients.

Vanishing/Exploding Gradients have been largely addressed by techniques such as Batch Normalization and He Initialization. In fact, the plain networks in the paper already employed these techniques, and the networks did converge. The problem was that the converged performance was lower than that of shallower networks.

\text{Training Error}_{56\text{-layer plain}} > \text{Training Error}_{20\text{-layer plain}}

This phenomenon suggests that even when gradients propagate well, learning identity mappings through a stack of nonlinear layers is inherently very difficult.

3.3 Construction Argument

The paper presents a key argument called the Construction Argument:

  1. Suppose there is a shallow network A.
  2. Add identity mapping layers on top of A to create a deeper network B.
  3. B should perform at least as well as A (since the added layers are identity functions).
  4. Therefore, the training error of the deeper network B cannot be higher than that of A.

But in actual experiments, the training error of B is higher than that of A. This means that current SGD-based optimizers fail to find such solutions. The problem lies not in the model's representational capacity but in the optimization difficulty.


4. Core Idea: Residual Learning

4.1 Core Intuition

If the cause of the degradation problem is that "learning identity mappings is difficult," the solution is straightforward: explicitly embed identity mappings into the network.

Let the function that a block in a conventional network must learn be H(x). The original goal is to learn H(x) directly. If H(x) = x (an identity mapping), learning this function through a stack of nonlinear layers is difficult.

The core idea of ResNet is to reformulate the problem so that the network learns the residual instead of learning H(x) directly:

\mathcal{F}(\mathbf{x}) := \mathcal{H}(\mathbf{x}) - \mathbf{x}

Therefore:

\mathcal{H}(\mathbf{x}) = \mathcal{F}(\mathbf{x}) + \mathbf{x}

If the optimal mapping is close to identity, driving F(x) toward zero is much easier than coaxing a stack of nonlinear layers into an exact identity H(x) = x. Pushing the weights of the nonlinear layers toward zero is a natural operation for the optimizer.

4.2 Structure of the Residual Block

A Residual Block has the following structure:

\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}

Where:

  • x: input to the block
  • F(x, {W_i}): the residual function to be learned (two to three convolution layers)
  • y: output of the block

The addition (+) in F(x, {W_i}) + x is performed element-wise and is called a Shortcut Connection or Skip Connection. It requires no additional parameters and adds negligible computational cost.

For a Residual Block with two layers:

\mathcal{F} = W_2\,\sigma(W_1 \mathbf{x})

Where σ is the ReLU activation function. Bias terms are omitted for notational convenience.

4.3 Handling Dimension Mismatch

When the dimensions of F(x) and x differ (at downsampling stages, where the feature map size and channel count change), they cannot be added directly. To address this, the paper uses a linear projection:

\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + W_s \mathbf{x}

Where W_s is a projection matrix that matches the dimensions. The paper experimented with three options:

  • Option A: Match dimensions with zero-padding (no additional parameters)
  • Option B: Use 1x1 convolution projection only when dimensions change
  • Option C: Use 1x1 convolution projection for all shortcuts

Experimental results showed that all three options were vastly superior to Plain Networks, with minimal differences between options. This demonstrates that the projection is not the key to solving degradation -- the identity shortcut itself is the key. The final ResNet adopted Option B for memory and computational efficiency.
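Option A fits in a few lines. The sketch below is an illustration of the idea (the subsampling and padding details follow the common open-source convention rather than the paper's exact code): downsample spatially by strided subsampling, then zero-pad the new channels, with no learnable parameters.

```python
import torch
import torch.nn.functional as F

def shortcut_option_a(x, out_channels, stride=2):
    """Parameter-free shortcut (Option A): subsample spatially,
    zero-pad the extra channels so dimensions match the residual branch."""
    x = x[:, :, ::stride, ::stride]          # spatial downsampling
    pad = out_channels - x.size(1)           # number of zero channels to add
    return F.pad(x, (0, 0, 0, 0, 0, pad))    # pad zeros on the channel dim

x = torch.randn(2, 64, 56, 56)
y = shortcut_option_a(x, 128)
print(y.shape)  # torch.Size([2, 128, 28, 28])
```

Option B, the variant the final ResNet adopted, appears as the `downsample` module in the implementation section below.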


5. Mathematical Analysis: Gradient Flow

5.1 Forward Propagation

Let us analyze forward propagation through Residual Blocks. If the output of the l-th Residual Block is x_l:

\mathbf{x}_{l+1} = \mathbf{x}_l + \mathcal{F}(\mathbf{x}_l, W_l)

Expanding this recursion, the output at any deeper layer L is:

\mathbf{x}_L = \mathbf{x}_l + \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i, W_i)

The significance of this equation is profound. The feature at any deep layer LL is expressed as the feature from a shallow layer ll plus the sum of all residual functions in between. In a plain network, this would be a chain of matrix multiplications, whereas in ResNet it takes the form of addition.

5.2 Backward Propagation and the Gradient Highway

Now let us examine the crucial backward pass. If the loss is L, the chain rule gives:

\frac{\partial \mathcal{L}}{\partial \mathbf{x}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}_L} \cdot \frac{\partial \mathbf{x}_L}{\partial \mathbf{x}_l}

Substituting the forward equation derived earlier:

\frac{\partial \mathcal{L}}{\partial \mathbf{x}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}_L} \cdot \left(1 + \frac{\partial}{\partial \mathbf{x}_l} \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i, W_i)\right)

The key here is the constant term 1. Through this path, gradients flow directly from the loss to any layer. Even if the derivative of the residual sum becomes arbitrarily small, the constant 1 ensures that gradients never completely vanish.

This is the principle by which ResNet forms a Gradient Highway. In a plain network, by contrast, the gradient must be multiplied through the Jacobians of all intermediate layers:

\frac{\partial \mathcal{L}}{\partial \mathbf{x}_l} = \prod_{i=l}^{L-1} \frac{\partial \mathbf{x}_{i+1}}{\partial \mathbf{x}_i} \cdot \frac{\partial \mathcal{L}}{\partial \mathbf{x}_L}

In this product form, even if each factor is only slightly less than 1, the gradient decreases exponentially. In ResNet's additive form, this problem does not arise.
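The contrast between the product form and the additive form can be observed numerically. A small sketch (a stack of deliberately small-weight linear + ReLU layers; the width, depth, and weight scale are arbitrary choices for illustration) comparing input-gradient norms with and without the skip connection:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 50, 64
layers = [nn.Linear(dim, dim) for _ in range(depth)]
for layer in layers:
    nn.init.normal_(layer.weight, std=0.05)  # small weights: each Jacobian factor < 1
    nn.init.zeros_(layer.bias)

def loss(x, residual):
    h = x
    for layer in layers:
        f = torch.relu(layer(h))
        h = h + f if residual else f  # skip connection on/off
    return h.sum()

x = torch.randn(1, dim, requires_grad=True)
loss(x, residual=False).backward()
g_plain = x.grad.norm().item()

x.grad = None
loss(x, residual=True).backward()
g_residual = x.grad.norm().item()

print(g_plain, g_residual)  # plain gradient collapses toward 0; residual stays well away from 0
```

The plain stack multiplies fifty factors smaller than one, so the input gradient underflows toward zero; the residual stack keeps the additive identity path open.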

5.3 Why Learning the "Residual" is Easier

For a more intuitive mathematical explanation: if the optimal transformation at each layer makes only small modifications to the input (a natural assumption for deep networks), the residual function F(x) should output values close to zero.

Since the weights are initialized with small values, each Residual Block performs a mapping close to identity at the start of training. This can be interpreted as the deep network behaving like a shallow network early in training, with each block gradually learning useful transformations over time.


6. Architecture Details

6.1 Overall Structure

ResNet is based on VGGNet's design philosophy with the addition of Shortcut Connections. All ResNet variants share the following common structure:

  1. conv1: 7x7 Convolution, stride 2, 64 filters, BatchNorm, ReLU
  2. Max Pooling: 3x3, stride 2
  3. conv2_x through conv5_x: Stacks of Residual Blocks
  4. Global Average Pooling: Reduces feature maps to 1x1
  5. Fully Connected Layer: 1000-class Softmax

When the feature map size is halved (at the first block of conv3_x, conv4_x, conv5_x), the number of channels doubles. Downsampling is performed via stride-2 convolution.

6.2 Basic Block (ResNet-18, ResNet-34)

The Basic Block consists of two 3x3 convolutions.

Input (C channels)
  |
  |---> 3x3 Conv, C filters, BN, ReLU
  |     3x3 Conv, C filters, BN
  |
  +---> (Identity Shortcut)
  |
  + <-- Element-wise Addition
  |
  ReLU
  |
Output (C channels)

Batch Normalization is applied after each convolution, and ReLU is applied after the addition.

6.3 Bottleneck Block (ResNet-50, ResNet-101, ResNet-152)

For ResNets with 50 or more layers, a Bottleneck structure is used for computational efficiency. It consists of three convolutions (1x1, 3x3, 1x1).

Input (4C channels)
  |
  |---> 1x1 Conv, C filters, BN, ReLU    (Channel reduction: 4C -> C)
  |     3x3 Conv, C filters, BN, ReLU    (Spatial processing)
  |     1x1 Conv, 4C filters, BN         (Channel restoration: C -> 4C)
  |
  +---> (Identity Shortcut)
  |
  + <-- Element-wise Addition
  |
  ReLU
  |
Output (4C channels)

The key idea of the Bottleneck is to reduce the channel count to 1/4 using 1x1 convolution, perform the expensive 3x3 convolution, then restore the channels with another 1x1 convolution. Thanks to this structure, ResNet-50 is deeper than ResNet-34 but maintains a similar level of FLOPs.
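The economy of the Bottleneck can be verified with simple arithmetic (biases and BN parameters ignored; a 256-d block input as in conv2_x of ResNet-50):

```python
def conv_params(k, c_in, c_out):
    """Weights of one k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

# Basic-style block operating directly on 256 channels: two 3x3 convs
basic_256 = 2 * conv_params(3, 256, 256)

# Bottleneck: 1x1 (256 -> 64), 3x3 (64 -> 64), 1x1 (64 -> 256)
bottleneck = (conv_params(1, 256, 64)
              + conv_params(3, 64, 64)
              + conv_params(1, 64, 256))

print(basic_256, bottleneck)  # 1179648 69632
```

At the same input/output width, the Bottleneck uses roughly 17x fewer weights, which is what lets ResNet-50 go deeper than ResNet-34 at a similar FLOPs budget.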

6.4 Architecture Comparison Table

| Layer | Output Size | ResNet-18 | ResNet-34 | ResNet-50 | ResNet-101 | ResNet-152 |
|-------|-------------|-----------|-----------|-----------|------------|------------|
| conv1 | 112x112 | 7x7, 64, stride 2 | 7x7, 64, stride 2 | 7x7, 64, stride 2 | 7x7, 64, stride 2 | 7x7, 64, stride 2 |
| pool | 56x56 | 3x3 max pool, stride 2 | 3x3 max pool, stride 2 | 3x3 max pool, stride 2 | 3x3 max pool, stride 2 | 3x3 max pool, stride 2 |
| conv2_x | 56x56 | [3x3, 64] x2 | [3x3, 64] x3 | [1x1, 64; 3x3, 64; 1x1, 256] x3 | [1x1, 64; 3x3, 64; 1x1, 256] x3 | [1x1, 64; 3x3, 64; 1x1, 256] x3 |
| conv3_x | 28x28 | [3x3, 128] x2 | [3x3, 128] x4 | [1x1, 128; 3x3, 128; 1x1, 512] x4 | [1x1, 128; 3x3, 128; 1x1, 512] x4 | [1x1, 128; 3x3, 128; 1x1, 512] x8 |
| conv4_x | 14x14 | [3x3, 256] x2 | [3x3, 256] x6 | [1x1, 256; 3x3, 256; 1x1, 1024] x6 | [1x1, 256; 3x3, 256; 1x1, 1024] x23 | [1x1, 256; 3x3, 256; 1x1, 1024] x36 |
| conv5_x | 7x7 | [3x3, 512] x2 | [3x3, 512] x3 | [1x1, 512; 3x3, 512; 1x1, 2048] x3 | [1x1, 512; 3x3, 512; 1x1, 2048] x3 | [1x1, 512; 3x3, 512; 1x1, 2048] x3 |
| | 1x1 | Global Average Pool, 1000-d FC, Softmax (all variants) | | | | |

6.5 Parameter Count and Computational Cost

| Model | Layers | Parameters | FLOPs |
|-------|--------|------------|-------|
| VGG-19 | 19 | 144M | 19.6B |
| ResNet-18 | 18 | 11.7M | 1.8B |
| ResNet-34 | 34 | 21.8M | 3.6B |
| ResNet-50 | 50 | 25.6M | 3.8B |
| ResNet-101 | 101 | 44.5M | 7.6B |
| ResNet-152 | 152 | 60.2M | 11.3B |

A notable observation is that ResNet-152 is 8 times deeper than VGG-19 yet requires fewer FLOPs and less than half the parameters. This is because VGGNet uses the majority of its parameters in the final Fully Connected layers, whereas ResNet uses Global Average Pooling to dramatically reduce FC layer parameters.
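This claim is easy to verify with back-of-the-envelope arithmetic (biases ignored):

```python
# VGG-19 classifier head: flatten the 512x7x7 feature map into two 4096-d FC layers
vgg_head = 512 * 7 * 7 * 4096 + 4096 * 4096 + 4096 * 1000

# ResNet-50 head: Global Average Pooling leaves a single 2048-d vector
resnet_head = 2048 * 1000

print(vgg_head, resnet_head)  # 123633664 2048000
```

Roughly 124M of VGG-19's ~144M parameters sit in the FC head alone; ResNet's head is about 2M.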


7. Experimental Results

7.1 ImageNet Classification

Confirming Degradation in Plain Networks

First, the paper confirmed the degradation problem in Plain Networks without shortcuts.

| Model | Top-1 Error (%) | Top-5 Error (%) |
|-------|-----------------|-----------------|
| Plain-18 | 27.94 | - |
| Plain-34 | 28.54 | - |

The 34-layer Plain Network shows a top-1 error 0.6 percentage points higher than the 18-layer version. This is the degradation problem.

Effect of Residual Networks

Adding Shortcut Connections to the same architecture:

| Model | Top-1 Error (%) | Top-5 Error (%) |
|-------|-----------------|-----------------|
| ResNet-18 | 27.88 | - |
| ResNet-34 | 25.03 | - |

ResNet-34 achieved a top-1 error 2.85 percentage points lower than ResNet-18. The degradation observed in Plain Networks disappeared, with clear performance improvements as depth increased.

Bottleneck ResNet Results (10-crop Testing)

| Model | Top-1 Error (%) | Top-5 Error (%) |
|-------|-----------------|-----------------|
| ResNet-50 | 22.85 | 6.71 |
| ResNet-101 | 21.75 | 6.05 |
| ResNet-152 | 21.43 | 5.71 |

ResNet-152's single-model Top-5 Error was 4.49% (Multi-scale, Multi-crop), and an ensemble of 6 models achieved 3.57%, securing first place in the ImageNet ILSVRC 2015 Classification track.

Comparison with VGG and GoogLeNet

| Model | Top-5 Error (%) | Ensemble Top-5 Error (%) |
|-------|-----------------|--------------------------|
| VGG-16 | 7.3 | - |
| GoogLeNet | 6.7 | - |
| ResNet-152 (single model) | 4.49 | - |
| ResNet Ensemble (6 models) | - | 3.57 |

7.2 CIFAR-10 Experiments

The degradation problem was also confirmed and ResNet's effectiveness validated on the CIFAR-10 dataset (32x32 images, 10 classes). The CIFAR-10 ResNet differs from the ImageNet version: the first layer is a 3x3 convolution, and it uses {n, n, n} Residual Blocks across 3 stages (with feature map sizes of 32x32, 16x16, and 8x8 respectively).

| Model | Layers | Error (%) |
|-------|--------|-----------|
| ResNet-20 | 20 | 8.75 |
| ResNet-32 | 32 | 7.51 |
| ResNet-44 | 44 | 7.17 |
| ResNet-56 | 56 | 6.97 |
| ResNet-110 | 110 | 6.43 |
| ResNet-1202 | 1202 | 7.93 |

Performance improved consistently up to 110 layers. The 1202-layer network reached a similarly low training error but a higher test error than the 110-layer model, which the paper attributed to overfitting: at 19.4M parameters it was excessively large for the small dataset (50,000 training images). The paper did not apply regularization techniques such as Dropout and noted that doing so could yield improvements.

7.3 COCO Object Detection and Segmentation

ResNet's effectiveness was validated beyond Image Classification in Object Detection and Segmentation.

PASCAL VOC and COCO Detection

Replacing Faster R-CNN's backbone from VGG-16 to ResNet-101:

  • COCO Detection: mAP@[.5, .95] improved by 6.0 points over VGG-16 (a 28% relative improvement)
  • 1st place in ILSVRC 2015 Detection Task
  • 1st place in COCO 2015 Detection Task

COCO Segmentation

  • 1st place in COCO 2015 Segmentation Task

These results demonstrated that ResNet is not merely a classification-specific model but serves as a general-purpose feature extractor providing strong performance across diverse vision tasks.


8. Implementation Details

8.1 He Initialization

Kaiming He, the first author of the ResNet paper, had already proposed a weight initialization method suitable for ReLU networks prior to ResNet ("Delving Deep into Rectifiers", He et al., 2015).

W \sim \mathcal{N}\!\left(0, \frac{2}{n_{\text{in}}}\right)

Where n_in is the number of input units of the layer. The variance is set to 2/n_in to compensate for ReLU zeroing out the negative half of its input. Xavier Initialization (variance 1/n_in) suits Sigmoid/Tanh, but with ReLU it causes the activation variance to shrink layer by layer.

He Initialization maintains consistent output variance across layers, preventing signal vanishing or explosion during the forward pass.
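The variance argument can be observed empirically. A sketch (arbitrary width and depth, fan-in formulation as in the text) propagating random inputs through a ReLU MLP under both initializations:

```python
import torch

torch.manual_seed(0)
dim, depth = 512, 20

def output_std(weight_var):
    """Std of activations after `depth` random ReLU layers."""
    h = torch.randn(4096, dim)
    for _ in range(depth):
        W = torch.randn(dim, dim) * weight_var ** 0.5
        h = torch.relu(h @ W.t())
    return h.std().item()

he = output_std(2.0 / dim)      # He: Var(W) = 2 / n_in
xavier = output_std(1.0 / dim)  # Xavier: Var(W) = 1 / n_in

print(he, xavier)  # He keeps the signal scale roughly constant; Xavier shrinks it by ~2^(depth/2)
```

Under He initialization the activation scale stays near its initial value after 20 layers, while under Xavier it collapses by roughly three orders of magnitude.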

8.2 Batch Normalization

ResNet applies Batch Normalization (BN) after every convolution layer.

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta

Where:

  • μ_B, σ_B²: mean and variance of the mini-batch
  • γ, β: learnable scale and shift parameters
  • ε: a small constant for numerical stability

The roles of BN are:

  • Mitigating Internal Covariate Shift: Stabilizes training by normalizing the input distribution of each layer
  • Regularization Effect: Mini-batch-level normalization adds slight noise, acting as a regularizer
  • Enabling Higher Learning Rates: Stabilized distributions allow the use of higher learning rates

In ResNet, the ordering of BN is Conv -> BN -> ReLU (post-activation). This ordering was later improved in the follow-up Pre-activation ResNet.
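The two formulas above map directly onto a few tensor operations. A minimal sketch reproducing nn.BatchNorm2d's training-mode output by hand:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 16, 4, 4)        # (N, C, H, W)

bn = nn.BatchNorm2d(16)             # gamma = 1, beta = 0 at initialization
bn.train()                          # training mode: normalize with batch statistics
y_ref = bn(x)

# Manual computation: normalize each channel over the (N, H, W) axes
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
x_hat = (x - mu) / torch.sqrt(var + bn.eps)
y = bn.weight.view(1, -1, 1, 1) * x_hat + bn.bias.view(1, -1, 1, 1)

print(torch.allclose(y, y_ref, atol=1e-5))  # True
```

Note that training-mode normalization uses the biased batch variance; the running statistics BN accumulates are used only at inference time.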

8.3 Training Schedule

Training configuration on ImageNet:

| Hyperparameter | Value |
|----------------|-------|
| Optimizer | SGD with Momentum |
| Momentum | 0.9 |
| Weight Decay | 0.0001 |
| Batch Size | 256 |
| Initial Learning Rate | 0.1 |
| LR Schedule | Divided by 10 every 30 epochs |
| Total Epochs | ~90 |
| Data Augmentation | Random Crop (224x224), Horizontal Flip, Color Jittering |
| Preprocessing | Per-pixel Mean Subtraction |

During training, images are randomly resized so that the shorter side falls in [256, 480], then randomly cropped to 224x224. At test time, 10-crop testing (4 corners + center, plus horizontal flips of each) is used, and for multi-scale testing, fully convolutional inference is performed with the shorter side in {224, 256, 384, 480, 640}.
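10-crop evaluation is simple to sketch with tensor slicing (a hand-rolled illustration; torchvision also ships a TenCrop transform):

```python
import torch

def ten_crop(img, size):
    """4 corner + 1 center crops, plus their horizontal flips (10 total)."""
    c, h, w = img.shape
    s = size
    crops = [
        img[:, :s, :s], img[:, :s, w - s:],        # top-left, top-right
        img[:, h - s:, :s], img[:, h - s:, w - s:],  # bottom-left, bottom-right
        img[:, (h - s) // 2:(h - s) // 2 + s,
               (w - s) // 2:(w - s) // 2 + s],       # center
    ]
    crops += [torch.flip(cr, dims=[2]) for cr in crops]  # horizontal flips
    return torch.stack(crops)

img = torch.randn(3, 256, 256)
batch = ten_crop(img, 224)
print(batch.shape)  # torch.Size([10, 3, 224, 224])
# at test time, the model's logits are averaged over the 10 crops
```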

8.4 Absence of Dropout

Interestingly, ResNet does not use Dropout. Batch Normalization provides sufficient regularization, and the Bottleneck structure inherently limits the number of parameters. Global Average Pooling also dramatically reduces FC layer parameters, lowering the risk of overfitting.


9. PyTorch Implementation

9.1 Basic Block

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Basic Residual Block used in ResNet-18 and ResNet-34"""
    expansion = 1  # output channels = input channels * expansion

    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        # First 3x3 Conv (downsampling possible via stride)
        self.conv1 = nn.Conv2d(
            in_channels, out_channels, kernel_size=3,
            stride=stride, padding=1, bias=False
        )
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

        # Second 3x3 Conv
        self.conv2 = nn.Conv2d(
            out_channels, out_channels, kernel_size=3,
            stride=1, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Shortcut: 1x1 Conv projection when dimensions differ
        self.downsample = downsample

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        # Shortcut Connection
        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity  # F(x) + x
        out = self.relu(out)

        return out

9.2 Bottleneck Block

class Bottleneck(nn.Module):
    """Bottleneck Block used in ResNet-50, ResNet-101, ResNet-152"""
    expansion = 4  # output channels = mid channels * 4

    def __init__(self, in_channels, mid_channels, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        out_channels = mid_channels * self.expansion

        # 1x1 Conv: Channel reduction (Squeeze)
        self.conv1 = nn.Conv2d(
            in_channels, mid_channels, kernel_size=1, bias=False
        )
        self.bn1 = nn.BatchNorm2d(mid_channels)

        # 3x3 Conv: Spatial processing (downsampling possible via stride)
        self.conv2 = nn.Conv2d(
            mid_channels, mid_channels, kernel_size=3,
            stride=stride, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(mid_channels)

        # 1x1 Conv: Channel restoration (Expand)
        self.conv3 = nn.Conv2d(
            mid_channels, out_channels, kernel_size=1, bias=False
        )
        self.bn3 = nn.BatchNorm2d(out_channels)

        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample

    def forward(self, x):
        identity = x

        # 1x1 -> BN -> ReLU
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        # 3x3 -> BN -> ReLU
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        # 1x1 -> BN
        out = self.conv3(out)
        out = self.bn3(out)

        # Shortcut Connection
        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity  # F(x) + x
        out = self.relu(out)

        return out

9.3 Full ResNet Model

class ResNet(nn.Module):
    def __init__(self, block, layers, num_classes=1000):
        """
        Args:
            block: BasicBlock or Bottleneck
            layers: Number of blocks per stage [conv2_x, conv3_x, conv4_x, conv5_x]
            num_classes: Number of classification classes
        """
        super(ResNet, self).__init__()
        self.in_channels = 64

        # conv1: 7x7, stride 2
        self.conv1 = nn.Conv2d(
            3, 64, kernel_size=7, stride=2, padding=3, bias=False
        )
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        # conv2_x through conv5_x
        self.layer1 = self._make_layer(block, 64, layers[0], stride=1)
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)

        # Classification Head
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        # He Initialization
        self._initialize_weights()

    def _make_layer(self, block, mid_channels, num_blocks, stride):
        downsample = None
        out_channels = mid_channels * block.expansion

        # Downsampling needed at the first block
        if stride != 1 or self.in_channels != out_channels:
            downsample = nn.Sequential(
                nn.Conv2d(
                    self.in_channels, out_channels,
                    kernel_size=1, stride=stride, bias=False
                ),
                nn.BatchNorm2d(out_channels),
            )

        layers = []
        layers.append(block(self.in_channels, mid_channels, stride, downsample))
        self.in_channels = out_channels

        for _ in range(1, num_blocks):
            layers.append(block(self.in_channels, mid_channels))

        return nn.Sequential(*layers)

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # He Initialization (fan_out mode)
                nn.init.kaiming_normal_(
                    m.weight, mode='fan_out', nonlinearity='relu'
                )
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        # Stem
        x = self.conv1(x)       # 224x224 -> 112x112
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)     # 112x112 -> 56x56

        # Residual Stages
        x = self.layer1(x)      # 56x56
        x = self.layer2(x)      # 28x28
        x = self.layer3(x)      # 14x14
        x = self.layer4(x)      # 7x7

        # Classification Head
        x = self.avgpool(x)     # 1x1
        x = torch.flatten(x, 1)
        x = self.fc(x)

        return x


# Model creation functions
def resnet18(num_classes=1000):
    return ResNet(BasicBlock, [2, 2, 2, 2], num_classes)

def resnet34(num_classes=1000):
    return ResNet(BasicBlock, [3, 4, 6, 3], num_classes)

def resnet50(num_classes=1000):
    return ResNet(Bottleneck, [3, 4, 6, 3], num_classes)

def resnet101(num_classes=1000):
    return ResNet(Bottleneck, [3, 4, 23, 3], num_classes)

def resnet152(num_classes=1000):
    return ResNet(Bottleneck, [3, 8, 36, 3], num_classes)

9.4 Usage Example

# Create ResNet-50 and perform a Forward Pass
model = resnet50(num_classes=1000)
x = torch.randn(1, 3, 224, 224)  # Batch=1, RGB, 224x224
output = model(x)
print(f"Output shape: {output.shape}")  # torch.Size([1, 1000])

# Check parameter count
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")
# ResNet-50: approximately 25,557,032 (25.6M)

# Using PyTorch official pre-trained model
import torchvision.models as models
resnet50_pretrained = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

10. Pre-activation ResNet: Identity Mappings in Deep Residual Networks

10.1 Motivation for the Follow-up Paper

Shortly after the ResNet paper was published, the same authors (He et al.) presented "Identity Mappings in Deep Residual Networks" at ECCV 2016. This paper demonstrated that rearranging the order of operations within a Residual Block enables more effective training of even deeper networks (1001 layers).

10.2 Original vs Pre-activation

Original ResNet (Post-activation):

\mathbf{x}_{l+1} = \text{ReLU}(\mathcal{F}(\mathbf{x}_l) + \mathbf{x}_l)

In this structure, ReLU sits after the addition, so signals passing along the shortcut path are also affected by ReLU. The shortcut is therefore not a true identity mapping, which weakens the Gradient Highway effect.

Pre-activation ResNet:

\mathbf{x}_{l+1} = \mathbf{x}_l + \mathcal{F}(\text{ReLU}(\text{BN}(\mathbf{x}_l)))

The operation order inside the block becomes BN -> ReLU -> Conv -> BN -> ReLU -> Conv, which leaves the shortcut path a pure identity mapping.

10.3 Structure Comparison

[Original ResNet]                [Pre-activation ResNet]
Input --+                        Input --+
        |                                |
     Conv                             BN
        |                                |
      BN                             ReLU
        |                                |
     ReLU                            Conv
        |                                |
     Conv                             BN
        |                                |
      BN                             ReLU
        |                                |
   (+) <-+ shortcut                   Conv
        |                                |
     ReLU                           (+) <-+ shortcut
        |                                |
     Output                          Output

10.4 Mathematical Advantages

For the pre-activation structure, forward propagation is:

\mathbf{x}_L = \mathbf{x}_l + \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i)

Backward propagation:

\frac{\partial \mathcal{L}}{\partial \mathbf{x}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}_L} \cdot \left(1 + \frac{\partial}{\partial \mathbf{x}_l} \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i)\right)

Since the shortcut is a pure identity, the constant term 1 is exactly preserved. In the original ResNet, the intervening ReLU causes this constant term to deviate from exactly 1.

10.5 Experimental Results

| Model | CIFAR-10 Error (%) | CIFAR-100 Error (%) |
|-------|--------------------|---------------------|
| ResNet-110 (original) | 6.43 | - |
| ResNet-1001 (original) | 7.61 | - |
| ResNet-1001 (pre-activation) | 4.62 | 22.71 |

The pre-activation structure showed particularly significant performance improvements in very deep networks (1001 layers) compared to the original structure. This experimentally confirmed that pure identity mappings are critically important for gradient flow.

10.6 PyTorch Implementation

class PreActBasicBlock(nn.Module):
    """Pre-activation Basic Block (BN -> ReLU -> Conv)"""
    expansion = 1

    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(PreActBasicBlock, self).__init__()
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv1 = nn.Conv2d(
            in_channels, out_channels, kernel_size=3,
            stride=stride, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(
            out_channels, out_channels, kernel_size=3,
            stride=1, padding=1, bias=False
        )
        self.downsample = downsample

    def forward(self, x):
        identity = x

        # Pre-activation: BN -> ReLU -> Conv
        out = self.bn1(x)
        out = self.relu(out)

        if self.downsample is not None:
            identity = self.downsample(out)

        out = self.conv1(out)
        out = self.bn2(out)
        out = self.relu(out)
        out = self.conv2(out)

        out += identity
        return out

11. Impact and Follow-up Research

The Residual Learning paradigm proposed by ResNet became the foundation for numerous subsequent architectural innovations. Let us examine the major follow-up works.

11.1 ResNeXt (2017)

"Aggregated Residual Transformations for Deep Neural Networks" - Xie et al., Facebook AI Research

ResNeXt introduced a new dimension called Cardinality (number of groups) to ResNet's Residual Block. It performs multiple transformation paths in parallel within a single block, then sums them.

\mathcal{F}(\mathbf{x}) = \sum_{i=1}^{C} \mathcal{T}_i(\mathbf{x})

Where C is the cardinality (e.g., 32) and T_i is the transformation along each path. In practice, this is implemented efficiently using Grouped Convolution.

ResNeXt-101 (32x4d) achieved higher accuracy than ResNet-101 with the same computational cost, demonstrating that cardinality is a more effective dimension than width (channel count) or depth (layer count).
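In practice, the sum of the $C$ parallel transformations collapses into a single grouped convolution. Below is a minimal sketch of a ResNeXt-style bottleneck (class and parameter names are illustrative, not taken from the paper's code):

```python
import torch
import torch.nn as nn

class ResNeXtBottleneck(nn.Module):
    """Illustrative ResNeXt bottleneck: the 3x3 grouped conv realizes
    the sum of cardinality=32 parallel transformations in one operation."""
    def __init__(self, channels=256, width=128, cardinality=32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, width, kernel_size=1, bias=False),
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
            # groups=cardinality splits the channels into 32 independent paths
            nn.Conv2d(width, width, kernel_size=3, padding=1,
                      groups=cardinality, bias=False),
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Residual connection, exactly as in the original ResNet
        return self.relu(x + self.block(x))

x = torch.randn(1, 256, 14, 14)
y = ResNeXtBottleneck()(x)
print(y.shape)  # torch.Size([1, 256, 14, 14])
```

Setting `groups=32` on the middle convolution is what the "32x4d" in the model name refers to: 32 paths, each 4 channels wide per 128-channel bottleneck.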

11.2 DenseNet (2017)

"Densely Connected Convolutional Networks" - Huang et al., Cornell/Facebook

DenseNet extends Residual Connections to the extreme. Each layer is directly connected to all preceding layers. While ResNet uses element-wise addition, DenseNet uses channel-wise concatenation.

$$\mathbf{x}_l = \mathcal{H}_l([\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_{l-1}])$$

This structure maximizes feature reuse and improves parameter efficiency. DenseNet-121 achieved comparable performance to ResNet-50 with fewer parameters.
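The concatenation-based growth can be sketched in a few lines; the class name and growth rate below are illustrative, not DenseNet's reference implementation:

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """Illustrative dense layer: each layer sees the concatenation of all
    previous feature maps and contributes `growth_rate` new channels."""
    def __init__(self, in_channels, growth_rate=32):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_channels, growth_rate,
                              kernel_size=3, padding=1, bias=False)

    def forward(self, features):  # features: list of all earlier outputs
        x = torch.cat(features, dim=1)  # channel-wise concat, not addition
        return self.conv(self.relu(self.bn(x)))

growth = 32
features = [torch.randn(1, 64, 8, 8)]  # x_0
for l in range(3):
    in_ch = 64 + l * growth            # input channels grow linearly
    features.append(DenseLayer(in_ch, growth)(features))

# After 3 layers the block exposes 64 + 3*32 channels in total
total_channels = sum(f.shape[1] for f in features)
print(total_channels)  # 160
```

The linearly growing input width is the source of both DenseNet's parameter efficiency and its memory pressure mentioned in Section 13.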

11.3 SENet (2018)

"Squeeze-and-Excitation Networks" - Hu et al., Momenta

SENet added an SE Module to Residual Blocks that models inter-channel relationships. It learns the importance of each channel and recalibrates the weights accordingly.

$$\mathbf{s} = \sigma(\mathbf{W}_2 \cdot \text{ReLU}(\mathbf{W}_1 \cdot \text{GAP}(\mathbf{x})))$$

$$\tilde{\mathbf{x}} = \mathbf{s} \odot \mathbf{x}$$

Where $\text{GAP}$ is Global Average Pooling, $\sigma$ is the Sigmoid function, and $\odot$ is channel-wise multiplication. SENet won first place in ILSVRC 2017 Classification with a Top-5 Error of 2.251%.
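The squeeze-excitation-recalibrate pipeline maps directly onto a few PyTorch layers; this is a minimal sketch (the class name and reduction ratio of 16 follow common convention, not a specific codebase):

```python
import torch
import torch.nn as nn

class SEModule(nn.Module):
    """Illustrative Squeeze-and-Excitation module: global average pooling
    (squeeze), a two-layer bottleneck MLP (excitation), then channel-wise
    rescaling of the input."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # GAP -> (N, C, 1, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W1
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # W2
            nn.Sigmoid(),                                # sigma
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        s = self.fc(self.pool(x).view(n, c))  # per-channel weights in (0, 1)
        return x * s.view(n, c, 1, 1)         # channel-wise multiplication

x = torch.randn(2, 64, 7, 7)
out = SEModule(64)(x)
print(out.shape)  # torch.Size([2, 64, 7, 7])
```

In an SE-ResNet block, this module is applied to the residual branch output just before the addition with the shortcut.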

11.4 EfficientNet (2019)

"EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks" - Tan & Le, Google Brain

EfficientNet proposed Compound Scaling, which scales width, depth, and resolution simultaneously in a balanced manner. It is based on MBConv (Mobile Inverted Bottleneck) blocks, which also use Residual Connections.

$$\text{depth}: d = \alpha^\phi, \quad \text{width}: w = \beta^\phi, \quad \text{resolution}: r = \gamma^\phi$$

$$\text{s.t.} \quad \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$$

EfficientNet-B7 achieved state-of-the-art ImageNet accuracy at the time, with 8.4 times fewer parameters than the best previous ConvNet.
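The compound-scaling arithmetic can be checked directly. The coefficients below ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$) are the values the paper reports from its grid search at $\phi = 1$:

```python
# Compound-scaling arithmetic (a worked example, not EfficientNet code).
alpha, beta, gamma = 1.2, 1.1, 1.15

def scale_factors(phi):
    """Depth/width/resolution multipliers for compound coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

# The constraint alpha * beta^2 * gamma^2 ~= 2 means each unit increase
# of phi roughly doubles total FLOPs (depth scales FLOPs linearly; width
# and resolution scale them quadratically).
flops_growth = alpha * beta**2 * gamma**2
print(round(flops_growth, 3))  # 1.92

d, w, r = scale_factors(2)
print(round(d, 2), round(w, 2), round(r, 2))  # 1.44 1.21 1.32
```

Each EfficientNet variant (B0 through B7) corresponds to a larger value of $\phi$, scaling all three dimensions together instead of tuning one at a time.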

11.5 ConvNeXt (2022)

"A ConvNet for the 2020s" - Liu et al., Facebook AI Research

ConvNeXt applied Vision Transformer (ViT) design principles to CNNs to create a "modernized ResNet." Starting from ResNet-50, the following changes were applied sequentially:

  1. Modernized Training Recipe (300 epochs, AdamW, Mixup, Cutmix, etc.)
  2. Stage Ratio Change: (3, 4, 6, 3) -> (3, 3, 9, 3)
  3. "Patchify" Stem: 7x7 Conv -> 4x4 Conv, stride 4
  4. ResNeXt-style Grouped Convolution
  5. Inverted Bottleneck
  6. Large Kernel Size: 3x3 -> 7x7 Depthwise Conv
  7. Activation Function: ReLU -> GELU
  8. Normalization: BN -> Layer Normalization

ConvNeXt-T reached 82.1% top-1 accuracy, matching or exceeding Swin-T and demonstrating that pure CNN architectures can compete with Transformers. This research reaffirmed how robust ResNet's design foundation is.
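Several of the changes listed above come together in a single block; this is a simplified sketch of the resulting structure (omitting details such as layer scale and stochastic depth):

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Simplified ConvNeXt block combining the modernizations above:
    7x7 depthwise conv, LayerNorm, an inverted bottleneck (4x expansion)
    with GELU, and the familiar residual connection."""
    def __init__(self, dim=96):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)           # normalizes over channels
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # inverted bottleneck: expand
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)  # project back down

    def forward(self, x):                        # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # (N, H, W, C) for LN/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        return shortcut + x.permute(0, 3, 1, 2)  # back to (N, C, H, W)

x = torch.randn(1, 96, 14, 14)
y = ConvNeXtBlock()(x)
print(y.shape)  # torch.Size([1, 96, 14, 14])
```

Despite every operation inside the block changing since 2015, the outermost structure is still `shortcut + F(x)`.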


12. Residual Connections in Modern Architectures

12.1 Skip Connections in Transformers

The Transformer architecture proposed in Vaswani et al.'s "Attention Is All You Need" (2017) uses Residual Connections in every sub-layer.

$$\text{Output} = \text{LayerNorm}(\mathbf{x} + \text{SubLayer}(\mathbf{x}))$$

Where SubLayer is either Multi-Head Attention or a Feed-Forward Network. For the same reasons proven in ResNet, training deep Transformers without these Residual Connections is virtually impossible.

12.2 Pre-LayerNorm and Post-LayerNorm

The Transformer world has a similar debate to ResNet's pre-activation vs. post-activation.

Post-LayerNorm (original Transformer):

$$\mathbf{x}_{l+1} = \text{LN}(\mathbf{x}_l + \text{SubLayer}(\mathbf{x}_l))$$

Pre-LayerNorm (standard since GPT-2):

$$\mathbf{x}_{l+1} = \mathbf{x}_l + \text{SubLayer}(\text{LN}(\mathbf{x}_l))$$

Pre-LayerNorm operates on the same principle as ResNet's pre-activation, keeping the shortcut path as a pure identity to improve gradient flow. Most modern large language models, from GPT-2 and GPT-3 onward, use Pre-LayerNorm.
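The two placements can be sketched side by side. A toy sub-layer (a small feed-forward network; the class name is illustrative) is enough to show where the LayerNorm sits relative to the shortcut:

```python
import torch
import torch.nn as nn

class TransformerSubLayer(nn.Module):
    """Illustrative residual sub-layer showing both normalization
    placements. The sub-layer here is a simple feed-forward network."""
    def __init__(self, d_model=64, pre_norm=True):
        super().__init__()
        self.pre_norm = pre_norm
        self.ln = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        if self.pre_norm:
            # Pre-LN: the shortcut path x stays a pure identity
            return x + self.ff(self.ln(x))
        # Post-LN: LayerNorm sits on the residual path itself
        return self.ln(x + self.ff(x))

x = torch.randn(2, 10, 64)
print(TransformerSubLayer(pre_norm=True)(x).shape)   # torch.Size([2, 10, 64])
print(TransformerSubLayer(pre_norm=False)(x).shape)  # torch.Size([2, 10, 64])
```

The only structural difference is whether `ln` is applied before the sub-layer (leaving the shortcut untouched) or after the addition (placing a transformation on the shortcut path), mirroring pre-activation vs. post-activation in ResNet.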

12.3 Residual Connections in Diffusion Models

The U-Net architecture of Denoising Diffusion Probabilistic Models (DDPM) also uses Skip Connections in each Residual Block. Long skip connections between the U-Net's encoder and decoder are combined with residual skip connections within blocks, enabling effective utilization of features at various scales.

12.4 Vision Transformer (ViT)

ViT (Vision Transformer) divides images into 16x16 patches and feeds them into a Transformer Encoder. Each Transformer block naturally uses Residual Connections, and without them, training ViTs with more than 12 layers becomes difficult.

12.5 Key Lessons

The most important legacy of ResNet is not a specific architecture but the design principle of Residual Connections. This principle can be summarized as follows:

  1. Set identity mapping as the default: Even if the network learns nothing, it should at least be able to pass the input through unchanged.
  2. Ensure a Gradient Highway: Create paths through which gradients can flow directly from the loss to every layer.
  3. Depth is a free dimension: With Residual Connections, added layers can always fall back to identity, so making a network deeper should at worst leave training error unchanged rather than degrade it.

This principle applies universally regardless of architecture type, spanning CNNs, Transformers, Diffusion Models, State Space Models, and beyond.
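The "Gradient Highway" lesson can be verified with a small experiment: stack the same layers with and without identity shortcuts and compare the gradient magnitude that reaches the input (a toy sketch, not an experiment from the paper):

```python
import torch
import torch.nn as nn

def input_grad_norm(residual, depth=50, dim=32):
    """Gradient norm at the input of a deep stack of linear+tanh layers,
    with or without identity shortcuts. Same seed, so both variants use
    identical weights and input."""
    torch.manual_seed(0)
    layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))
    x = torch.randn(1, dim, requires_grad=True)
    h = x
    for layer in layers:
        f = torch.tanh(layer(h))
        h = h + f if residual else f  # the only difference between variants
    h.sum().backward()
    return x.grad.norm().item()

# Without shortcuts the gradient must pass through every layer's Jacobian
# and typically shrinks toward zero; with shortcuts each Jacobian is
# (I + something), so a direct path from the loss to the input survives.
print(input_grad_norm(residual=False))  # tiny
print(input_grad_norm(residual=True))   # orders of magnitude larger
```

The same mechanism explains why all the architectures in this section, from ResNeXt to Transformers, keep the shortcut path intact.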


13. Limitations and Criticisms

13.1 Inefficiency of Feature Reuse

According to the study by Veit et al. (2016), "Residual Networks Behave Like Ensembles of Relatively Shallow Networks," most of the effective information and gradient flow in ResNet passes through relatively short paths, and the contribution of very long paths through many layers is minimal. This raises questions about whether all 152 layers are being utilized efficiently.

13.2 Element-wise Addition of Feature Maps

The authors of DenseNet argued that ResNet's element-wise addition can cause information loss. They contended that concatenation-based DenseNet enables more efficient feature reuse. However, concatenation suffers from the problem of rapidly increasing memory usage.

13.3 Computational Overhead

While parameter counts were reduced through Global Average Pooling and Bottleneck structures, the actual inference speed of very deep ResNets (ResNet-152) is not necessarily faster than VGGNet. Memory access costs and sequential dependencies can become practical bottlenecks.


14. Summary

ResNet is one of the most important papers in the history of deep learning. Its contributions can be summarized as follows:

  1. Discovery and Definition of the Degradation Problem: Clearly identified the phenomenon where training error increases in deeper networks.

  2. Residual Learning Framework: The $\mathcal{F}(\mathbf{x}) + \mathbf{x}$ structure, which explicitly includes identity mapping, enabled successful training of networks with hundreds of layers.

  3. Gradient Highway Theory: Presented the mathematical mechanism by which Skip Connections directly propagate gradients.

  4. Bottleneck Structure: Achieved both depth and efficiency through channel reduction/restoration using 1x1 convolutions.

  5. Overwhelming Experimental Results: Dominated all existing methods on ImageNet (3.57% top-5 error), CIFAR-10, and COCO Detection/Segmentation.

  6. Universal Design Principle: Residual Connections have become an essential element in all major modern deep learning architectures, including Transformers and Diffusion Models.

A single simple addition operation ($+\,\mathbf{x}$) broke through the depth barrier of deep learning and laid the foundation for a decade of AI advancement. ResNet is a quintessential example demonstrating that sometimes the simplest ideas are the most powerful.


15. References

  1. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR 2016. arXiv:1512.03385

  2. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Identity Mappings in Deep Residual Networks. ECCV 2016. arXiv:1603.05027

  3. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV 2015. arXiv:1502.01852

  4. Simonyan, K., & Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015. arXiv:1409.1556

  5. Szegedy, C., et al. (2015). Going Deeper with Convolutions. CVPR 2015. arXiv:1409.4842

  6. Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015. arXiv:1502.03167

  7. Xie, S., et al. (2017). Aggregated Residual Transformations for Deep Neural Networks. CVPR 2017. arXiv:1611.05431

  8. Huang, G., et al. (2017). Densely Connected Convolutional Networks. CVPR 2017. arXiv:1608.06993

  9. Hu, J., et al. (2018). Squeeze-and-Excitation Networks. CVPR 2018. arXiv:1709.01507

  10. Tan, M., & Le, Q. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. ICML 2019. arXiv:1905.11946

  11. Liu, Z., et al. (2022). A ConvNet for the 2020s. CVPR 2022. arXiv:2201.03545

  12. Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017. arXiv:1706.03762

  13. Veit, A., et al. (2016). Residual Networks Behave Like Ensembles of Relatively Shallow Networks. NeurIPS 2016. arXiv:1605.06431

  14. KaimingHe/deep-residual-networks. GitHub Repository

  15. ILSVRC 2015 Results. ImageNet Challenge