
ResNet Paper In-Depth Analysis: How Residual Connections Broke the Depth Barrier in Deep Learning


1. Paper Overview

"Deep Residual Learning for Image Recognition" was published in 2015 by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun of Microsoft Research. It won the Best Paper Award at CVPR 2016 and, with over 200,000 citations as of 2025, is one of the most influential papers in the history of deep learning.

The problem the paper tackles is clear: "If deeper networks are supposed to perform better, why do they actually perform worse?" The Residual Learning Framework, proposed as the answer to this simple question, successfully trained a 152-layer network and took first place in the ImageNet ILSVRC 2015 Classification Task with a top-5 error of 3.57%, well below the human recognition error rate (about 5.1%).

ResNet did not stop at winning Image Classification. That same year it swept first place in five tracks in total: ImageNet Classification, ImageNet Detection, ImageNet Localization, COCO Detection, and COCO Segmentation. The Skip Connection (Shortcut Connection) it introduced has since become a core building block of virtually every modern deep learning architecture, including the Transformer, BERT, GPT, and Diffusion Models.


2. Background: The Depth Dilemma

2.1 The Era of Deep Networks

After AlexNet (8 layers) won the 2012 ImageNet Challenge, the deep learning community followed the intuition that "deeper networks = better performance." For the most part, this intuition held.

| Year | Model | Layers | Top-5 Error (%) |
|------|-------|--------|-----------------|
| 2012 | AlexNet | 8 | 16.4 |
| 2013 | ZFNet | 8 | 14.8 |
| 2014 | VGGNet | 19 | 7.3 |
| 2014 | GoogLeNet (Inception v1) | 22 | 6.7 |
| 2015 | ResNet | 152 | 3.57 |

VGGNet (2014) pushed depth to 19 layers using only 3x3 convolutions, and GoogLeNet built a 22-layer network around the parallel Inception Module. Both models showed experimentally that depth is decisive for performance.

2.2 Lessons and Limits of VGGNet

VGGNet established an important architectural principle: stacking several small filters (3x3) instead of one large filter (5x5, 7x7) achieves the same receptive field with fewer parameters, while inserting more nonlinear activations in between increases representational power.

However, 19 layers was VGGNet's practical limit. The gain from VGG-16 to VGG-19 was already small, and adding further depth actually hurt performance. Parameter count was also a problem: VGG-19 requires about 144 million parameters and 19.6 billion FLOPs.

2.3 The GoogLeNet (Inception) Approach

GoogLeNet approached the depth problem from a different direction than VGGNet. It designed the Inception Module, which runs 1x1, 3x3, and 5x5 convolutions in parallel, and used 1x1 convolutions to reduce channel counts and save computation. At only 22 layers, it reached a lower error rate than VGGNet with far fewer parameters (about 5 million).

However, the Inception Module's complex structure limited its scalability: simply stacking more layers was not enough to push performance further.

2.4 The Fundamental Question

At this point the community faced a fundamental question:

"Is there a way to increase network depth freely?"

ResNet is the answer to that question.


3. Discovering the Degradation Problem

3.1 Deeper != Better

One of the most important contributions of the ResNet paper is that it clearly defined and experimentally demonstrated the Degradation problem.

Intuitively, adding identity-mapping layers to a shallow network should yield at least the shallow network's performance, since the added layers only need to pass their input through unchanged. A deeper network's training error should therefore never be higher than a shallower one's.

Reality was different. On both CIFAR-10 and ImageNet, the paper observed that a 56-layer Plain Network (a network without shortcuts) had higher training error than a 20-layer one. This is not overfitting: with overfitting, training error would be low and only validation error high. High training error means that optimization itself is difficult.

3.2 How This Differs from Vanishing/Exploding Gradients

The Degradation problem is distinct from the Vanishing or Exploding Gradient problem.

Vanishing and exploding gradients were largely addressed by techniques such as Batch Normalization and He Initialization. The paper's Plain Networks used these techniques and did converge; the problem is that they converged to a worse solution than shallower networks:

\text{Training Error}_{56\text{-layer plain}} > \text{Training Error}_{20\text{-layer plain}}

This suggests that even when gradients propagate well, learning an identity mapping with a stack of nonlinear layers is itself very hard in a deep network.

3.3 The Construction Argument

The paper's central argument is a construction argument. Consider the following:

  1. Take a shallow network A.
  2. Build a deeper network B by stacking identity-mapping layers on top of A.
  3. B should perform at least as well as A (the added layers are the identity function).
  4. Therefore B's training error cannot be higher than A's.

In practice, however, B's training error is higher than A's. This means current SGD-based optimizers fail to find such a solution. The problem lies not in the model's expressive power but in optimization difficulty.


4. Core Idea: Residual Learning

4.1 Core Intuition

If the cause of Degradation is that "identity mappings are hard to learn," the fix is simple: build the identity mapping into the network explicitly.

Let \mathcal{H}(\mathbf{x}) be the function a block of the original network must learn. The original goal is to learn \mathcal{H}(\mathbf{x}) directly. If \mathcal{H}(\mathbf{x}) = \mathbf{x} (an identity mapping), learning it with a stack of nonlinear layers is difficult.

ResNet's core idea is to learn the residual instead of learning \mathcal{H}(\mathbf{x}) directly:

\mathcal{F}(\mathbf{x}) := \mathcal{H}(\mathbf{x}) - \mathbf{x}

and therefore:

\mathcal{H}(\mathbf{x}) = \mathcal{F}(\mathbf{x}) + \mathbf{x}

If the optimal mapping is close to the identity, driving \mathcal{F}(\mathbf{x}) toward 0 is far easier than making \mathcal{H}(\mathbf{x}) the identity: pushing the weights of nonlinear layers toward zero is a natural thing for an optimizer to do.
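To make the intuition concrete, here is a minimal sketch (a hypothetical two-layer fully connected block stands in for the convolutional stack) showing that a residual block with zero weights is exactly the identity, while a plain block with zero weights destroys its input:

```python
import torch
import torch.nn as nn

dim = 8
w1 = nn.Linear(dim, dim, bias=False)
w2 = nn.Linear(dim, dim, bias=False)
nn.init.zeros_(w1.weight)   # drive the residual branch to zero
nn.init.zeros_(w2.weight)

x = torch.randn(4, dim)
residual_out = x + w2(torch.relu(w1(x)))  # F(x) + x -> exactly x
plain_out = w2(torch.relu(w1(x)))         # H(x) alone -> all zeros

assert torch.equal(residual_out, x)       # the residual block IS the identity
assert torch.equal(plain_out, torch.zeros_like(x))
```

With the shortcut in place, "do nothing" is the default behavior rather than something the optimizer has to discover.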

4.2 Structure of the Residual Block

A Residual Block has the following form:

\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}

where:

  • \mathbf{x}: the block's input
  • \mathcal{F}(\mathbf{x}, \{W_i\}): the residual function to be learned (two or three convolution layers)
  • \mathbf{y}: the block's output

The addition in \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x} is element-wise and is called a Shortcut Connection or Skip Connection. It requires no extra parameters, and its added computation is negligible.

For a two-layer Residual Block:

\mathcal{F} = W_2 \, \sigma(W_1 \mathbf{x})

where \sigma is the ReLU activation; biases are omitted for notational convenience.

4.3 Handling Dimension Mismatch

When \mathcal{F}(\mathbf{x}) and \mathbf{x} have different dimensions (at downsampling stages where the number of feature-map channels changes), they cannot be added directly. The paper uses a linear projection:

\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + W_s \mathbf{x}

where W_s is a projection matrix that matches the dimensions. The paper experimented with three options:

  • Option A: zero-padding to match dimensions (no extra parameters)
  • Option B: a 1x1 convolution projection only where dimensions change
  • Option C: a 1x1 convolution projection on every shortcut

All three options clearly outperformed Plain Networks, and the differences between them were minor. This shows that the identity shortcut itself, not the projection, is what resolves Degradation. The final ResNet adopts Option B for memory and compute efficiency.
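Options A and B can be sketched in a few lines for a hypothetical 64-to-128-channel downsampling step (shapes only, not the paper's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical 56x56x64 -> 28x28x128 transition (the conv2_x -> conv3_x step).
x = torch.randn(1, 64, 56, 56)

# Option A: stride-2 subsampling plus zero-padding the new channels (no parameters).
shortcut_a = F.pad(x[:, :, ::2, ::2], (0, 0, 0, 0, 0, 64))  # pad channels 64 -> 128

# Option B: a learned 1x1 convolution projection W_s (adds parameters).
proj = nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False)
shortcut_b = proj(x)

assert shortcut_a.shape == (1, 128, 28, 28)
assert shortcut_b.shape == (1, 128, 28, 28)
```

Either result can now be added element-wise to the residual branch's output.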


5. Mathematical Analysis: Gradient Flow

5.1 Forward Propagation

Let us analyze forward propagation through Residual Blocks. Let \mathbf{x}_l be the feature entering the l-th Residual Block. Then:

\mathbf{x}_{l+1} = \mathbf{x}_l + \mathcal{F}(\mathbf{x}_l, W_l)

Unrolling this recursion, the output at an arbitrarily deep layer L is:

\mathbf{x}_L = \mathbf{x}_l + \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i, W_i)

This equation matters a great deal: the features at any deep layer L are the features of a shallower layer l plus the sum of all the residual functions in between. In a Plain Network this relationship is a chain of matrix multiplications; in ResNet it is a sum.
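The unrolled sum can be checked numerically with toy scalar residual functions (illustrative, not from the paper):

```python
import math

# Numeric check of the unrolled forward equation: the deep feature x_L equals
# x_l plus the SUM of all intermediate residuals, not a product of transforms.
fs = [lambda v: 0.1 * v, lambda v: -0.3 * v, lambda v: 0.2 * v]  # toy F_i

x0 = 2.0
xs = [x0]
for f in fs:                         # recursion: x_{i+1} = x_i + F(x_i)
    xs.append(xs[-1] + f(xs[-1]))

unrolled = xs[0] + sum(f(xi) for f, xi in zip(fs, xs))   # x_l + sum_i F(x_i)
assert math.isclose(xs[-1], unrolled)
```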

5.2 Backward Propagation and the Gradient Highway

Now the key part: backward propagation. Writing the loss as \mathcal{L}, the chain rule gives:

\frac{\partial \mathcal{L}}{\partial \mathbf{x}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}_L} \cdot \frac{\partial \mathbf{x}_L}{\partial \mathbf{x}_l}

Substituting the forward equation derived above:

\frac{\partial \mathcal{L}}{\partial \mathbf{x}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}_L} \cdot \left(1 + \frac{\partial}{\partial \mathbf{x}_l} \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i, W_i)\right)

The crucial term is the constant 1. Through this path the gradient can flow directly from the loss to any layer. However small the term \frac{\partial}{\partial \mathbf{x}_l} \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i, W_i) becomes, the constant 1 keeps the gradient from vanishing entirely.

This is how ResNet forms a gradient highway. In a Plain Network, the gradient must be multiplied through every layer's weight matrix:

\frac{\partial \mathcal{L}}{\partial \mathbf{x}_l} = \prod_{i=l}^{L-1} \frac{\partial \mathbf{x}_{i+1}}{\partial \mathbf{x}_i} \cdot \frac{\partial \mathcal{L}}{\partial \mathbf{x}_L}

In this product form, if each factor is even slightly below 1, the gradient decays geometrically. The sum form of ResNet avoids this problem.
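A toy numerical comparison (illustrative numbers, not measurements) shows the gap between the product and sum forms:

```python
# Toy numbers: in a 50-layer plain net where each Jacobian factor has norm 0.9,
# the gradient shrinks by 0.9^50; the residual form keeps the additive 1.
depth = 50
plain_gain = 0.9 ** depth                              # product of per-layer factors
residual_gain = 1 + sum(-0.002 for _ in range(depth))  # 1 + small residual terms

assert plain_gain < 1e-2      # vanishes geometrically (~0.005)
assert residual_gain > 0.85   # stays near 1
```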

5.3 Why Learning the "Residual" Is Easier

A more intuitive view: if the optimal behavior of each layer is to make only a small change to its input (a natural assumption in a deep network), then the residual function \mathcal{F}(\mathbf{x}) should output values close to zero.

Since weight matrices are initialized near zero, each Residual Block performs a near-identity mapping early in training. The deep network thus starts out behaving like a shallow one, with each block gradually learning a useful transformation.


6. Architecture Details

6.1 Overall Structure

ResNet builds on VGGNet's design philosophy and adds Shortcut Connections. Every ResNet variant shares the following structure:

  1. conv1: 7x7 convolution, stride 2, 64 filters, BatchNorm, ReLU
  2. Max Pooling: 3x3, stride 2
  3. conv2_x ~ conv5_x: stacks of Residual Blocks
  4. Global Average Pooling: reduces the feature map to 1x1
  5. Fully Connected Layer: 1000-class Softmax

When the feature-map size halves (the first block of conv3_x, conv4_x, and conv5_x), the channel count doubles. Downsampling is performed by stride-2 convolutions.

6.2 Basic Block (ResNet-18, ResNet-34)

The Basic Block consists of two 3x3 convolutions.

Input (C channels)
  |
  ├──→ 3x3 Conv, C filters, BN, ReLU
  |    3x3 Conv, C filters, BN
  |
  └──→ (Identity Shortcut)
  |
  + ←── Element-wise Addition
  |
  ReLU
  |
Output (C channels)

Each convolution is followed by Batch Normalization, and ReLU is applied after the addition.

6.3 Bottleneck Block (ResNet-50, ResNet-101, ResNet-152)

ResNets with 50 or more layers use the Bottleneck structure for computational efficiency. It consists of three convolutions (1x1, 3x3, 1x1).

Input (4C channels)
  |
  ├──→ 1x1 Conv, C filters, BN, ReLU    (channel reduction: 4C → C)
  |    3x3 Conv, C filters, BN, ReLU    (spatial processing)
  |    1x1 Conv, 4C filters, BN          (channel restoration: C → 4C)
  |
  └──→ (Identity Shortcut)
  |
  + ←── Element-wise Addition
  |
  ReLU
  |
Output (4C channels)

The core idea of the Bottleneck is to shrink the channel count to a quarter with a 1x1 convolution, perform the expensive 3x3 convolution at the reduced width, and then restore the channels with another 1x1 convolution. Thanks to this structure, ResNet-50 is deeper than ResNet-34 yet keeps a similar FLOP count.
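The savings can be verified with a quick parameter count for one block at the conv2_x width (C = 64), ignoring BN and biases:

```python
# Parameters of one Basic Block vs one Bottleneck Block at C = 64.
# Basic: two 3x3 convs at width 64 (64 -> 64 -> 64).
basic = 2 * (3 * 3 * 64 * 64)

# Bottleneck: 1x1 down (256 -> 64), 3x3 (64 -> 64), 1x1 up (64 -> 256).
bottleneck = (1 * 1 * 256 * 64) + (3 * 3 * 64 * 64) + (1 * 1 * 64 * 256)

assert basic == 73_728
assert bottleneck == 69_632   # one extra layer, yet fewer parameters
```

Despite processing four times as many input/output channels (256 vs 64), the bottleneck block uses slightly fewer parameters than the basic block.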

6.4 Architecture Comparison

| Layer | Output Size | ResNet-18 | ResNet-34 | ResNet-50 | ResNet-101 | ResNet-152 |
|---|---|---|---|---|---|---|
| conv1 | 112x112 | 7x7, 64, stride 2 | 7x7, 64, stride 2 | 7x7, 64, stride 2 | 7x7, 64, stride 2 | 7x7, 64, stride 2 |
| pool | 56x56 | 3x3 max pool, stride 2 | 3x3 max pool, stride 2 | 3x3 max pool, stride 2 | 3x3 max pool, stride 2 | 3x3 max pool, stride 2 |
| conv2_x | 56x56 | [3x3, 64] x2 | [3x3, 64] x3 | [1x1, 64; 3x3, 64; 1x1, 256] x3 | [1x1, 64; 3x3, 64; 1x1, 256] x3 | [1x1, 64; 3x3, 64; 1x1, 256] x3 |
| conv3_x | 28x28 | [3x3, 128] x2 | [3x3, 128] x4 | [1x1, 128; 3x3, 128; 1x1, 512] x4 | [1x1, 128; 3x3, 128; 1x1, 512] x4 | [1x1, 128; 3x3, 128; 1x1, 512] x8 |
| conv4_x | 14x14 | [3x3, 256] x2 | [3x3, 256] x6 | [1x1, 256; 3x3, 256; 1x1, 1024] x6 | [1x1, 256; 3x3, 256; 1x1, 1024] x23 | [1x1, 256; 3x3, 256; 1x1, 1024] x36 |
| conv5_x | 7x7 | [3x3, 512] x2 | [3x3, 512] x3 | [1x1, 512; 3x3, 512; 1x1, 2048] x3 | [1x1, 512; 3x3, 512; 1x1, 2048] x3 | [1x1, 512; 3x3, 512; 1x1, 2048] x3 |
| output | 1x1 | Global Average Pool, 1000-d FC, Softmax (all variants) | | | | |

6.5 Parameters and Computation

| Model | Layers | Parameters | FLOPs |
|---|---|---|---|
| VGG-19 | 19 | 144M | 19.6B |
| ResNet-18 | 18 | 11.7M | 1.8B |
| ResNet-34 | 34 | 21.8M | 3.6B |
| ResNet-50 | 50 | 25.6M | 3.8B |
| ResNet-101 | 101 | 44.5M | 7.6B |
| ResNet-152 | 152 | 60.2M | 11.3B |

Notably, ResNet-152 is 8x deeper than VGG-19 yet needs less computation and fewer than half the parameters. VGGNet spends most of its parameters in the final fully connected layers, whereas ResNet uses Global Average Pooling to shrink the FC-layer parameters dramatically.
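A back-of-the-envelope count makes this concrete, comparing VGG's first FC layer (which maps the flattened 7x7x512 feature map into 4096 units) against ResNet-50's single FC layer after Global Average Pooling:

```python
# Where VGG's parameters go: its first FC layer alone dwarfs ResNet's head.
vgg_fc1 = 512 * 7 * 7 * 4096   # flatten 7x7x512, project to 4096 units
resnet_fc = 2048 * 1000        # GAP to a 2048-d vector, then 1000-class FC

assert vgg_fc1 == 102_760_448  # ~103M parameters in one layer
assert resnet_fc == 2_048_000  # ~2M: roughly 50x smaller
```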


7. Experimental Results

7.1 ImageNet Classification

Confirming Degradation in Plain Networks

The paper first confirmed the Degradation problem on Plain Networks, which have no shortcuts.

| Model | Top-1 Error (%) | Top-5 Error (%) |
|---|---|---|
| Plain-18 | 27.94 | - |
| Plain-34 | 28.54 | - |

The 34-layer Plain Network shows an error rate 0.6% higher than the 18-layer one. This is the Degradation problem.

The Effect of Residual Networks

Adding only Shortcut Connections to the same structure:

| Model | Top-1 Error (%) | Top-5 Error (%) |
|---|---|---|
| ResNet-18 | 27.88 | - |
| ResNet-34 | 25.03 | - |

ResNet-34 achieves an error rate 2.85% lower than ResNet-18. The Degradation observed in Plain Networks disappears entirely, and performance now clearly improves with depth.

Bottleneck ResNet Results (10-crop Testing)

| Model | Top-1 Error (%) | Top-5 Error (%) |
|---|---|---|
| ResNet-50 | 22.85 | 6.71 |
| ResNet-101 | 21.75 | 6.05 |
| ResNet-152 | 21.43 | 5.71 |

ResNet-152's single-model top-5 error is 4.49% (multi-scale, multi-crop), and an ensemble of 6 models reaches 3.57%, taking first place in ImageNet ILSVRC 2015 Classification.

Comparison with VGG and GoogLeNet

| Model | Top-5 Error (%) | Ensemble Top-5 Error (%) |
|---|---|---|
| VGG-16 | 7.3 | - |
| GoogLeNet | 6.7 | - |
| ResNet-152 (single model) | 4.49 | - |
| ResNet Ensemble (6 models) | - | 3.57 |

7.2 CIFAR-10 Experiments

The paper also confirmed Degradation and verified ResNet's effect on CIFAR-10 (32x32 images, 10 classes). The CIFAR-10 ResNet differs from the ImageNet version: its first layer is a 3x3 convolution, and it uses {n, n, n} Residual Blocks across three stages with feature-map sizes 32x32, 16x16, and 8x8.

| Model | Layers | Error (%) |
|---|---|---|
| ResNet-20 | 20 | 8.75 |
| ResNet-32 | 32 | 7.51 |
| ResNet-44 | 44 | 7.17 |
| ResNet-56 | 56 | 6.97 |
| ResNet-110 | 110 | 6.43 |
| ResNet-1202 | 1202 | 7.93 |

Performance improves consistently up to 110 layers. The 1202-layer network has low training error but higher test error than the 110-layer one, which the paper attributes to overfitting: 19.4M parameters is excessive for a small dataset (50,000 training images). The paper applied no regularization (such as Dropout) and notes that doing so could improve the result.

7.3 COCO Object Detection & Segmentation

ResNet's effectiveness was also verified beyond Image Classification, in Object Detection and Segmentation.

PASCAL VOC & COCO Detection

Replacing the Faster R-CNN backbone from VGG-16 with ResNet-101 yielded:

  • COCO Detection: a 6.0% absolute improvement in mAP@[.5, .95] over VGG-16 (a 28% relative improvement)
  • 1st place in the ILSVRC 2015 Detection Task
  • 1st place in the COCO 2015 Detection Task

COCO Segmentation

  • 1st place in the COCO 2015 Segmentation Task

These results established that ResNet is not merely a classification model but a general-purpose feature extractor that delivers strong performance across a range of vision tasks.


8. Implementation Details

8.1 He Initialization

Before ResNet, Kaiming He had already proposed a weight initialization suited to ReLU networks ("Delving Deep into Rectifiers", He et al., 2015):

W \sim \mathcal{N}\!\left(0, \frac{2}{n_{in}}\right)

where n_{in} is the number of input units to the layer and the second argument is the variance. The variance is set to \frac{2}{n_{in}} to account for ReLU zeroing out half of its input (the negative part). Xavier Initialization (variance \frac{1}{n_{in}}) suits Sigmoid/Tanh, but with ReLU the signal variance shrinks layer by layer.

He Initialization keeps each layer's output variance constant, so the forward signal neither vanishes nor explodes.
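A quick empirical sanity check (a sketch, not the paper's code): sample a He-initialized weight matrix, push unit-variance data through a linear layer plus ReLU, and confirm the signal's scale is preserved:

```python
import torch

# With std = sqrt(2/n_in), the second moment of ReLU(Wx) stays near the
# input's, instead of halving at every layer.
torch.manual_seed(0)
n_in = 1024
w = torch.randn(n_in, n_in) * (2.0 / n_in) ** 0.5   # He initialization
x = torch.randn(4096, n_in)                          # unit-variance input

out = torch.relu(x @ w.t())
# ReLU zeroes half the pre-activations; the factor 2 compensates,
# so E[out^2] stays near 1 rather than decaying layer by layer.
second_moment = (out ** 2).mean().item()
assert 0.8 < second_moment < 1.2
```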

8.2 Batch Normalization

ResNet applies Batch Normalization (BN) after every convolution layer:

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta

where:

  • \mu_B, \sigma_B^2: the mean and variance of the mini-batch
  • \gamma, \beta: learnable scale and shift parameters
  • \epsilon: a small constant for numerical stability

BN serves several roles:

  • Mitigating internal covariate shift: normalizing each layer's input distribution stabilizes training
  • Regularization: normalizing per mini-batch adds slight noise, acting as a regularizer
  • Higher learning rates: with stabilized distributions, larger learning rates become usable

In ResNet, the ordering is Conv → BN → ReLU (post-activation). This ordering is improved in the follow-up Pre-activation ResNet.
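The two normalization equations above can be verified against PyTorch's own implementation (training mode, where γ = 1 and β = 0 at initialization):

```python
import torch
import torch.nn as nn

# Implement the BN equations by hand and compare with nn.BatchNorm2d.
torch.manual_seed(0)
x = torch.randn(8, 16, 4, 4)

bn = nn.BatchNorm2d(16, eps=1e-5)
expected = bn(x)                                   # training-mode batch stats

mean = x.mean(dim=(0, 2, 3), keepdim=True)                 # mu_B per channel
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)   # sigma_B^2 (biased)
manual = (x - mean) / torch.sqrt(var + 1e-5)               # gamma=1, beta=0

assert torch.allclose(manual, expected, atol=1e-5)
```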

8.3 Training Schedule

Training settings on ImageNet:

| Hyperparameter | Value |
|---|---|
| Optimizer | SGD with Momentum |
| Momentum | 0.9 |
| Weight Decay | 0.0001 |
| Batch Size | 256 |
| Initial Learning Rate | 0.1 |
| LR Schedule | divide by 10 every 30 epochs |
| Total Epochs | ~90 |
| Data Augmentation | Random Crop (224x224), Horizontal Flip, Color Jittering |
| Preprocessing | Per-pixel Mean Subtraction |

During training, images are randomly resized so that the shorter side falls in [256, 480], then randomly cropped to 224x224. At test time, 10-crop testing is used (4 corners + center, plus a horizontal flip of each); multi-scale testing runs fully convolutional inference at sizes {224, 256, 384, 480, 640}.
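The table above maps onto PyTorch's SGD and StepLR as the following sketch (the model and training loop are stand-ins):

```python
import torch
import torch.nn as nn

# The paper's optimizer settings, as a PyTorch sketch.
model = nn.Linear(10, 10)   # stand-in for a ResNet
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4
)
# Divide the learning rate by 10 every 30 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

lrs = []
for epoch in range(90):
    optimizer.step()        # (actual training loop elided)
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])

assert abs(lrs[29] - 0.01) < 1e-9 and abs(lrs[59] - 0.001) < 1e-9
```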

8.4 The Absence of Dropout

Interestingly, ResNet uses no Dropout. Batch Normalization provides sufficient regularization, and the Bottleneck structure inherently limits the parameter count. Global Average Pooling also slashes the FC-layer parameters, lowering the risk of overfitting.


9. PyTorch Implementation

9.1 Basic Block

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """ResNet-18, ResNet-34에 사용되는 Basic Residual Block"""
    expansion = 1  # 출력 채널 = 입력 채널 * expansion

    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        # First 3x3 conv (can downsample via stride)
        self.conv1 = nn.Conv2d(
            in_channels, out_channels, kernel_size=3,
            stride=stride, padding=1, bias=False
        )
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

        # Second 3x3 conv
        self.conv2 = nn.Conv2d(
            out_channels, out_channels, kernel_size=3,
            stride=1, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Shortcut: 1x1 conv projection when dimensions differ
        self.downsample = downsample

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        # Shortcut Connection
        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity  # F(x) + x
        out = self.relu(out)

        return out

9.2 Bottleneck Block

class Bottleneck(nn.Module):
    """ResNet-50, ResNet-101, ResNet-152에 사용되는 Bottleneck Block"""
    expansion = 4  # 출력 채널 = 중간 채널 * 4

    def __init__(self, in_channels, mid_channels, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        out_channels = mid_channels * self.expansion

        # 1x1 conv: channel reduction (squeeze)
        self.conv1 = nn.Conv2d(
            in_channels, mid_channels, kernel_size=1, bias=False
        )
        self.bn1 = nn.BatchNorm2d(mid_channels)

        # 3x3 conv: spatial processing (can downsample via stride)
        self.conv2 = nn.Conv2d(
            mid_channels, mid_channels, kernel_size=3,
            stride=stride, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(mid_channels)

        # 1x1 conv: channel restoration (expand)
        self.conv3 = nn.Conv2d(
            mid_channels, out_channels, kernel_size=1, bias=False
        )
        self.bn3 = nn.BatchNorm2d(out_channels)

        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample

    def forward(self, x):
        identity = x

        # 1x1 → BN → ReLU
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        # 3x3 → BN → ReLU
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        # 1x1 → BN
        out = self.conv3(out)
        out = self.bn3(out)

        # Shortcut Connection
        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity  # F(x) + x
        out = self.relu(out)

        return out

9.3 전체 ResNet 모델

class ResNet(nn.Module):
    def __init__(self, block, layers, num_classes=1000):
        """
        Args:
            block: BasicBlock 또는 Bottleneck
            layers: 각 stage의 block 수 [conv2_x, conv3_x, conv4_x, conv5_x]
            num_classes: 분류 클래스 수
        """
        super(ResNet, self).__init__()
        self.in_channels = 64

        # conv1: 7x7, stride 2
        self.conv1 = nn.Conv2d(
            3, 64, kernel_size=7, stride=2, padding=3, bias=False
        )
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        # conv2_x ~ conv5_x
        self.layer1 = self._make_layer(block, 64, layers[0], stride=1)
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)

        # Classification Head
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        # He Initialization
        self._initialize_weights()

    def _make_layer(self, block, mid_channels, num_blocks, stride):
        downsample = None
        out_channels = mid_channels * block.expansion

        # The first block of a stage may need downsampling
        if stride != 1 or self.in_channels != out_channels:
            downsample = nn.Sequential(
                nn.Conv2d(
                    self.in_channels, out_channels,
                    kernel_size=1, stride=stride, bias=False
                ),
                nn.BatchNorm2d(out_channels),
            )

        layers = []
        layers.append(block(self.in_channels, mid_channels, stride, downsample))
        self.in_channels = out_channels

        for _ in range(1, num_blocks):
            layers.append(block(self.in_channels, mid_channels))

        return nn.Sequential(*layers)

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # He Initialization (fan_out mode)
                nn.init.kaiming_normal_(
                    m.weight, mode='fan_out', nonlinearity='relu'
                )
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        # Stem
        x = self.conv1(x)       # 224x224 → 112x112
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)     # 112x112 → 56x56

        # Residual Stages
        x = self.layer1(x)      # 56x56
        x = self.layer2(x)      # 28x28
        x = self.layer3(x)      # 14x14
        x = self.layer4(x)      # 7x7

        # Classification Head
        x = self.avgpool(x)     # 1x1
        x = torch.flatten(x, 1)
        x = self.fc(x)

        return x


# Model factory functions
def resnet18(num_classes=1000):
    return ResNet(BasicBlock, [2, 2, 2, 2], num_classes)

def resnet34(num_classes=1000):
    return ResNet(BasicBlock, [3, 4, 6, 3], num_classes)

def resnet50(num_classes=1000):
    return ResNet(Bottleneck, [3, 4, 6, 3], num_classes)

def resnet101(num_classes=1000):
    return ResNet(Bottleneck, [3, 4, 23, 3], num_classes)

def resnet152(num_classes=1000):
    return ResNet(Bottleneck, [3, 8, 36, 3], num_classes)

9.4 Usage Example

# Create ResNet-50 and run a forward pass
model = resnet50(num_classes=1000)
x = torch.randn(1, 3, 224, 224)  # Batch=1, RGB, 224x224
output = model(x)
print(f"Output shape: {output.shape}")  # torch.Size([1, 1000])

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")
# ResNet-50: about 25,557,032 (25.6M)

# Using the official PyTorch pre-trained model
import torchvision.models as models
resnet50_pretrained = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

10. Pre-activation ResNet: Identity Mappings in Deep Residual Networks

10.1 Motivation for the Follow-up Paper

Shortly after ResNet, the same authors (He et al.) published "Identity Mappings in Deep Residual Networks" at ECCV 2016. The paper showed that reordering the operations inside the Residual Block lets even deeper networks (1001 layers) be trained more effectively.

10.2 Original vs Pre-activation

Original ResNet (post-activation):

\mathbf{x}_{l+1} = \text{ReLU}(\mathcal{F}(\mathbf{x}_l) + \mathbf{x}_l)

Because ReLU sits after the addition, the signal on the shortcut path is also affected by it. The identity mapping is no longer a true identity, which weakens the gradient-highway effect.

Pre-activation ResNet:

\mathbf{x}_{l+1} = \mathbf{x}_l + \mathcal{F}(\text{ReLU}(\text{BN}(\mathbf{x}_l)))

The operation order becomes BN → ReLU → Conv → BN → ReLU → Conv. With this arrangement, the shortcut path is a pure identity mapping.

10.3 Structural Comparison

[Original ResNet]                [Pre-activation ResNet]
Input ──┐                        Input ──┐
        │                                │
     Conv                             BN
        │                                │
      BN                             ReLU
        │                                │
     ReLU                            Conv
        │                                │
     Conv                             BN
        │                                │
      BN                             ReLU

   (+) ←┘ shortcut                   Conv
        │                                │
     ReLU                           (+) ←┘ shortcut
        │                                │
     Output                          Output

10.4 Mathematical Advantage

In the pre-activation structure, forward propagation is:

\mathbf{x}_L = \mathbf{x}_l + \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i)

and backward propagation is:

\frac{\partial \mathcal{L}}{\partial \mathbf{x}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}_L} \cdot \left(1 + \frac{\partial}{\partial \mathbf{x}_l} \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i)\right)

Since the shortcut is a pure identity, the constant term 1 is preserved exactly. In the original ResNet, the interposed ReLU means this term is not exactly 1.

10.5 Experimental Results

| Model | CIFAR-10 Error (%) | CIFAR-100 Error (%) |
|---|---|---|
| ResNet-110 (original) | 6.43 | - |
| ResNet-1001 (original) | ~7.61 | - |
| ResNet-1001 (pre-activation) | 4.62 | 22.71 |

The pre-activation structure brought large gains over the original design, especially for very deep networks (1001 layers), experimentally confirming that a pure identity mapping is critical for gradient flow.

10.6 PyTorch Implementation

class PreActBasicBlock(nn.Module):
    """Pre-activation Basic Block (BN → ReLU → Conv)"""
    expansion = 1

    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(PreActBasicBlock, self).__init__()
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv1 = nn.Conv2d(
            in_channels, out_channels, kernel_size=3,
            stride=stride, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(
            out_channels, out_channels, kernel_size=3,
            stride=1, padding=1, bias=False
        )
        self.downsample = downsample

    def forward(self, x):
        identity = x

        # Pre-activation: BN → ReLU → Conv
        out = self.bn1(x)
        out = self.relu(out)

        if self.downsample is not None:
            identity = self.downsample(out)

        out = self.conv1(out)
        out = self.bn2(out)
        out = self.relu(out)
        out = self.conv2(out)

        out += identity
        return out

11. ResNet's Impact and Follow-up Work

The Residual Learning paradigm proposed by ResNet became the foundation for numerous architectural innovations. The major follow-ups:

11.1 ResNeXt (2017)

"Aggregated Residual Transformations for Deep Neural Networks" - Xie et al., Facebook AI Research

ResNeXt introduced a new dimension, Cardinality (the number of groups), into the Residual Block. Within one block, several transformation paths run in parallel and are summed:

\mathcal{F}(\mathbf{x}) = \sum_{i=1}^{C} \mathcal{T}_i(\mathbf{x})

where C is the cardinality (e.g. 32) and \mathcal{T}_i is each path's transformation. In implementation, this is handled efficiently with grouped convolutions.

ResNeXt-101 (32x4d) achieved higher accuracy than ResNet-101 at the same computational cost, showing that cardinality is a more effective dimension to scale than width (channels) or depth (layers).
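The grouped-convolution trick can be illustrated with a small sketch (hypothetical 128-channel block, cardinality 32):

```python
import torch
import torch.nn as nn

# Cardinality C = 32 expressed as a grouped convolution: one grouped 3x3 conv
# computes 32 parallel transformations over channel slices.
x = torch.randn(1, 128, 14, 14)
grouped = nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=32, bias=False)
dense = nn.Conv2d(128, 128, kernel_size=3, padding=1, bias=False)

assert grouped(x).shape == dense(x).shape == (1, 128, 14, 14)
# Same output shape, but 1/32 of the parameters in the 3x3 layer:
assert grouped.weight.numel() * 32 == dense.weight.numel()
```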

11.2 DenseNet (2017)

"Densely Connected Convolutional Networks" - Huang et al., Cornell/Facebook

DenseNet takes residual connectivity to its extreme: every layer connects directly to all preceding layers. Where ResNet uses element-wise addition, DenseNet uses channel-wise concatenation:

\mathbf{x}_l = \mathcal{H}_l([\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_{l-1}])

This structure maximizes feature reuse and improves parameter efficiency. DenseNet-121 matches ResNet-50's performance with fewer parameters.

11.3 SENet (2018)

"Squeeze-and-Excitation Networks" - Hu et al., Momenta

SENet adds an SE Module to the Residual Block to model inter-channel relationships, learning a per-channel importance weight and rescaling the channels accordingly:

\mathbf{s} = \sigma(W_2 \cdot \text{ReLU}(W_1 \cdot \text{GAP}(\mathbf{x}))), \qquad \tilde{\mathbf{x}} = \mathbf{s} \odot \mathbf{x}

where GAP is Global Average Pooling, \sigma is the Sigmoid, and \odot is channel-wise multiplication. SENet won ILSVRC 2017 Classification with a top-5 error of 2.251%.
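The two equations translate into a compact module; this is a minimal sketch (reduction ratio r = 16 as in the paper; the 64-channel input is illustrative):

```python
import torch
import torch.nn as nn

class SEModule(nn.Module):
    """Squeeze-and-Excitation: learn a per-channel gate in [0, 1]."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # squeeze
        self.fc2 = nn.Linear(channels // reduction, channels)  # excite

    def forward(self, x):
        s = x.mean(dim=(2, 3))                                 # GAP over H, W
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))   # channel gates
        return x * s[:, :, None, None]                         # reweight channels

x = torch.randn(2, 64, 8, 8)
out = SEModule(64)(x)
assert out.shape == x.shape
```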

11.4 EfficientNet (2019)

"EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks" - Tan & Le, Google Brain

EfficientNet proposed Compound Scaling, which scales a network's width, depth, and resolution together in a balanced way. It builds on the MBConv (Mobile Inverted Bottleneck) block, which itself uses residual connections:

\text{depth: } d = \alpha^\phi, \quad \text{width: } w = \beta^\phi, \quad \text{resolution: } r = \gamma^\phi, \quad \text{s.t. } \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2

EfficientNet-B7 reached the best ImageNet accuracy of its time with 8.4x fewer parameters than ResNet-based models.

11.5 ConvNeXt (2022)

"A ConvNet for the 2020s" - Liu et al., Facebook AI Research

ConvNeXt applied the design principles of the Vision Transformer (ViT) to CNNs to build a "modernized ResNet." Starting from ResNet-50, it applied the following changes step by step:

  1. Modernized training recipe (300 epochs, AdamW, Mixup, CutMix, etc.)
  2. Stage ratio change: (3, 4, 6, 3) → (3, 3, 9, 3)
  3. "Patchify" stem: 7x7 conv → 4x4 conv, stride 4
  4. ResNeXt-style grouped convolution
  5. Inverted Bottleneck
  6. Larger kernels: 3x3 → 7x7 depthwise conv
  7. Activation: ReLU → GELU
  8. Normalization: BN → Layer Normalization

ConvNeXt-T matches Swin-T (82.1% top-1 accuracy), showing that a pure CNN architecture can compete with Transformers. The study reconfirmed how solid a foundation ResNet's design remains.


12. Residual Connections in Modern Architectures

12.1 Skip Connections in the Transformer

The Transformer architecture from Vaswani et al.'s "Attention Is All You Need" (2017) uses a residual connection around every sub-layer:

\text{Output} = \text{LayerNorm}(\mathbf{x} + \text{SubLayer}(\mathbf{x}))

where SubLayer is either Multi-Head Attention or the Feed-Forward Network. For the same reasons demonstrated in ResNet, training a deep Transformer without these residual connections is practically impossible.

12.2 Pre-LayerNorm vs Post-LayerNorm

Transformers have a debate analogous to ResNet's pre- vs post-activation.

Post-LayerNorm (the original Transformer):

\mathbf{x}_{l+1} = \text{LN}(\mathbf{x}_l + \text{SubLayer}(\mathbf{x}_l))

Pre-LayerNorm (standard since GPT-2):

\mathbf{x}_{l+1} = \mathbf{x}_l + \text{SubLayer}(\text{LN}(\mathbf{x}_l))

Pre-LayerNorm follows the same principle as pre-activation ResNet: it keeps the shortcut path a pure identity, improving gradient flow. Most recent large language models, including GPT-2 and GPT-3, use Pre-LayerNorm.
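The Pre-LN form can be sketched in a few lines (a hypothetical feed-forward sub-layer; dimensions are illustrative):

```python
import torch
import torch.nn as nn

# Pre-LayerNorm residual sub-layer: x + SubLayer(LN(x)).
# The shortcut path never passes through LN, mirroring pre-activation ResNet.
dim = 32
ln = nn.LayerNorm(dim)
ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

x = torch.randn(4, 10, dim)      # (batch, sequence, model dim)
out = x + ffn(ln(x))             # identity shortcut preserved
assert out.shape == x.shape
```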

12.3 Residual Connections in Diffusion Models

The U-Net architecture of the Denoising Diffusion Probabilistic Model (DDPM) also uses skip connections in each residual block. The long skip connections between the U-Net encoder and decoder combine with the residual skips inside each block to exploit features at multiple scales effectively.

12.4 Vision Transformer (ViT)

ViT (Vision Transformer) splits an image into 16x16 patches and feeds them to a Transformer encoder. Each Transformer block naturally uses residual connections; without them, training a ViT deeper than about 12 layers is difficult.

12.5 Key Lessons

ResNet's most important legacy is not a specific architecture but the design principle of the residual connection, which can be summarized as:

  1. Make the identity mapping the default: even if the network learns nothing, it should at least pass its input through.
  2. Secure a gradient highway: create a path along which gradients flow directly from the loss to every layer.
  3. Depth becomes a free parameter: with residual connections, increasing depth is always a gain, or at worst not a loss.

This principle applies universally across CNNs, Transformers, Diffusion Models, State Space Models, and beyond.


13. Limitations and Criticism

13.1 Inefficiency of Feature Reuse

According to Veit et al. (2016), "Residual Networks Behave Like Ensembles of Relatively Shallow Networks," most of a ResNet's layers actually pass information through very short (shallow) paths, and the contribution of very deep paths is marginal. This raises the question of whether all 152 layers are being used efficiently.

13.2 Element-wise Addition of Feature Maps

The DenseNet authors argued that ResNet's element-wise addition can lose information, and that concatenation-based DenseNet enables more efficient feature reuse. Concatenation, however, sharply increases memory usage.

13.3 Computational Overhead

Although Global Average Pooling and the Bottleneck structure reduce parameter counts, the actual inference speed of a very deep ResNet (ResNet-152) is not necessarily faster than VGGNet. Memory access cost and sequential dependencies can become the real bottlenecks.


14. Summary

ResNet is one of the most important papers in the history of deep learning. Its contributions:

  1. Discovering and defining the Degradation problem: it pinpointed the phenomenon of rising training error in deeper networks.

  2. The Residual Learning Framework: the \mathcal{F}(\mathbf{x}) + \mathbf{x} structure explicitly embeds the identity mapping, enabling networks hundreds of layers deep to be trained.

  3. The Gradient Highway: it presented the mathematical mechanism by which skip connections propagate gradients directly.

  4. The Bottleneck structure: channel reduction and restoration via 1x1 convolutions achieves depth and efficiency at once.

  5. Overwhelming experimental results: it dominated all prior methods on ImageNet (3.57% top-5 error), CIFAR-10, and COCO Detection/Segmentation.

  6. A universal design principle: residual connections became an essential element of every major modern architecture, from Transformers to Diffusion Models.

A single addition (+ \mathbf{x}) broke through the depth limit of deep learning and laid the foundation for the following decade of AI progress. ResNet is a prime example of how the simplest ideas are sometimes the most powerful.


15. References

  1. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR 2016. arXiv:1512.03385

  2. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Identity Mappings in Deep Residual Networks. ECCV 2016. arXiv:1603.05027

  3. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV 2015. arXiv:1502.01852

  4. Simonyan, K., & Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015. arXiv:1409.1556

  5. Szegedy, C., et al. (2015). Going Deeper with Convolutions. CVPR 2015. arXiv:1409.4842

  6. Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015. arXiv:1502.03167

  7. Xie, S., et al. (2017). Aggregated Residual Transformations for Deep Neural Networks. CVPR 2017. arXiv:1611.05431

  8. Huang, G., et al. (2017). Densely Connected Convolutional Networks. CVPR 2017. arXiv:1608.06993

  9. Hu, J., et al. (2018). Squeeze-and-Excitation Networks. CVPR 2018. arXiv:1709.01507

  10. Tan, M., & Le, Q. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. ICML 2019. arXiv:1905.11946

  11. Liu, Z., et al. (2022). A ConvNet for the 2020s. CVPR 2022. arXiv:2201.03545

  12. Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017. arXiv:1706.03762

  13. Veit, A., et al. (2016). Residual Networks Behave Like Ensembles of Relatively Shallow Networks. NeurIPS 2016. arXiv:1605.06431

  14. KaimingHe/deep-residual-networks. GitHub Repository

  15. ILSVRC 2015 Results. ImageNet Challenge

ResNet Paper In-Depth Analysis: How Residual Connections Broke the Depth Barrier in Deep Learning


1. Paper Overview

"Deep Residual Learning for Image Recognition" was published in 2015 by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun from Microsoft Research. It received the Best Paper Award at CVPR 2016 and has accumulated over 200,000 citations as of 2025, making it one of the most influential papers in the history of deep learning.

The problem this paper solved is straightforward: "If deeper networks are supposed to perform better, why do they actually perform worse?" The Residual Learning Framework, proposed as the answer to this simple question, successfully trained a 152-layer network that achieved a top-5 error rate of 3.57% on the ImageNet ILSVRC 2015 Classification Task, securing first place. This figure is well below the estimated human error rate (approximately 5.1%).

ResNet did not merely win the Image Classification track. That same year, it swept first place across five tracks: ImageNet Classification, ImageNet Detection, ImageNet Localization, COCO Detection, and COCO Segmentation. Furthermore, the Skip Connection (Shortcut Connection) proposed in this paper has since become a core component in virtually every modern deep learning architecture, including Transformers, BERT, GPT, and Diffusion Models.


2. Background: The Depth Dilemma

2.1 The Era of Deep Networks

After AlexNet (8 layers) won the ImageNet Challenge in 2012, the deep learning community followed the intuition that "deeper networks = better performance." This intuition was largely correct.

| Year | Model | Layers | Top-5 Error (%) |
|------|-------|--------|-----------------|
| 2012 | AlexNet | 8 | 16.4 |
| 2013 | ZFNet | 8 | 14.8 |
| 2014 | VGGNet | 19 | 7.3 |
| 2014 | GoogLeNet (Inception v1) | 22 | 6.7 |
| 2015 | ResNet | 152 | 3.57 |

VGGNet (2014) increased depth to 19 layers using only 3x3 convolutions, while GoogLeNet used a parallel structure called the Inception Module to construct a 22-layer network. Both models experimentally demonstrated that "depth is critical for performance."

2.2 Lessons and Limitations of VGGNet

VGGNet established an important principle in architecture design: stacking multiple small filters (3x3) instead of a single large filter (5x5, 7x7) achieves the same receptive field while reducing the number of parameters and inserting more nonlinear activation functions to increase expressiveness.

However, 19 layers was effectively VGGNet's limit. The performance gains from VGG-16 to VGG-19 were already diminishing, and going deeper actually degraded performance. The parameter count was also problematic -- VGG-19 required approximately 144 million parameters and 19.6 billion FLOPs of computation.

2.3 GoogLeNet (Inception) Approach

GoogLeNet tackled the depth problem differently from VGGNet. It designed Inception Modules that perform 1x1, 3x3, and 5x5 convolutions in parallel, using 1x1 convolutions to reduce channel counts and save computation. Despite having only 22 layers, it achieved a lower error rate than VGGNet with fewer parameters (approximately 5 million).

However, the complex structure of Inception Modules had scalability limitations. Simply increasing the number of layers was insufficient to push performance further.

2.4 The Fundamental Question

At this point, the community faced a fundamental question:

"Is there a way to freely increase the depth of networks?"

The answer to this question is ResNet.


3. Discovery of the Degradation Problem

3.1 Deeper != Better

One of the most important contributions of the ResNet paper is clearly defining and experimentally demonstrating the degradation problem.

Intuitively, if you add identity mapping layers on top of a shallow network, the resulting deeper network should perform at least as well as the shallow one. The added layers only need to pass the input through unchanged. Therefore, the training error of a deeper network should never be higher than that of its shallower counterpart.

However, reality was different. The paper observed that deeper Plain Networks (standard networks without shortcuts) had higher training error than their shallower counterparts: a 56-layer plain network versus a 20-layer one on CIFAR-10, and a 34-layer versus an 18-layer one on ImageNet. This is not an overfitting problem. If it were overfitting, the training error would be low while only the validation error would be high. The fact that the training error itself is higher means that optimization itself is difficult.

3.2 Difference from Vanishing/Exploding Gradients

The degradation problem is a different phenomenon from vanishing or exploding gradients.

Vanishing/Exploding Gradients have been largely addressed by techniques such as Batch Normalization and He Initialization. In fact, the plain networks in the paper already employed these techniques, and the networks did converge. The problem was that the converged performance was lower than that of shallower networks.

$$\text{Training Error}_{56\text{-layer plain}} > \text{Training Error}_{20\text{-layer plain}}$$

This phenomenon suggests that even when gradients propagate well, learning identity mappings through a stack of nonlinear layers is inherently very difficult.

3.3 Construction Argument

The paper presents a key argument called the Construction Argument:

  1. Suppose there is a shallow network $A$.
  2. Add identity mapping layers on top of $A$ to create a deeper network $B$.
  3. $B$ should perform at least as well as $A$ (since the added layers are identity functions).
  4. Therefore, the training error of the deeper network $B$ cannot be higher than that of $A$.

But in actual experiments, the training error of $B$ is higher than that of $A$. This means that current SGD-based optimizers fail to find such solutions. The problem lies not in the model's representational capacity but in the optimization difficulty.


4. Core Idea: Residual Learning

4.1 Core Intuition

If the cause of the degradation problem is that "learning identity mappings is difficult," the solution is straightforward: explicitly embed identity mappings into the network.

Let the function that a block in a conventional network must learn be $\mathcal{H}(\mathbf{x})$. The original goal is to learn this $\mathcal{H}(\mathbf{x})$ directly. If $\mathcal{H}(\mathbf{x}) = \mathbf{x}$ (identity mapping), learning this function through a stack of nonlinear layers is difficult.

The core idea of ResNet is to reformulate the problem so that the network learns the residual instead of directly learning $\mathcal{H}(\mathbf{x})$:

$$\mathcal{F}(\mathbf{x}) := \mathcal{H}(\mathbf{x}) - \mathbf{x}$$

Therefore:

$$\mathcal{H}(\mathbf{x}) = \mathcal{F}(\mathbf{x}) + \mathbf{x}$$

If the optimal mapping is close to identity, driving $\mathcal{F}(\mathbf{x})$ to zero is much easier than making $\mathcal{H}(\mathbf{x})$ the identity. Initializing or learning the weights of nonlinear layers to be close to zero is a natural operation.

4.2 Structure of the Residual Block

A Residual Block has the following structure:

$$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}$$

Where:

  • $\mathbf{x}$: Input to the block
  • $\mathcal{F}(\mathbf{x}, \{W_i\})$: The residual function to be learned (2-3 convolution layers)
  • $\mathbf{y}$: Output of the block

The addition ($+$) in $\mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}$ is performed element-wise and is called a Shortcut Connection or Skip Connection. This operation requires no additional parameters and incurs negligible computational overhead.

For a Residual Block with two layers:

$$\mathcal{F} = W_2\, \sigma(W_1 \mathbf{x})$$

Where $\sigma$ is the ReLU activation function. Bias terms are omitted for notational convenience.
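The two-layer residual function above can be sketched directly. This is a minimal illustration using fully connected layers instead of the paper's convolutions; the dimension `d` is an illustrative choice:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A sketch of the two-layer residual function F = W2 * sigma(W1 * x),
# using fully connected layers for clarity (the paper's blocks use
# convolutions).
d = 8
W1 = nn.Linear(d, d, bias=False)
W2 = nn.Linear(d, d, bias=False)

x = torch.randn(4, d)          # a batch of 4 inputs
F_x = W2(torch.relu(W1(x)))    # residual branch F(x)
y = F_x + x                    # H(x) = F(x) + x via the shortcut

# The shortcut itself adds no parameters: only W1 and W2 are learnable.
num_params = sum(p.numel() for p in (W1.weight, W2.weight))
```

Note that the shortcut contributes zero parameters: `num_params` counts only the two weight matrices.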

4.3 Handling Dimension Mismatch

When the dimensions of $\mathcal{F}(\mathbf{x})$ and $\mathbf{x}$ differ (at downsampling stages where the number of feature map channels changes), they cannot be added directly. To address this, the paper uses a Linear Projection:

$$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + W_s \mathbf{x}$$

Where $W_s$ is a projection matrix that matches dimensions. The paper experimented with three options:

  • Option A: Match dimensions with zero-padding (no additional parameters)
  • Option B: Use 1x1 convolution projection only when dimensions change
  • Option C: Use 1x1 convolution projection for all shortcuts

Experimental results showed that all three options were vastly superior to Plain Networks, with minimal differences between options. This demonstrates that the projection is not the key to solving degradation -- the identity shortcut itself is the key. The final ResNet adopted Option B for memory and computational efficiency.
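Option A can be sketched in a few lines. The exact subsampling scheme below (taking every other pixel, then appending zero channels) is one common way to realize it and is an assumption; the paper only specifies a stride-2 shortcut with zero-padded extra channels:

```python
import torch
import torch.nn.functional as F_nn

# A sketch of Option A: a parameter-free shortcut that matches dimensions
# by stride-2 spatial subsampling plus zero-padding of extra channels.
def shortcut_option_a(x, out_channels):
    x = x[:, :, ::2, ::2]                 # spatial subsampling (stride 2)
    pad = out_channels - x.size(1)        # number of zero channels to add
    # F.pad order: (W_left, W_right, H_top, H_bottom, C_front, C_back)
    return F_nn.pad(x, (0, 0, 0, 0, 0, pad))

x = torch.randn(1, 64, 56, 56)            # e.g. end of conv2_x
y = shortcut_option_a(x, 128)             # matches conv3_x dimensions
```

The appended channels are exactly zero, so this shortcut adds no parameters and no learnable state.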


5. Mathematical Analysis: Gradient Flow

5.1 Forward Propagation

Let us analyze forward propagation through Residual Blocks. If the output of the $l$-th Residual Block is $\mathbf{x}_l$:

$$\mathbf{x}_{l+1} = \mathbf{x}_l + \mathcal{F}(\mathbf{x}_l, W_l)$$

Expanding this recursively, the output at any deeper layer $L$ is:

$$\mathbf{x}_L = \mathbf{x}_l + \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i, W_i)$$

The significance of this equation is profound. The feature at any deep layer $L$ is the feature from a shallow layer $l$ plus the sum of all residual functions in between. In a plain network, this would be a chain of matrix multiplications, whereas in ResNet it takes the form of addition.

5.2 Backward Propagation and the Gradient Highway

Now let us examine the crucial backward propagation. If the loss is $\mathcal{L}$, by the chain rule:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}_L} \cdot \frac{\partial \mathbf{x}_L}{\partial \mathbf{x}_l}$$

Substituting the forward equation derived earlier:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}_L} \cdot \left(1 + \frac{\partial}{\partial \mathbf{x}_l} \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i, W_i)\right)$$

The key here is the constant term 1. Gradients can flow directly from the loss to any layer through this path. Even if the term $\frac{\partial}{\partial \mathbf{x}_l} \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i, W_i)$ becomes arbitrarily small, the constant 1 ensures that gradients never completely vanish.

This is the principle by which ResNet forms a Gradient Highway. In a plain network, gradients must pass through a product over the weight matrices of all layers:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}_l} = \prod_{i=l}^{L-1} \frac{\partial \mathbf{x}_{i+1}}{\partial \mathbf{x}_i} \cdot \frac{\partial \mathcal{L}}{\partial \mathbf{x}_L}$$

In this product form, even if each factor is only slightly less than 1, the gradient decreases exponentially with depth. ResNet's additive form avoids this problem.
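The contrast between the product form and the additive form can be checked numerically. Below is a sketch comparing the gradient norm at the input of a deep stack of linear-ReLU blocks with and without shortcuts; the weight scale 0.15 is an illustrative assumption chosen so the plain network's gradient visibly decays:

```python
import torch

torch.manual_seed(0)
depth, d = 50, 16

# Shared weight matrices, scaled small enough that the plain network's
# gradient shrinks multiplicatively with depth.
Ws = [0.15 * torch.randn(d, d) for _ in range(depth)]

def input_grad_norm(residual):
    x = torch.randn(1, d, requires_grad=True)
    h = x
    for W in Ws:
        out = torch.relu(h @ W.T)
        h = h + out if residual else out   # shortcut keeps the "+1" term
    h.sum().backward()
    return x.grad.norm().item()

plain_grad = input_grad_norm(residual=False)
res_grad = input_grad_norm(residual=True)
# The residual stack preserves a usable gradient at the input;
# the plain stack attenuates it toward zero.
```

With shortcuts, the identity path alone delivers a gradient of 1 per coordinate regardless of depth, which is exactly the "constant term 1" in the equation above.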

5.3 Why Learning the "Residual" is Easier

For a more intuitive mathematical explanation: if the optimal transformation at each layer makes only small modifications to the input (a natural assumption for deep networks), the residual function $\mathcal{F}(\mathbf{x})$ should output values close to zero.

Since weight matrices are initialized close to zero, each Residual Block performs a mapping close to identity at the beginning of training. This can be interpreted as the deep network behaving like a shallow network early in training, with each block gradually learning useful transformations over time.
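This near-identity behavior at initialization can be demonstrated directly. A minimal sketch, assuming a small fully connected residual branch with weights drawn at an illustrative std of 1e-3:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# When the residual branch's weights are near zero, the whole block
# is close to the identity mapping.
d = 16
branch = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
for p in branch.parameters():
    nn.init.normal_(p, std=1e-3)            # near-zero init (illustrative)

x = torch.randn(8, d)
y = x + branch(x)                           # residual block: x + F(x)
deviation = (y - x).abs().max().item()      # distance from identity
```

The output deviates from the input by only a tiny amount, so early in training the deep network indeed behaves like a much shallower one.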


6. Architecture Details

6.1 Overall Structure

ResNet is based on VGGNet's design philosophy with the addition of Shortcut Connections. All ResNet variants share the following common structure:

  1. conv1: 7x7 Convolution, stride 2, 64 filters, BatchNorm, ReLU
  2. Max Pooling: 3x3, stride 2
  3. conv2_x through conv5_x: Stacks of Residual Blocks
  4. Global Average Pooling: Reduces feature maps to 1x1
  5. Fully Connected Layer: 1000-class Softmax

When the feature map size is halved (at the first block of conv3_x, conv4_x, conv5_x), the number of channels doubles. Downsampling is performed via stride-2 convolution.

6.2 Basic Block (ResNet-18, ResNet-34)

The Basic Block consists of two 3x3 convolutions.

Input (C channels)
  |
  |---> 3x3 Conv, C filters, BN, ReLU
  |     3x3 Conv, C filters, BN
  |
  +---> (Identity Shortcut)
  |
  + <-- Element-wise Addition
  |
  ReLU
  |
Output (C channels)

Batch Normalization is applied after each convolution, and ReLU is applied after the addition.

6.3 Bottleneck Block (ResNet-50, ResNet-101, ResNet-152)

For ResNets with 50 or more layers, a Bottleneck structure is used for computational efficiency. It consists of three convolutions (1x1, 3x3, 1x1).

Input (4C channels)
  |
  |---> 1x1 Conv, C filters, BN, ReLU    (Channel reduction: 4C -> C)
  |     3x3 Conv, C filters, BN, ReLU    (Spatial processing)
  |     1x1 Conv, 4C filters, BN         (Channel restoration: C -> 4C)
  |
  +---> (Identity Shortcut)
  |
  + <-- Element-wise Addition
  |
  ReLU
  |
Output (4C channels)

The key idea of the Bottleneck is to reduce the channel count to 1/4 using 1x1 convolution, perform the expensive 3x3 convolution, then restore the channels with another 1x1 convolution. Thanks to this structure, ResNet-50 is deeper than ResNet-34 but maintains a similar level of FLOPs.
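The efficiency claim can be verified with quick parameter arithmetic. The counts below ignore BatchNorm parameters and biases, and compare a hypothetical Basic Block against the Bottleneck Block at the same 256-channel width:

```python
# Parameters of a 3x3 conv = k*k*in_channels*out_channels (no bias).

# Hypothetical Basic Block at 256 channels: two 3x3 convs.
basic = 3 * 3 * 256 * 256 + 3 * 3 * 256 * 256

# Bottleneck: 1x1 reduce (256 -> 64), 3x3 at 64, 1x1 restore (64 -> 256).
bottleneck = 1 * 1 * 256 * 64 + 3 * 3 * 64 * 64 + 1 * 1 * 64 * 256

ratio = basic / bottleneck   # roughly 17x fewer parameters per block
```

This is why three-layer Bottleneck Blocks can be stacked much deeper than two-layer Basic Blocks at a comparable budget.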

6.4 Architecture Comparison Table

| Layer | Output Size | ResNet-18 | ResNet-34 | ResNet-50 | ResNet-101 | ResNet-152 |
|-------|-------------|-----------|-----------|-----------|------------|------------|
| conv1 | 112x112 | 7x7, 64, stride 2 | 7x7, 64, stride 2 | 7x7, 64, stride 2 | 7x7, 64, stride 2 | 7x7, 64, stride 2 |
| pool | 56x56 | 3x3 max pool, stride 2 | 3x3 max pool, stride 2 | 3x3 max pool, stride 2 | 3x3 max pool, stride 2 | 3x3 max pool, stride 2 |
| conv2_x | 56x56 | [3x3, 64] x2 | [3x3, 64] x3 | [1x1, 64; 3x3, 64; 1x1, 256] x3 | [1x1, 64; 3x3, 64; 1x1, 256] x3 | [1x1, 64; 3x3, 64; 1x1, 256] x3 |
| conv3_x | 28x28 | [3x3, 128] x2 | [3x3, 128] x4 | [1x1, 128; 3x3, 128; 1x1, 512] x4 | [1x1, 128; 3x3, 128; 1x1, 512] x4 | [1x1, 128; 3x3, 128; 1x1, 512] x8 |
| conv4_x | 14x14 | [3x3, 256] x2 | [3x3, 256] x6 | [1x1, 256; 3x3, 256; 1x1, 1024] x6 | [1x1, 256; 3x3, 256; 1x1, 1024] x23 | [1x1, 256; 3x3, 256; 1x1, 1024] x36 |
| conv5_x | 7x7 | [3x3, 512] x2 | [3x3, 512] x3 | [1x1, 512; 3x3, 512; 1x1, 2048] x3 | [1x1, 512; 3x3, 512; 1x1, 2048] x3 | [1x1, 512; 3x3, 512; 1x1, 2048] x3 |
| - | 1x1 | Global Average Pool, 1000-d FC, Softmax (all variants) | | | | |

6.5 Parameter Count and Computational Cost

| Model | Layers | Parameters | FLOPs |
|-------|--------|------------|-------|
| VGG-19 | 19 | 144M | 19.6B |
| ResNet-18 | 18 | 11.7M | 1.8B |
| ResNet-34 | 34 | 21.8M | 3.6B |
| ResNet-50 | 50 | 25.6M | 3.8B |
| ResNet-101 | 101 | 44.5M | 7.6B |
| ResNet-152 | 152 | 60.2M | 11.3B |

A notable observation is that ResNet-152 is 8 times deeper than VGG-19 yet requires fewer FLOPs and less than half the parameters. This is because VGGNet uses the majority of its parameters in the final Fully Connected layers, whereas ResNet uses Global Average Pooling to dramatically reduce FC layer parameters.
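The point about where the parameters live can be made concrete with back-of-the-envelope arithmetic, using the standard VGG-19 and ResNet-152 head shapes (biases ignored):

```python
# VGG-19 head: flatten(512 x 7 x 7) -> 4096 -> 4096 -> 1000.
vgg_fc = 512 * 7 * 7 * 4096 + 4096 * 4096 + 4096 * 1000

# ResNet-152 head: Global Average Pooling leaves a 2048-d vector -> 1000.
resnet_fc = 2048 * 1000

# VGG's FC layers alone hold roughly 124M parameters -- most of the model --
# while ResNet's classification head holds only about 2M.
```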


7. Experimental Results

7.1 ImageNet Classification

Confirming Degradation in Plain Networks

First, the paper confirmed the degradation problem in Plain Networks without shortcuts.

| Model | Top-1 Error (%) | Top-5 Error (%) |
|-------|-----------------|-----------------|
| Plain-18 | 27.94 | - |
| Plain-34 | 28.54 | - |

The 34-layer Plain Network shows a Top-1 error 0.6 percentage points higher than the 18-layer version. This is the degradation problem.

Effect of Residual Networks

Adding Shortcut Connections to the same architecture:

| Model | Top-1 Error (%) | Top-5 Error (%) |
|-------|-----------------|-----------------|
| ResNet-18 | 27.88 | - |
| ResNet-34 | 25.03 | - |

ResNet-34 achieved a Top-1 error 2.85 percentage points lower than ResNet-18. The degradation problem observed in Plain Networks was completely resolved, with clear performance improvements as depth increased.

Bottleneck ResNet Results (10-crop Testing)

| Model | Top-1 Error (%) | Top-5 Error (%) |
|-------|-----------------|-----------------|
| ResNet-50 | 22.85 | 6.71 |
| ResNet-101 | 21.75 | 6.05 |
| ResNet-152 | 21.43 | 5.71 |

ResNet-152's single-model Top-5 Error was 4.49% (Multi-scale, Multi-crop), and an ensemble of 6 models achieved 3.57%, securing first place in the ImageNet ILSVRC 2015 Classification track.

Comparison with VGG and GoogLeNet

| Model | Top-5 Error (%) | Ensemble Top-5 Error (%) |
|-------|-----------------|--------------------------|
| VGG-16 | 7.3 | - |
| GoogLeNet | 6.7 | - |
| ResNet-152 (single model) | 4.49 | - |
| ResNet Ensemble (6 models) | - | 3.57 |

7.2 CIFAR-10 Experiments

The degradation problem was also confirmed and ResNet's effectiveness validated on the CIFAR-10 dataset (32x32 images, 10 classes). The CIFAR-10 ResNet differs from the ImageNet version: the first layer is a 3x3 convolution, and it uses {n, n, n} Residual Blocks across 3 stages (with feature map sizes of 32x32, 16x16, and 8x8 respectively).

| Model | Layers | Error (%) |
|-------|--------|-----------|
| ResNet-20 | 20 | 8.75 |
| ResNet-32 | 32 | 7.51 |
| ResNet-44 | 44 | 7.17 |
| ResNet-56 | 56 | 6.97 |
| ResNet-110 | 110 | 6.43 |
| ResNet-1202 | 1202 | 7.93 |

Performance improved consistently up to 110 layers. The 1202-layer network had low training error but higher test error than 110 layers, which the paper attributed to overfitting. With 19.4M parameters, it was excessively large relative to the small dataset (50,000 training images). The paper did not apply regularization techniques (such as Dropout) and noted that applying them could yield improvements.

7.3 COCO Object Detection and Segmentation

ResNet's effectiveness was validated beyond Image Classification in Object Detection and Segmentation.

PASCAL VOC and COCO Detection

Replacing Faster R-CNN's backbone from VGG-16 to ResNet-101:

  • COCO Detection: 6.0% improvement in mAP@[.5, .95] over VGG-16 (a relative 28% improvement)
  • 1st place in ILSVRC 2015 Detection Task
  • 1st place in COCO 2015 Detection Task

COCO Segmentation

  • 1st place in COCO 2015 Segmentation Task

These results demonstrated that ResNet is not merely a classification-specific model but serves as a general-purpose feature extractor providing strong performance across diverse vision tasks.


8. Implementation Details

8.1 He Initialization

Kaiming He, the first author of the ResNet paper, had already proposed a weight initialization method suitable for ReLU networks prior to ResNet ("Delving Deep into Rectifiers", He et al., 2015).

$$W \sim \mathcal{N}\!\left(0, \frac{2}{n_{in}}\right)$$

Where $n_{in}$ is the number of input units of the layer. The variance is set to $\frac{2}{n_{in}}$ to account for ReLU zeroing out half of the input (the negative portion). Xavier Initialization (variance $\frac{1}{n_{in}}$) is appropriate for Sigmoid/Tanh but causes the variance to shrink progressively with ReLU.

He Initialization maintains consistent output variance across layers, preventing signal vanishing or explosion during the forward pass.
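The variance-preservation property can be checked empirically. A sketch that pushes a signal through a deep ReLU stack under He vs. Xavier scaling (the width, depth, and batch size are illustrative):

```python
import torch

torch.manual_seed(0)
n, depth = 256, 30

def output_std(scale):
    """Std of activations after `depth` linear+ReLU layers with iid weights."""
    x = torch.randn(1024, n)
    for _ in range(depth):
        W = torch.randn(n, n) * scale
        x = torch.relu(x @ W.T)
    return x.std().item()

he_std = output_std((2.0 / n) ** 0.5)      # He: Var(W) = 2 / n_in
xavier_std = output_std((1.0 / n) ** 0.5)  # Xavier: Var(W) = 1 / n_in
# He scaling keeps the activation magnitude roughly constant; Xavier
# scaling shrinks it by about sqrt(2) per ReLU layer, vanishing with depth.
```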

8.2 Batch Normalization

ResNet applies Batch Normalization (BN) after every convolution layer.

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$

Where:

  • $\mu_B$, $\sigma_B^2$: Mean and variance of the mini-batch
  • $\gamma$, $\beta$: Learnable scale and shift parameters
  • $\epsilon$: A small constant for numerical stability

The roles of BN are:

  • Mitigating Internal Covariate Shift: Stabilizes training by normalizing the input distribution of each layer
  • Regularization Effect: Mini-batch-level normalization adds slight noise, acting as a regularizer
  • Enabling Higher Learning Rates: Stabilized distributions allow the use of higher learning rates

In ResNet, the ordering of BN is Conv -> BN -> ReLU (post-activation). This ordering was later improved in the follow-up Pre-activation ResNet.
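The BN equations can be verified by hand against PyTorch's `nn.BatchNorm2d`. In training mode with default initialization, $\gamma = 1$ and $\beta = 0$, so the module should reproduce the bare normalization:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm2d(3)          # training mode; gamma=1, beta=0 at init
x = torch.randn(4, 3, 8, 8)

# Manual computation: per-channel mean/variance over (N, H, W),
# using the biased variance as BN does during normalization.
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
x_hat = (x - mu) / torch.sqrt(var + bn.eps)

out = bn(x)
max_diff = (out - x_hat).abs().max().item()   # should be ~0
```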

8.3 Training Schedule

Training configuration on ImageNet:

HyperparameterValue
OptimizerSGD with Momentum
Momentum0.9
Weight Decay0.0001
Batch Size256
Initial Learning Rate0.1
LR ScheduleDivided by 10 every 30 epochs
Total Epochs~90
Data AugmentationRandom Crop (224x224), Horizontal Flip, Color Jittering
PreprocessingPer-pixel Mean Subtraction

During training, images are randomly resized within the [256, 480] range and then Random Cropped to 224x224. At test time, 10-crop testing (4 corners + center + horizontal flips of each) is used, and for multi-scale testing, fully convolutional inference is performed at sizes {224, 256, 384, 480, 640}.
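The optimizer and learning-rate schedule in the table map directly onto PyTorch's standard APIs. A sketch, with a placeholder model standing in for ResNet (the actual training loop is omitted):

```python
import torch
import torch.nn as nn

# SGD with momentum 0.9, weight decay 1e-4, LR 0.1 divided by 10
# every 30 epochs, for ~90 epochs total.
model = nn.Linear(10, 10)   # placeholder model for illustration

optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4
)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

lrs = []
for epoch in range(90):
    # ... one epoch of training would go here ...
    optimizer.step()        # must precede scheduler.step()
    scheduler.step()
    lrs.append(optimizer.param_groups[0]['lr'])
# lrs steps through 0.1 -> 0.01 -> 0.001 -> 0.0001 at epoch boundaries.
```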

8.4 Absence of Dropout

Interestingly, ResNet does not use Dropout. Batch Normalization provides sufficient regularization, and the Bottleneck structure inherently limits the number of parameters. Global Average Pooling also dramatically reduces FC layer parameters, lowering the risk of overfitting.


9. PyTorch Implementation

9.1 Basic Block

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Basic Residual Block used in ResNet-18 and ResNet-34"""
    expansion = 1  # output channels = out_channels * expansion

    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        # First 3x3 Conv (downsampling possible via stride)
        self.conv1 = nn.Conv2d(
            in_channels, out_channels, kernel_size=3,
            stride=stride, padding=1, bias=False
        )
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

        # Second 3x3 Conv
        self.conv2 = nn.Conv2d(
            out_channels, out_channels, kernel_size=3,
            stride=1, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Shortcut: 1x1 Conv projection when dimensions differ
        self.downsample = downsample

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        # Shortcut Connection
        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity  # F(x) + x
        out = self.relu(out)

        return out

9.2 Bottleneck Block

class Bottleneck(nn.Module):
    """Bottleneck Block used in ResNet-50, ResNet-101, ResNet-152"""
    expansion = 4  # output channels = mid channels * 4

    def __init__(self, in_channels, mid_channels, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        out_channels = mid_channels * self.expansion

        # 1x1 Conv: Channel reduction (Squeeze)
        self.conv1 = nn.Conv2d(
            in_channels, mid_channels, kernel_size=1, bias=False
        )
        self.bn1 = nn.BatchNorm2d(mid_channels)

        # 3x3 Conv: Spatial processing (downsampling possible via stride)
        self.conv2 = nn.Conv2d(
            mid_channels, mid_channels, kernel_size=3,
            stride=stride, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(mid_channels)

        # 1x1 Conv: Channel restoration (Expand)
        self.conv3 = nn.Conv2d(
            mid_channels, out_channels, kernel_size=1, bias=False
        )
        self.bn3 = nn.BatchNorm2d(out_channels)

        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample

    def forward(self, x):
        identity = x

        # 1x1 -> BN -> ReLU
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        # 3x3 -> BN -> ReLU
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        # 1x1 -> BN
        out = self.conv3(out)
        out = self.bn3(out)

        # Shortcut Connection
        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity  # F(x) + x
        out = self.relu(out)

        return out

9.3 Full ResNet Model

class ResNet(nn.Module):
    def __init__(self, block, layers, num_classes=1000):
        """
        Args:
            block: BasicBlock or Bottleneck
            layers: Number of blocks per stage [conv2_x, conv3_x, conv4_x, conv5_x]
            num_classes: Number of classification classes
        """
        super(ResNet, self).__init__()
        self.in_channels = 64

        # conv1: 7x7, stride 2
        self.conv1 = nn.Conv2d(
            3, 64, kernel_size=7, stride=2, padding=3, bias=False
        )
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        # conv2_x through conv5_x
        self.layer1 = self._make_layer(block, 64, layers[0], stride=1)
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)

        # Classification Head
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        # He Initialization
        self._initialize_weights()

    def _make_layer(self, block, mid_channels, num_blocks, stride):
        downsample = None
        out_channels = mid_channels * block.expansion

        # Downsampling needed at the first block
        if stride != 1 or self.in_channels != out_channels:
            downsample = nn.Sequential(
                nn.Conv2d(
                    self.in_channels, out_channels,
                    kernel_size=1, stride=stride, bias=False
                ),
                nn.BatchNorm2d(out_channels),
            )

        layers = []
        layers.append(block(self.in_channels, mid_channels, stride, downsample))
        self.in_channels = out_channels

        for _ in range(1, num_blocks):
            layers.append(block(self.in_channels, mid_channels))

        return nn.Sequential(*layers)

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # He Initialization (fan_out mode)
                nn.init.kaiming_normal_(
                    m.weight, mode='fan_out', nonlinearity='relu'
                )
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        # Stem
        x = self.conv1(x)       # 224x224 -> 112x112
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)     # 112x112 -> 56x56

        # Residual Stages
        x = self.layer1(x)      # 56x56
        x = self.layer2(x)      # 28x28
        x = self.layer3(x)      # 14x14
        x = self.layer4(x)      # 7x7

        # Classification Head
        x = self.avgpool(x)     # 1x1
        x = torch.flatten(x, 1)
        x = self.fc(x)

        return x


# Model creation functions
def resnet18(num_classes=1000):
    return ResNet(BasicBlock, [2, 2, 2, 2], num_classes)

def resnet34(num_classes=1000):
    return ResNet(BasicBlock, [3, 4, 6, 3], num_classes)

def resnet50(num_classes=1000):
    return ResNet(Bottleneck, [3, 4, 6, 3], num_classes)

def resnet101(num_classes=1000):
    return ResNet(Bottleneck, [3, 4, 23, 3], num_classes)

def resnet152(num_classes=1000):
    return ResNet(Bottleneck, [3, 8, 36, 3], num_classes)

9.4 Usage Example

# Create ResNet-50 and perform a Forward Pass
model = resnet50(num_classes=1000)
x = torch.randn(1, 3, 224, 224)  # Batch=1, RGB, 224x224
output = model(x)
print(f"Output shape: {output.shape}")  # torch.Size([1, 1000])

# Check parameter count
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")
# ResNet-50: approximately 25,557,032 (25.6M)

# Using PyTorch official pre-trained model
import torchvision.models as models
resnet50_pretrained = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

10. Pre-activation ResNet: Identity Mappings in Deep Residual Networks

10.1 Motivation for the Follow-up Paper

Shortly after the ResNet paper was published, the same authors (He et al.) presented "Identity Mappings in Deep Residual Networks" at ECCV 2016. This paper demonstrated that rearranging the order of operations within a Residual Block enables more effective training of even deeper networks (1001 layers).

10.2 Original vs Pre-activation

Original ResNet (Post-activation):

$$\mathbf{x}_{l+1} = \text{ReLU}(\mathcal{F}(\mathbf{x}_l) + \mathbf{x}_l)$$

In this structure, since ReLU is positioned after the addition, signals passing through the shortcut path are also affected by ReLU. This prevents the shortcut from being a true identity mapping, weakening the Gradient Highway effect.

Pre-activation ResNet:

$$\mathbf{x}_{l+1} = \mathbf{x}_l + \mathcal{F}(\text{ReLU}(\text{BN}(\mathbf{x}_l)))$$

The operation order within the residual branch is changed to BN -> ReLU -> Conv -> BN -> ReLU -> Conv. This ensures the shortcut path is a pure identity mapping.

10.3 Structure Comparison

[Original ResNet]                [Pre-activation ResNet]
Input --+                        Input --+
        |                                |
     Conv                             BN
        |                                |
      BN                             ReLU
        |                                |
     ReLU                            Conv
        |                                |
     Conv                             BN
        |                                |
      BN                             ReLU
        |                                |
   (+) <-+ shortcut                   Conv
        |                                |
     ReLU                           (+) <-+ shortcut
        |                                |
     Output                          Output

10.4 Mathematical Advantages

For the pre-activation structure, forward propagation is:

$$\mathbf{x}_L = \mathbf{x}_l + \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i)$$

Backward propagation:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}_L} \cdot \left(1 + \frac{\partial}{\partial \mathbf{x}_l} \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i)\right)$$

Since the shortcut is a pure identity, the constant term 1 is exactly preserved. In the original ResNet, the ReLU after the addition causes this term to deviate from exactly 1.

10.5 Experimental Results

| Model | CIFAR-10 Error (%) | CIFAR-100 Error (%) |
|-------|--------------------|---------------------|
| ResNet-110 (original) | 6.43 | - |
| ResNet-1001 (original) | ~7.61 | - |
| ResNet-1001 (pre-activation) | 4.62 | 22.71 |

The pre-activation structure showed particularly significant performance improvements in very deep networks (1001 layers) compared to the original structure. This experimentally confirmed that pure identity mappings are critically important for gradient flow.

10.6 PyTorch Implementation

class PreActBasicBlock(nn.Module):
    """Pre-activation Basic Block (BN -> ReLU -> Conv)"""
    expansion = 1

    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(PreActBasicBlock, self).__init__()
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv1 = nn.Conv2d(
            in_channels, out_channels, kernel_size=3,
            stride=stride, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(
            out_channels, out_channels, kernel_size=3,
            stride=1, padding=1, bias=False
        )
        self.downsample = downsample

    def forward(self, x):
        identity = x

        # Pre-activation: BN -> ReLU -> Conv
        out = self.bn1(x)
        out = self.relu(out)

        if self.downsample is not None:
            identity = self.downsample(out)

        out = self.conv1(out)
        out = self.bn2(out)
        out = self.relu(out)
        out = self.conv2(out)

        out += identity
        return out

11. Impact and Follow-up Research

The Residual Learning paradigm proposed by ResNet became the foundation for numerous subsequent architectural innovations. Let us examine the major follow-up works.

11.1 ResNeXt (2017)

"Aggregated Residual Transformations for Deep Neural Networks" - Xie et al., Facebook AI Research

ResNeXt introduced a new dimension called Cardinality (number of groups) to ResNet's Residual Block. It performs multiple transformation paths in parallel within a single block, then sums them.

\mathcal{F}(\mathbf{x}) = \sum_{i=1}^{C} \mathcal{T}_i(\mathbf{x})

where C is the cardinality (e.g., 32) and \mathcal{T}_i is the transformation in each path. In practice, this is efficiently implemented using Grouped Convolution.

ResNeXt-101 (32x4d) achieved higher accuracy than ResNet-101 with the same computational cost, demonstrating that cardinality is a more effective dimension than width (channel count) or depth (layer count).
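The grouped-convolution trick can be sketched in PyTorch; the channel sizes below are illustrative, chosen to mirror a 32-group block:

```python
import torch
import torch.nn as nn

# A 3x3 conv with groups=32 is equivalent to 32 parallel 3x3 convs,
# each seeing 128/32 = 4 input channels and producing 4 output channels.
grouped = nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=32, bias=False)
dense   = nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=1,  bias=False)

# Parameter count: dense = 128*128*9 weights; grouped = dense / 32,
# since each output channel only connects to its own group's inputs.
print(sum(p.numel() for p in dense.parameters()))    # 147456
print(sum(p.numel() for p in grouped.parameters()))  # 4608

x = torch.randn(1, 128, 8, 8)
print(grouped(x).shape)  # torch.Size([1, 128, 8, 8])
```

This is why cardinality comes "for free": the 32 parallel paths cost no more than one dense convolution of matching capacity.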

11.2 DenseNet (2017)

"Densely Connected Convolutional Networks" - Huang et al., Cornell/Facebook

DenseNet extends Residual Connections to the extreme. Each layer is directly connected to all preceding layers. While ResNet uses element-wise addition, DenseNet uses channel-wise concatenation.

\mathbf{x}_l = \mathcal{H}_l([\mathbf{x}_0, \mathbf{x}_1, ..., \mathbf{x}_{l-1}])

This structure maximizes feature reuse and improves parameter efficiency. DenseNet-121 achieved comparable performance to ResNet-50 with fewer parameters.
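The channel bookkeeping implied by concatenation is easy to compute: layer l's input is the block input plus the k-channel outputs of all preceding layers. A small sketch (c0 = 64 and growth rate k = 32 match DenseNet-121's first dense block):

```python
def dense_block_input_channels(c0, growth_rate, num_layers):
    """Input channel count seen by each layer in a dense block.

    Layer l receives the concatenation of the block input (c0 channels)
    and the growth_rate-channel outputs of all l preceding layers.
    """
    return [c0 + growth_rate * l for l in range(num_layers)]

# DenseNet-121's first dense block: 64 input channels, k=32, 6 layers
print(dense_block_input_channels(64, 32, 6))
# [64, 96, 128, 160, 192, 224]
```

The linear growth in input channels is also exactly why concatenation's memory footprint grows so quickly compared to ResNet's fixed-width addition.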

11.3 SENet (2018)

"Squeeze-and-Excitation Networks" - Hu et al., Momenta

SENet added an SE Module to Residual Blocks that models inter-channel relationships. It learns the importance of each channel and recalibrates the weights accordingly.

\mathbf{s} = \sigma(\mathbf{W}_2 \cdot \text{ReLU}(\mathbf{W}_1 \cdot \text{GAP}(\mathbf{x}))), \qquad \tilde{\mathbf{x}} = \mathbf{s} \odot \mathbf{x}

where GAP is Global Average Pooling, \sigma is the Sigmoid function, and \odot is channel-wise multiplication. SENet won first place in ILSVRC 2017 Classification with a Top-5 Error of 2.251%.
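A minimal sketch of the SE mechanism described above (the `SEModule` class and its layer names are ours; `reduction=16` is the paper's default):

```python
import torch
import torch.nn as nn

class SEModule(nn.Module):
    """Squeeze-and-Excitation: GAP -> FC (reduce) -> ReLU -> FC (restore)
    -> Sigmoid, then rescale each channel by its learned importance."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))           # squeeze: global average pooling
        s = torch.relu(self.fc1(s))
        s = torch.sigmoid(self.fc2(s))   # per-channel importance in (0, 1)
        return x * s.view(b, c, 1, 1)    # excitation: channel-wise rescaling

x = torch.randn(2, 64, 8, 8)
out = SEModule(64)(x)
print(out.shape)  # torch.Size([2, 64, 8, 8])
```

In a full SE-ResNet, this module is applied to the residual branch's output just before the addition with the shortcut.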

11.4 EfficientNet (2019)

"EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks" - Tan & Le, Google Brain

EfficientNet proposed Compound Scaling, which scales width, depth, and resolution simultaneously in a balanced manner. It is based on MBConv (Mobile Inverted Bottleneck) blocks, which also use Residual Connections.

\text{depth}: d = \alpha^\phi, \quad \text{width}: w = \beta^\phi, \quad \text{resolution}: r = \gamma^\phi, \qquad \text{s.t.} \quad \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2

EfficientNet-B7 achieved state-of-the-art ImageNet accuracy at the time with 8.4 times fewer parameters than the best previous ConvNet.
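Plugging in the paper's grid-searched base coefficients (α = 1.2, β = 1.1, γ = 1.15), compound scaling reduces to a few lines; the constraint α·β²·γ² ≈ 2 means each increment of φ roughly doubles FLOPs:

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """EfficientNet compound scaling. Returns (depth, width, resolution)
    multipliers; defaults are the paper's grid-searched coefficients,
    chosen so that alpha * beta^2 * gamma^2 ~= 2."""
    return alpha ** phi, beta ** phi, gamma ** phi

# FLOPs scale roughly with d * w^2 * r^2, so one step of phi ~ 2x compute
d, w, r = compound_scale(1)
print(round(d * w**2 * r**2, 3))  # 1.92 (i.e., approximately 2)
```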

11.5 ConvNeXt (2022)

"A ConvNet for the 2020s" - Liu et al., Facebook AI Research

ConvNeXt applied Vision Transformer (ViT) design principles to CNNs to create a "modernized ResNet." Starting from ResNet-50, the following changes were applied sequentially:

  1. Modernized Training Recipe (300 epochs, AdamW, Mixup, Cutmix, etc.)
  2. Stage Ratio Change: (3, 4, 6, 3) -> (3, 3, 9, 3)
  3. "Patchify" Stem: 7x7 Conv -> 4x4 Conv, stride 4
  4. ResNeXt-style Grouped Convolution
  5. Inverted Bottleneck
  6. Large Kernel Size: 3x3 -> 7x7 Depthwise Conv
  7. Activation Function: ReLU -> GELU
  8. Normalization: BN -> Layer Normalization

ConvNeXt-T achieved comparable performance to Swin-T (82.1% top-1 accuracy), demonstrating that pure CNN architectures can compete with Transformers. This research reaffirmed how robust ResNet's design foundation is.


12. Residual Connections in Modern Architectures

12.1 Skip Connections in Transformers

The Transformer architecture proposed in Vaswani et al.'s "Attention Is All You Need" (2017) uses Residual Connections in every sub-layer.

\text{Output} = \text{LayerNorm}(\mathbf{x} + \text{SubLayer}(\mathbf{x}))

Where SubLayer is either Multi-Head Attention or a Feed-Forward Network. For the same reasons proven in ResNet, training deep Transformers without these Residual Connections is virtually impossible.

12.2 Pre-LayerNorm and Post-LayerNorm

The Transformer world has a similar debate to ResNet's pre-activation vs. post-activation.

Post-LayerNorm (original Transformer):

\mathbf{x}_{l+1} = \text{LN}(\mathbf{x}_l + \text{SubLayer}(\mathbf{x}_l))

Pre-LayerNorm (standard since GPT-2):

\mathbf{x}_{l+1} = \mathbf{x}_l + \text{SubLayer}(\text{LN}(\mathbf{x}_l))

Pre-LayerNorm operates on the same principle as ResNet's pre-activation, keeping the shortcut path as a pure identity to improve gradient flow. Most state-of-the-art Large Language Models including GPT-2 and GPT-3 use Pre-LayerNorm.
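The two orderings differ only in where LayerNorm sits relative to the shortcut, which is easiest to see side by side in PyTorch (the `Linear` layer stands in for the attention/FFN sub-layer; it is purely illustrative):

```python
import torch
import torch.nn as nn

dim = 16
ln = nn.LayerNorm(dim)
sublayer = nn.Linear(dim, dim)  # stand-in for attention or FFN

def post_ln(x):
    # Original Transformer: LN sits ON the residual path, so the
    # shortcut is no longer a pure identity (like post-activation ResNet).
    return ln(x + sublayer(x))

def pre_ln(x):
    # GPT-2 style: LN is inside the residual branch, so the shortcut
    # stays a pure identity (like pre-activation ResNet).
    return x + sublayer(ln(x))

x = torch.randn(2, dim)
print(post_ln(x).shape, pre_ln(x).shape)
```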

12.3 Residual Connections in Diffusion Models

The U-Net architecture of Denoising Diffusion Probabilistic Models (DDPM) also uses Skip Connections in each Residual Block. Long skip connections between the U-Net's encoder and decoder are combined with residual skip connections within blocks, enabling effective utilization of features at various scales.

12.4 Vision Transformer (ViT)

ViT (Vision Transformer) divides images into 16x16 patches and feeds them into a Transformer Encoder. Each Transformer block naturally uses Residual Connections, and without them, training ViTs with more than 12 layers becomes difficult.

12.5 Key Lessons

The most important legacy of ResNet is not a specific architecture but the design principle of Residual Connections. This principle can be summarized as follows:

  1. Set identity mapping as the default: Even if the network learns nothing, it should at least be able to pass the input through unchanged.
  2. Ensure a Gradient Highway: Create paths through which gradients can flow directly from the loss to every layer.
  3. Depth is a (nearly) free dimension: With Residual Connections, extra layers can at worst learn to approximate the identity, so added depth rarely hurts optimization.

This principle applies universally regardless of architecture type, spanning CNNs, Transformers, Diffusion Models, State Space Models, and beyond.
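The first lesson can be demonstrated directly: zero out the residual branch and the block collapses to the identity, which is exactly the "safe default" that plain deep networks lack (the `Linear` branch below is illustrative):

```python
import torch
import torch.nn as nn

# If the residual branch contributes nothing (all weights at zero),
# the block reduces to the identity mapping.
branch = nn.Linear(8, 8)
nn.init.zeros_(branch.weight)
nn.init.zeros_(branch.bias)

x = torch.randn(4, 8)
y = x + branch(x)         # residual block with a zeroed branch
print(torch.equal(y, x))  # True: the input passes through unchanged
```

This is also why some training recipes initialize the last layer of each residual branch to zero: every block starts as the identity and only deviates as learning demands.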


13. Limitations and Criticisms

13.1 Inefficiency of Feature Reuse

According to the study by Veit et al. (2016), "Residual Networks Behave Like Ensembles of Relatively Shallow Networks," most layers in ResNet transmit information through very short paths (shallow paths), and the contribution of very deep paths is minimal. This raises questions about whether all 152 layers are being utilized efficiently.
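Veit et al.'s unrolled-paths view is easy to quantify: an n-block ResNet contains 2^n implicit paths, and the number of paths traversing exactly k residual branches is C(n, k), a binomial distribution. A small sketch, using the 54 residual blocks of CIFAR ResNet-110:

```python
from math import comb

def path_length_distribution(num_blocks):
    """Fraction of the 2^n unrolled paths that pass through exactly
    k residual branches, for k = 0..n (binomial distribution)."""
    total = 2 ** num_blocks
    return {k: comb(num_blocks, k) / total for k in range(num_blocks + 1)}

dist = path_length_distribution(54)  # ResNet-110 on CIFAR: 54 residual blocks

# Almost all paths have length near n/2 = 27; the single path using
# all 54 blocks has probability 2^-54 -- effectively never exercised.
print(max(dist, key=dist.get))  # 27
```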

13.2 Element-wise Addition of Feature Maps

The authors of DenseNet argued that ResNet's element-wise addition can cause information loss. They contended that concatenation-based DenseNet enables more efficient feature reuse. However, concatenation suffers from the problem of rapidly increasing memory usage.

13.3 Computational Overhead

While parameter counts were reduced through Global Average Pooling and Bottleneck structures, the actual inference speed of very deep ResNets (ResNet-152) is not necessarily faster than VGGNet. Memory access costs and sequential dependencies can become practical bottlenecks.


14. Summary

ResNet is one of the most important papers in the history of deep learning. Its contributions can be summarized as follows:

  1. Discovery and Definition of the Degradation Problem: Clearly identified the phenomenon where training error increases in deeper networks.

  2. Residual Learning Framework: The \mathcal{F}(\mathbf{x}) + \mathbf{x} structure, which explicitly includes identity mapping, enabled successful training of networks with hundreds of layers.

  3. Gradient Highway Theory: Presented the mathematical mechanism by which Skip Connections directly propagate gradients.

  4. Bottleneck Structure: Achieved both depth and efficiency through channel reduction/restoration using 1x1 convolutions.

  5. Overwhelming Experimental Results: Dominated all existing methods on ImageNet (3.57% top-5 error), CIFAR-10, and COCO Detection/Segmentation.

  6. Universal Design Principle: Residual Connections have become an essential element in all major modern deep learning architectures, including Transformers and Diffusion Models.

A single simple addition operation (+ \mathbf{x}) broke through the depth barrier of deep learning and laid the foundation for a decade of AI advancement. ResNet is a quintessential example demonstrating that sometimes the simplest ideas are the most powerful.


15. References

  1. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR 2016. arXiv:1512.03385

  2. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Identity Mappings in Deep Residual Networks. ECCV 2016. arXiv:1603.05027

  3. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV 2015. arXiv:1502.01852

  4. Simonyan, K., & Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015. arXiv:1409.1556

  5. Szegedy, C., et al. (2015). Going Deeper with Convolutions. CVPR 2015. arXiv:1409.4842

  6. Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015. arXiv:1502.03167

  7. Xie, S., et al. (2017). Aggregated Residual Transformations for Deep Neural Networks. CVPR 2017. arXiv:1611.05431

  8. Huang, G., et al. (2017). Densely Connected Convolutional Networks. CVPR 2017. arXiv:1608.06993

  9. Hu, J., et al. (2018). Squeeze-and-Excitation Networks. CVPR 2018. arXiv:1709.01507

  10. Tan, M., & Le, Q. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. ICML 2019. arXiv:1905.11946

  11. Liu, Z., et al. (2022). A ConvNet for the 2020s. CVPR 2022. arXiv:2201.03545

  12. Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017. arXiv:1706.03762

  13. Veit, A., et al. (2016). Residual Networks Behave Like Ensembles of Relatively Shallow Networks. NeurIPS 2016. arXiv:1605.06431

  14. KaimingHe/deep-residual-networks. GitHub Repository

  15. ILSVRC 2015 Results. ImageNet Challenge