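CNN Architectures: A Complete Guide

Convolutional neural networks (CNNs) are the foundation of the deep learning revolution in computer vision. From LeNet in 1998 to ConvNeXt in 2022, with the Vision Transformer arriving in 2020, image-recognition architectures have evolved at a remarkable pace. This guide explains the structural innovation behind each major architecture and shows how to implement it in PyTorch.

1. CNN Fundamentals

Understanding Convolution Intuitively

Convolution extracts local patterns from an image. A small filter (kernel) slides across the image and produces a feature map.

Input image (5x5)   Kernel (3x3)       Output feature map (3x3)
1 1 1 0 0           1 0 1              4 3 4
0 1 1 1 0    *      0 1 0    =         2 4 3
0 0 1 1 1           1 0 1              2 3 4
0 0 1 1 0
0 1 1 0 0

At each position, the output is the sum of the element-wise products of the kernel and the image patch under it. With stride 1 and no padding, a 5x5 input and a 3x3 kernel give a 3x3 output. Deep learning frameworks apply the kernel without flipping it, so strictly speaking the operation is cross-correlation.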
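To make the arithmetic concrete, here is a minimal sketch that reproduces the 3x3 feature map from the diagram using plain Python lists. The helper name `conv2d_valid` is made up for this illustration and is not a PyTorch function.

```python
# Sliding-window cross-correlation written out with plain Python lists.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]


def conv2d_valid(img, ker):
    """Stride 1, no padding: the output is (H - kH + 1) x (W - kW + 1)."""
    kh, kw = len(ker), len(ker[0])
    out_h, out_w = len(img) - kh + 1, len(img[0]) - kw + 1
    return [
        [
            # Sum of element-wise products of the kernel and the patch at (i, j)
            sum(img[i + di][j + dj] * ker[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

`F.conv2d` performs the same computation, vectorized over batches and channels.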

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt

# Visualize a convolution
def visualize_convolution():
    image = torch.tensor([[
        [1., 1., 1., 0., 0.],
        [0., 1., 1., 1., 0.],
        [0., 0., 1., 1., 1.],
        [0., 0., 1., 1., 0.],
        [0., 1., 1., 0., 0.]
    ]]).unsqueeze(0)  # (1, 1, 5, 5)

    # Edge-detection kernel (responds where intensity changes)
    edge_kernel = torch.tensor([[
        [[-1., -1., -1.],
         [-1.,  8., -1.],
         [-1., -1., -1.]]
    ]])  # (1, 1, 3, 3)

    output = F.conv2d(image, edge_kernel, padding=1)

    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    axes[0].imshow(image[0, 0].numpy(), cmap='gray')
    axes[0].set_title('Input image')
    axes[1].imshow(edge_kernel[0, 0].numpy(), cmap='RdYlBu')
    axes[1].set_title('Edge-detection kernel')
    axes[2].imshow(output[0, 0].detach().numpy(), cmap='gray')
    axes[2].set_title('Output feature map')
    plt.tight_layout()
    plt.show()

Kernel, Stride, and Padding

import torch
import torch.nn as nn

# Basic Conv2d parameters
conv = nn.Conv2d(
    in_channels=3,    # number of input channels (RGB = 3)
    out_channels=64,  # number of output channels (number of filters)
    kernel_size=3,    # kernel size (3x3)
    stride=1,         # stride
    padding=1,        # padding ("same" padding for a 3x3 kernel at stride 1)
    bias=True
)

# Output size formula (dilation = 1)
# H_out = floor((H_in + 2*padding - kernel_size) / stride) + 1
def calc_output_size(input_size, kernel_size, stride, padding):
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(calc_output_size(224, 3, 1, 1))   # 224 (same padding)
print(calc_output_size(224, 3, 2, 1))   # 112 (stride 2 halves the size)
print(calc_output_size(224, 7, 2, 3))   # 112 (7x7 stride-2 stem, as used by ResNet)
print(calc_output_size(224, 11, 4, 2))  # 55  (first layer of the AlexNet implementation in section 2)

# Parameter count
# Conv2d: (kernel_h * kernel_w * in_channels + 1) * out_channels
params = (3 * 3 * 3 + 1) * 64
print(f"Conv(3->64, 3x3) parameters: {params:,}")  # 1,792
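The two formulas above are easy to check against PyTorch itself. The sketch below is only a sanity check under stated assumptions: the helper `conv_output_size`, the 64x64 input, and the 8 output channels are arbitrary choices made for this illustration.

```python
import torch
import torch.nn as nn


def conv_output_size(size, kernel_size, stride, padding):
    # Same formula as above, assuming dilation = 1
    return (size + 2 * padding - kernel_size) // stride + 1


checks = []
for k, s, p in [(3, 1, 1), (3, 2, 1), (7, 2, 3), (11, 4, 2)]:
    layer = nn.Conv2d(3, 8, kernel_size=k, stride=s, padding=p)
    with torch.no_grad():
        out = layer(torch.zeros(1, 3, 64, 64))

    predicted_size = conv_output_size(64, k, s, p)
    predicted_params = (k * k * 3 + 1) * 8  # weights plus one bias per filter
    actual_params = sum(t.numel() for t in layer.parameters())

    checks.append((out.shape[-1] == predicted_size,
                   actual_params == predicted_params))
    print(f"k={k} s={s} p={p}: size {out.shape[-1]} (formula {predicted_size}), "
          f"params {actual_params} (formula {predicted_params})")
```

Every row should show the measured value agreeing with the formula.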

Pooling (Max, Average, Global)

import torch
import torch.nn as nn

x = torch.randn(1, 64, 28, 28)

# Max pooling
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
out_max = max_pool(x)  # (1, 64, 14, 14)

# Average pooling
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
out_avg = avg_pool(x)  # (1, 64, 14, 14)

# Global average pooling (GAP): collapses the spatial dimensions to 1x1
gap = nn.AdaptiveAvgPool2d(1)
out_gap = gap(x)                 # (1, 64, 1, 1)
out_gap_flat = out_gap.flatten(1)  # (1, 64)

# Adaptive pooling: you specify the output size
adaptive = nn.AdaptiveAvgPool2d((7, 7))
out_adaptive = adaptive(x)  # (1, 64, 7, 7) regardless of the input size

print(f"Input: {x.shape}")
print(f"Max pooling: {out_max.shape}")
print(f"GAP: {out_gap_flat.shape}")

Computing the Receptive Field

def calculate_receptive_field(layers):
    """
    Compute the receptive field after a stack of layers.
    layers: list of (kernel_size, stride, dilation) tuples
    """
    rf = 1
    jump = 1

    for k, s, d in layers:
        effective_k = d * (k - 1) + 1
        rf = rf + (effective_k - 1) * jump
        jump = jump * s

    return rf

# VGG style (3x3 convolutions plus 2x2 pooling)
vgg_layers = [
    (3, 1, 1),  # conv1
    (3, 1, 1),  # conv2
    (2, 2, 1),  # pool
    (3, 1, 1),  # conv3
    (3, 1, 1),  # conv4
    (2, 2, 1),  # pool
]

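rf = calculate_receptive_field(vgg_layers)
print(f"Receptive field after 6 VGG layers: {rf}x{rf} pixels")  # 16x16

# Note: two stacked 3x3 convolutions have the same receptive field as one 5x5,
# but use 2*(9*C^2) = 18*C^2 weights instead of 25*C^2, a 28% reduction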

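2. A History of CNN Architectures

LeNet-5 (1998, LeCun et al.): The First Practical CNN

Published by Yann LeCun and colleagues in 1998, LeNet-5 was designed for handwritten digit recognition (MNIST) and is widely regarded as the first practical CNN.

Architecture: Input(32x32) -> C1(conv, 6@28x28) -> S2(pool, 6@14x14) -> C3(conv, 16@10x10) -> S4(pool, 16@5x5) -> C5(conv, 120@1x1) -> F6(fc, 84) -> Output(10)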
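The spatial sizes in that chain follow directly from the output-size formula: every convolution is 5x5 with no padding, and every pooling layer is 2x2 with stride 2. A small sketch that traces them (the helper `conv_out` is a name chosen for this illustration):

```python
def conv_out(size, kernel, stride=1, padding=0):
    # floor((size + 2*padding - kernel) / stride) + 1
    return (size + 2 * padding - kernel) // stride + 1


size = 32
trace = [("Input", 1, size)]
for name, channels, kernel, stride in [
    ("C1", 6, 5, 1),    # 5x5 conv
    ("S2", 6, 2, 2),    # 2x2 pooling
    ("C3", 16, 5, 1),   # 5x5 conv
    ("S4", 16, 2, 2),   # 2x2 pooling
    ("C5", 120, 5, 1),  # 5x5 conv over a 5x5 map, leaving 1x1
]:
    size = conv_out(size, kernel, stride)
    trace.append((name, channels, size))

for name, channels, s in trace:
    print(f"{name}: {channels}@{s}x{s}")
# Input: 1@32x32
# C1: 6@28x28
# S2: 6@14x14
# C3: 16@10x10
# S4: 16@5x5
# C5: 120@1x1
```

Because C5 sees a 5x5 map through a 5x5 kernel, it behaves like a fully connected layer over the 16 x 5 x 5 = 400 inputs.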

import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """LeNet-5 with Tanh activations and average pooling (a simplified take on the original design)"""

    def __init__(self, num_classes=10):
        super(LeNet5, self).__init__()

        self.features = nn.Sequential(
            # C1: 1@32x32 -> 6@28x28
            nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=0),
            nn.Tanh(),
            # S2: 6@28x28 -> 6@14x14
            nn.AvgPool2d(kernel_size=2, stride=2),

            # C3: 6@14x14 -> 16@10x10
            nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0),
            nn.Tanh(),
            # S4: 16@10x10 -> 16@5x5
            nn.AvgPool2d(kernel_size=2, stride=2),

            # C5: 16@5x5 -> 120@1x1
            nn.Conv2d(16, 120, kernel_size=5, stride=1, padding=0),
            nn.Tanh(),
        )

        self.classifier = nn.Sequential(
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1)
        x = self.classifier(x)
        return x


model = LeNet5(num_classes=10)
x = torch.randn(4, 1, 32, 32)
out = model(x)
print(f"LeNet-5 output: {out.shape}")  # (4, 10)

total_params = sum(p.numel() for p in model.parameters())
print(f"LeNet-5 total parameters: {total_params:,}")  # 61,706 (about 60K)

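AlexNet (2012, Krizhevsky et al.): The Deep Learning Revival

AlexNet won the 2012 ImageNet competition (ILSVRC) with a top-5 error rate of 15.3%, far ahead of the runner-up's 26.2%, and ushered in the deep learning era.

Key innovations:

ReLU activation (in the paper's CIFAR-10 experiment, about 6x faster than Tanh to reach 25% training error)
Dropout (p = 0.5) in the fully connected layers to reduce overfitting
Data augmentation (random crops, horizontal flips)
Local response normalization (LRN)
Training split across two GPUs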

import torch
import torch.nn as nn

class AlexNet(nn.Module):
    """AlexNet implementation (single-GPU variant without grouped convolutions)"""

    def __init__(self, num_classes=1000):
        super(AlexNet, self).__init__()

        self.features = nn.Sequential(
            # Layer 1: 3@224x224 -> 96@55x55
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),  # 96@27x27

            # Layer 2: 96@27x27 -> 256@27x27
            nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),  # 256@13x13

            # Layer 3: 256@13x13 -> 384@13x13
            nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),

            # Layer 4: 384@13x13 -> 384@13x13
            nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),

            # Layer 5: 384@13x13 -> 256@13x13
            nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),  # 256@6x6
        )

        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))

        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = x.flatten(1)
        x = self.classifier(x)
        return x


model = AlexNet(num_classes=1000)
x = torch.randn(4, 3, 224, 224)
out = model(x)
print(f"AlexNet output: {out.shape}")  # (4, 1000)
total_params = sum(p.numel() for p in model.parameters())
print(f"AlexNet parameters: {total_params:,}")  # ~62M (no grouped convolutions, so slightly above the ~61M usually quoted)

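VGGNet (2014, Simonyan and Zisserman): The Power of Depth

VGGNet, from the Visual Geometry Group at the University of Oxford, uses only 3x3 kernels throughout the network, which made it practical to go much deeper (16 to 19 weight layers).

Why 3x3?

Two stacked 3x3 convolutions cover the same receptive field as one 5x5 (28% fewer parameters)
Three stacked 3x3 convolutions cover the same receptive field as one 7x7 (45% fewer parameters)
The extra non-linearities between the layers increase representational power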
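The percentages in this list can be reproduced with a few lines of arithmetic. The sketch assumes stride-1 convolutions that map C channels to C channels and ignores biases; the function names are chosen for this illustration.

```python
def stacked_3x3(n_layers, channels):
    """Receptive field and weight count of n stacked 3x3 convs (stride 1, C -> C, no bias)."""
    receptive_field = 1 + 2 * n_layers  # each 3x3 layer widens the field by 2 pixels
    weights = n_layers * 3 * 3 * channels * channels
    return receptive_field, weights


def single_conv(kernel, channels):
    """Weight count of one kernel x kernel conv (C -> C, no bias)."""
    return kernel * kernel * channels * channels


C = 64
savings = {}
for n in (2, 3):
    rf, stacked = stacked_3x3(n, C)
    single = single_conv(rf, C)
    savings[rf] = 1 - stacked / single
    print(f"{n} x 3x3 vs one {rf}x{rf}: {stacked:,} vs {single:,} weights "
          f"({savings[rf]:.0%} fewer)")
# 2 x 3x3 vs one 5x5: 73,728 vs 102,400 weights (28% fewer)
# 3 x 3x3 vs one 7x7: 110,592 vs 200,704 weights (45% fewer)
```

The ratio is independent of C: 18/25 for the 5x5 case and 27/49 for the 7x7 case.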

import torch
import torch.nn as nn
from typing import List, Union

class VGG(nn.Module):
    """Generic VGG implementation"""

    def __init__(self, features: nn.Module, num_classes: int = 1000, dropout: float = 0.5):
        super(VGG, self).__init__()
        self.features = features
        self.avgpool = nn.AdaptiveAvgPool2d((7, 7))
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=dropout),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=dropout),
            nn.Linear(4096, num_classes)
        )
        self._initialize_weights()

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = x.flatten(1)
        x = self.classifier(x)
        return x

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)


def make_layers(cfg: List[Union[str, int]], batch_norm: bool = False) -> nn.Sequential:
    layers: List[nn.Module] = []
    in_channels = 3
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            v = int(v)
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
            if batch_norm:
                layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
            else:
                layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v
    return nn.Sequential(*layers)


cfgs = {
    'vgg16': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
    'vgg19': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}

def vgg16(num_classes=1000):
    # batch_norm=True builds the VGG-16-BN variant; the original VGG-16 has no BatchNorm
    return VGG(make_layers(cfgs['vgg16'], batch_norm=True), num_classes=num_classes)

model_vgg16 = vgg16()
x = torch.randn(2, 3, 224, 224)
out = model_vgg16(x)
print(f"VGG-16 output: {out.shape}")  # (2, 1000)
params = sum(p.numel() for p in model_vgg16.parameters())
print(f"VGG-16 parameters: {params:,}")  # ~138M

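GoogLeNet/Inception (2014, Szegedy et al.): Multi-Scale Parallel Processing

The core idea of the Inception module is to apply several kernel sizes (1x1, 3x3, 5x5) in parallel and concatenate the results, capturing features at multiple scales at once. 1x1 convolutions placed in front of the 3x3 and 5x5 branches reduce the channel count and keep the cost manageable.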

import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Basic Inception module (with BatchNorm added; the original GoogLeNet did not use it)"""

    def __init__(self, in_channels, n1x1, n3x3_reduce, n3x3,
                 n5x5_reduce, n5x5, pool_proj):
        super(InceptionModule, self).__init__()

        # 1x1 branch
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_channels, n1x1, kernel_size=1),
            nn.BatchNorm2d(n1x1),
            nn.ReLU(inplace=True)
        )

        # 1x1 bottleneck + 3x3
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, n3x3_reduce, kernel_size=1),
            nn.BatchNorm2d(n3x3_reduce),
            nn.ReLU(inplace=True),
            nn.Conv2d(n3x3_reduce, n3x3, kernel_size=3, padding=1),
            nn.BatchNorm2d(n3x3),
            nn.ReLU(inplace=True)
        )

        # 1x1 bottleneck + 5x5
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, n5x5_reduce, kernel_size=1),
            nn.BatchNorm2d(n5x5_reduce),
            nn.ReLU(inplace=True),
            nn.Conv2d(n5x5_reduce, n5x5, kernel_size=5, padding=2),
            nn.BatchNorm2d(n5x5),
            nn.ReLU(inplace=True)
        )

        # MaxPool + 1x1
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, pool_proj, kernel_size=1),
            nn.BatchNorm2d(pool_proj),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        b1 = self.branch1(x)
        b2 = self.branch2(x)
        b3 = self.branch3(x)
        b4 = self.branch4(x)
        return torch.cat([b1, b2, b3, b4], dim=1)


module = InceptionModule(192, 64, 96, 128, 16, 32, 32)
x = torch.randn(2, 192, 28, 28)
out = module(x)
print(f"Inception output: {out.shape}")  # (2, 256, 28, 28): 64 + 128 + 32 + 32 channels
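The 1x1 bottleneck matters most in the 5x5 branch. As a rough sketch, here is the weight count for that branch with and without the reduction, using the same channel numbers as the example above (192 in, reduced to 16, 32 out). Biases and BatchNorm parameters are ignored, and `conv_weights` is a helper name chosen for this illustration.

```python
def conv_weights(in_ch, out_ch, k):
    # Weights of a k x k convolution; bias and BatchNorm parameters are ignored
    return k * k * in_ch * out_ch


in_channels, n5x5_reduce, n5x5 = 192, 16, 32

# 5x5 convolution applied directly to all 192 input channels
direct = conv_weights(in_channels, n5x5, 5)

# 1x1 reduction to 16 channels, followed by the 5x5 convolution
bottlenecked = (conv_weights(in_channels, n5x5_reduce, 1)
                + conv_weights(n5x5_reduce, n5x5, 5))

print(f"5x5 directly on 192 channels: {direct:,} weights")        # 153,600
print(f"1x1 reduce to 16, then 5x5:   {bottlenecked:,} weights")  # 15,872
print(f"reduction: {direct / bottlenecked:.1f}x")                 # 9.7x
```

The multiply-add count scales the same way, since both variants run at the same spatial resolution.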

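ResNet (2015, He et al.): Training Very Deep Networks with Residual Connections

Introduced by Kaiming He and colleagues in 2015, ResNet uses skip connections to give gradients a direct path through very deep networks, which made it possible to train a 152-layer network. The paper frames its motivation as the degradation problem: simply stacking more plain layers made the training error worse, not better.

Core idea: H(x) = F(x) + x

Instead of learning H(x) directly, each block learns the residual F(x) = H(x) - x. When the optimal mapping is close to the identity, driving F(x) toward zero is much easier than fitting an identity mapping with a stack of non-linear layers.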
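A toy experiment illustrates the point. In the sketch below, the block `TinyResidualBlock` and its zero initialization are assumptions made purely for this illustration and are not part of the ResNet implementation in this guide. When F(x) starts at zero, a stack of 50 blocks is exactly the identity, and the gradient reaches the input undiminished.

```python
import torch
import torch.nn as nn


class TinyResidualBlock(nn.Module):
    """Hypothetical minimal block: y = x + F(x), with F a small two-layer MLP."""

    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        # Zero the last layer so that F(x) = 0 at initialization
        nn.init.zeros_(self.f[-1].weight)
        nn.init.zeros_(self.f[-1].bias)

    def forward(self, x):
        return x + self.f(x)


torch.manual_seed(0)
deep = nn.Sequential(*[TinyResidualBlock(16) for _ in range(50)])

x = torch.randn(4, 16, requires_grad=True)
y = deep(x)
is_identity = torch.allclose(y, x)

# With F = 0 the Jacobian of every block is the identity,
# so d(sum(y))/dx is exactly 1 for every input element.
y.sum().backward()
grad_is_one = torch.allclose(x.grad, torch.ones_like(x))

print(is_identity, grad_is_one)  # True True
```

A plain stack of 50 layers without the `x +` term offers no such guarantee; its output and gradients depend entirely on the product of the layer weights.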

import torch
import torch.nn as nn
from typing import Optional, Type, List

class BasicBlock(nn.Module):
    """Basic block for ResNet-18/34"""
    expansion = 1

    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(BasicBlock, self).__init__()

        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        self.downsample = downsample

    def forward(self, x):
        identity = x

        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity  # residual connection
        out = self.relu(out)

        return out


class Bottleneck(nn.Module):
    """Bottleneck block for ResNet-50/101/152"""
    expansion = 4

    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(Bottleneck, self).__init__()

        # 1x1 (channel reduction)
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)

        # 3x3 (spatial processing; the stride is applied here)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # 1x1 (channel expansion: out_channels * 4)
        self.conv3 = nn.Conv2d(out_channels, out_channels * self.expansion,
                               kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)

        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample

    def forward(self, x):
        identity = x

        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out


class ResNet(nn.Module):
    """Full ResNet implementation"""

    def __init__(self, block, layers, num_classes=1000):
        super(ResNet, self).__init__()
        self.in_channels = 64

        # Stem
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        # Four stages
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)

        # Classifier
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        self._initialize_weights()

    def _make_layer(self, block, out_channels, blocks, stride=1):
        downsample = None
        if stride != 1 or self.in_channels != out_channels * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(self.in_channels, out_channels * block.expansion,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels * block.expansion)
            )

        layers = [block(self.in_channels, out_channels, stride, downsample)]
        self.in_channels = out_channels * block.expansion

        for _ in range(1, blocks):
            layers.append(block(self.in_channels, out_channels))

        return nn.Sequential(*layers)

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        x = self.maxpool(self.relu(self.bn1(self.conv1(x))))
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.avgpool(x)
        x = x.flatten(1)
        x = self.fc(x)
        return x


def resnet18(num_classes=1000):
    return ResNet(BasicBlock, [2, 2, 2, 2], num_classes)

def resnet50(num_classes=1000):
    return ResNet(Bottleneck, [3, 4, 6, 3], num_classes)


# Test (expected: ResNet-18 about 11.7M parameters, ResNet-50 about 25.6M)
for name, model_fn in [('ResNet-18', resnet18), ('ResNet-50', resnet50)]:
    model = model_fn()
    x = torch.randn(2, 3, 224, 224)
    out = model(x)
    params = sum(p.numel() for p in model.parameters())
    print(f"{name}: output={out.shape}, parameters={params:,}")

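DenseNet (2017, Huang et al.): Dense Connectivity

Within a dense block, DenseNet feeds every layer the concatenated feature maps of all preceding layers. A conventional feed-forward network with L layers has L connections (one between each layer and the next), whereas a dense block with L layers has L(L+1)/2.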

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseLayer(nn.Module):
    """A single DenseNet layer"""

    def __init__(self, in_channels, growth_rate, bn_size=4, drop_rate=0.0):
        super(DenseLayer, self).__init__()

        # Bottleneck: a 1x1 conv caps the channel count at bn_size * growth_rate
        self.norm1 = nn.BatchNorm2d(in_channels)
        self.relu1 = nn.ReLU(inplace=True)
        self.conv1 = nn.Conv2d(in_channels, bn_size * growth_rate, kernel_size=1, bias=False)

        # 3x3 conv
        self.norm2 = nn.BatchNorm2d(bn_size * growth_rate)
        self.relu2 = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(bn_size * growth_rate, growth_rate,
                               kernel_size=3, padding=1, bias=False)

        self.drop_rate = drop_rate

    def forward(self, x):
        if isinstance(x, torch.Tensor):
            prev_features = [x]
        else:
            prev_features = x

        concat_input = torch.cat(prev_features, dim=1)

        out = self.conv1(self.relu1(self.norm1(concat_input)))
        out = self.conv2(self.relu2(self.norm2(out)))

        if self.drop_rate > 0:
            out = F.dropout(out, p=self.drop_rate, training=self.training)

        return out
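The layer above only becomes "dense" once several layers are chained so that each one receives everything produced before it. The following is a minimal sketch of such a block. `SimpleDenseBlock` is a hypothetical, simplified stand-in (BN, ReLU, 3x3 conv, without the 1x1 bottleneck) so that the example stays self-contained.

```python
import torch
import torch.nn as nn


class SimpleDenseBlock(nn.Module):
    """Hypothetical simplified dense block: each layer is BN -> ReLU -> 3x3 conv."""

    def __init__(self, num_layers, in_channels, growth_rate):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # Layer i sees the block input plus the outputs of the i earlier layers
            channels = in_channels + i * growth_rate
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            new_features = layer(torch.cat(features, dim=1))
            features.append(new_features)  # every later layer can reuse this output
        return torch.cat(features, dim=1)


block = SimpleDenseBlock(num_layers=6, in_channels=64, growth_rate=32)
out = block(torch.randn(2, 64, 8, 8))
print(out.shape)  # torch.Size([2, 256, 8, 8]) since 64 + 6 * 32 = 256
```

Each layer adds only `growth_rate` new channels, which is why DenseNet stays parameter-efficient despite the many connections.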

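MobileNet (2017): Lightweight Networks for Edge Devices

MobileNet builds its network from depthwise separable convolutions, which cut computation substantially at only a modest cost in accuracy. That makes it well suited to mobile and edge deployment.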

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution"""

    def __init__(self, in_channels, out_channels, stride=1):
        super(DepthwiseSeparableConv, self).__init__()

        # Depthwise: filters each input channel independently (groups=in_channels)
        self.depthwise = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=3,
                      stride=stride, padding=1, groups=in_channels, bias=False),
            nn.BatchNorm2d(in_channels),
            nn.ReLU6(inplace=True)
        )

        # Pointwise: a 1x1 conv mixes information across channels
        self.pointwise = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU6(inplace=True)
        )

    def forward(self, x):
        x = self.depthwise(x)
        x = self.pointwise(x)
        return x


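# Parameter comparison (512 -> 512 channels, 3x3 kernel, no bias)
standard_conv_params = 3 * 3 * 512 * 512        # standard convolution
dw_sep_params = (3 * 3 * 512) + (512 * 512)     # depthwise + pointwise
print(f"Standard convolution: {standard_conv_params:,}")             # 2,359,296
print(f"Depthwise separable convolution: {dw_sep_params:,}")         # 266,752
print(f"Reduction: {(1 - dw_sep_params/standard_conv_params):.1%}")  # 88.7%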
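The same numbers can be read off real `nn.Conv2d` modules, and the ratio has a simple closed form, 1/C_out + 1/k^2. The sketch below checks both; `count_params` is a helper name chosen for this illustration.

```python
import torch.nn as nn


def count_params(module):
    return sum(p.numel() for p in module.parameters())


channels, k = 512, 3

standard = nn.Conv2d(channels, channels, kernel_size=k, padding=1, bias=False)
# groups=channels gives every input channel its own k x k filter
depthwise = nn.Conv2d(channels, channels, kernel_size=k, padding=1,
                      groups=channels, bias=False)
pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

standard_params = count_params(standard)
separable_params = count_params(depthwise) + count_params(pointwise)

ratio = separable_params / standard_params
closed_form = 1 / channels + 1 / k ** 2  # 1/C_out + 1/k^2

print(f"standard:  {standard_params:,}")   # 2,359,296
print(f"separable: {separable_params:,}")  # 266,752
print(f"ratio: {ratio:.4f}, closed form: {closed_form:.4f}")  # 0.1131 in both cases
```

For wide layers the 1/k^2 term dominates, so a 3x3 depthwise separable convolution costs roughly one ninth of a standard one.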

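EfficientNet (2019, Tan and Le): Compound Scaling

EfficientNet scales network width, depth, and input resolution together using a single compound coefficient, and achieved the best accuracy-efficiency trade-off of its time.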
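As a rough sketch of the idea: the paper scales depth by alpha^phi, width by beta^phi, and resolution by gamma^phi, with alpha = 1.2, beta = 1.1, gamma = 1.15 found by a small grid search under the constraint alpha * beta^2 * gamma^2 ≈ 2. FLOPs grow roughly linearly with depth and quadratically with width and resolution, so each step of phi roughly doubles the cost. The function name `compound_scale` is chosen for this illustration, and the coefficients actually shipped for B1 to B7 are rounded and do not follow these powers exactly.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases (Tan and Le, 2019)


def compound_scale(phi):
    """Multipliers for depth, width, resolution, and the resulting FLOPs factor."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs scale ~ depth * width^2 * resolution^2
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPs x{f:.2f}")
```

At phi = 1 the FLOPs factor is about 1.92, close to the intended doubling.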

# EfficientNet scaling coefficients per variant:
# (width_coeff, depth_coeff, resolution, dropout_rate)
efficientnet_params = {
    'b0': (1.0, 1.0, 224, 0.2),
    'b1': (1.0, 1.1, 240, 0.2),
    'b2': (1.1, 1.2, 260, 0.3),
    'b3': (1.2, 1.4, 300, 0.3),
    'b4': (1.4, 1.8, 380, 0.4),
    'b5': (1.6, 2.2, 456, 0.4),
    'b6': (1.8, 2.6, 528, 0.5),
    'b7': (2.0, 3.1, 600, 0.5),
}

print("EfficientNet scaling parameters:")
for version, (w, d, r, drop) in efficientnet_params.items():
    print(f"  {version.upper()}: width={w:.1f}, depth={d:.1f}, resolution={r}, dropout={drop}")

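ConvNeXt (2022, Liu et al.): A ConvNet for the 2020s

ConvNeXt modernizes a ResNet with design choices borrowed from Vision Transformers (large kernels, LayerNorm, GELU, inverted bottlenecks) and reaches accuracy on par with Transformers of similar size.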

import torch
import torch.nn as nn

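class ConvNeXtBlock(nn.Module):
    """ConvNeXt block"""

    def __init__(self, dim, layer_scale_init_value=1e-6):
        super(ConvNeXtBlock, self).__init__()

        # Depthwise convolution with a large (7x7) kernel
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)

        # LayerNorm over the channel dimension (applied in channels-last layout)
        self.norm = nn.LayerNorm(dim, eps=1e-6)

        # Inverted bottleneck (4x channel expansion); Linear on channels-last acts as a 1x1 conv
        self.pwconv1 = nn.Linear(dim, 4 * dim)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

        # Layer scale: a learnable per-channel factor on the residual branch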
        self.gamma = nn.Parameter(
            layer_scale_init_value * torch.ones(dim),
            requires_grad=True
        ) if layer_scale_init_value > 0 else None

    def forward(self, x):
        identity = x

        x = self.dwconv(x)
        # (N, C, H, W) -> (N, H, W, C) for LayerNorm
        x = x.permute(0, 2, 3, 1)
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)

        if self.gamma is not None:
            x = self.gamma * x

        # (N, H, W, C) -> (N, C, H, W)
        x = x.permute(0, 3, 1, 2)

        return identity + x
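The block implements its pointwise convolutions as `nn.Linear` layers applied in channels-last layout. The sketch below shows that this is numerically the same operation as a 1x1 convolution once the weights are shared. The tensor sizes are arbitrary choices for the illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 8

linear = nn.Linear(dim, 4 * dim)
conv1x1 = nn.Conv2d(dim, 4 * dim, kernel_size=1)

# Copy the Linear parameters into the 1x1 conv: (out, in) -> (out, in, 1, 1)
with torch.no_grad():
    conv1x1.weight.copy_(linear.weight.view(4 * dim, dim, 1, 1))
    conv1x1.bias.copy_(linear.bias)

x = torch.randn(2, dim, 5, 5)  # (N, C, H, W)

# Linear acts on the last dimension, so move channels last and back again
via_linear = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
via_conv = conv1x1(x)

same = torch.allclose(via_linear, via_conv, atol=1e-5)
print(via_linear.shape, same)
```

Keeping the tensor in channels-last layout between the LayerNorm and the two pointwise layers avoids extra permutes, which is presumably why the block is written this way.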

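3. Vision Transformer (ViT)

ViT splits an image into fixed-size patches, treats each patch as a token, and runs a standard Transformer encoder over the resulting token sequence. It is a fundamentally different paradigm from a conventional CNN.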
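Patch tokenization is often implemented as a single convolution whose kernel size and stride both equal the patch size. The sketch below, using small made-up sizes (32x32 image, 8x8 patches, 64-dimensional embedding), shows that this is the same as cutting out the patches explicitly, flattening each one, and applying one shared linear projection.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, C, H, W, p, embed_dim = 2, 3, 32, 32, 8, 64
x = torch.randn(B, C, H, W)

# Route 1: a convolution whose kernel size and stride both equal the patch size
proj = nn.Conv2d(C, embed_dim, kernel_size=p, stride=p)
tokens_conv = proj(x).flatten(2).transpose(1, 2)  # (B, 16, 64)

# Route 2: cut out the patches, flatten each one, apply the same weights
patches = x.unfold(2, p, p).unfold(3, p, p)       # (B, C, H/p, W/p, p, p)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)  # (B, 16, 192)
weight = proj.weight.reshape(embed_dim, -1)       # (64, 192)
tokens_linear = patches @ weight.T + proj.bias    # (B, 16, 64)

num_tokens = (H // p) * (W // p)
matches = torch.allclose(tokens_conv, tokens_linear, atol=1e-4)
print(num_tokens, tokens_conv.shape, matches)
```

For the standard 224x224 input with 16x16 patches, the same arithmetic gives (224 / 16)^2 = 196 tokens.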

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Convert an image into patch embeddings"""

    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super(PatchEmbedding, self).__init__()
        self.num_patches = (image_size // patch_size) ** 2

        # A single convolution performs both patch extraction and embedding
        self.projection = nn.Conv2d(
            in_channels, embed_dim,
            kernel_size=patch_size, stride=patch_size
        )

    def forward(self, x):
        x = self.projection(x)   # (B, embed_dim, H/p, W/p)
        x = x.flatten(2)         # (B, embed_dim, num_patches)
        x = x.transpose(1, 2)    # (B, num_patches, embed_dim)
        return x


class VisionTransformer(nn.Module):
    """Vision Transformer (ViT)"""

    def __init__(self, image_size=224, patch_size=16, in_channels=3,
                 num_classes=1000, embed_dim=768, depth=12, num_heads=12,
                 mlp_ratio=4.0, dropout=0.0):
        super(VisionTransformer, self).__init__()

        self.patch_embed = PatchEmbedding(image_size, patch_size, in_channels, embed_dim)
        num_patches = self.patch_embed.num_patches

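        # CLS token + position embedding
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embedding = nn.Parameter(
            torch.zeros(1, num_patches + 1, embed_dim)
        )
        self.pos_dropout = nn.Dropout(dropout)

        # Transformer encoder: `depth` pre-norm blocks of self-attention + MLP
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=int(embed_dim * mlp_ratio),
            dropout=dropout, activation='gelu',
            batch_first=True, norm_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)

        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

        self._init_weights()

    def _init_weights(self):
        nn.init.trunc_normal_(self.pos_embedding, std=0.02)
        nn.init.trunc_normal_(self.cls_token, std=0.02)
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.trunc_normal_(m.weight, std=0.02)
                if m.bias is not None:
                    nn.init.zeros_(m.bias)

    def forward(self, x):
        B = x.shape[0]

        x = self.patch_embed(x)  # (B, num_patches, embed_dim)

        cls_tokens = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls_tokens, x], dim=1)  # prepend the CLS token

        x = x + self.pos_embedding
        x = self.pos_dropout(x)

        x = self.encoder(x)  # (B, num_patches + 1, embed_dim)
        x = self.norm(x)

        cls_output = x[:, 0]  # classify from the CLS token
        return self.head(cls_output)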


def vit_base(num_classes=1000):
    return VisionTransformer(
        image_size=224, patch_size=16, embed_dim=768, depth=12,
        num_heads=12, num_classes=num_classes
    )


model = vit_base()
x = torch.randn(2, 3, 224, 224)
out = model(x)
params = sum(p.numel() for p in model.parameters())
print(f"ViT-Base output: {out.shape}, parameters: {params:,}")

4. Object Detection: YOLO

import torch
import torch.nn as nn

class YOLOHead(nn.Module):
    """Simplified YOLO detection head"""

    def __init__(self, in_channels, num_anchors, num_classes):
        super(YOLOHead, self).__init__()
        self.num_anchors = num_anchors
        self.num_classes = num_classes

        # Per anchor: (x, y, w, h, objectness) + num_classes class scores
        out_channels = num_anchors * (5 + num_classes)

        self.head = nn.Sequential(
            nn.Conv2d(in_channels, in_channels * 2, kernel_size=3, padding=1),
            nn.BatchNorm2d(in_channels * 2),
            nn.LeakyReLU(0.1),
            nn.Conv2d(in_channels * 2, out_channels, kernel_size=1)
        )

    def forward(self, x):
        out = self.head(x)
        B, C, H, W = out.shape
        out = out.reshape(B, self.num_anchors, 5 + self.num_classes, H, W)
        out = out.permute(0, 1, 3, 4, 2).contiguous()
        return out
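The head returns raw values with layout (B, anchors, H, W, 5 + classes); they still have to be decoded into boxes. The sketch below uses the YOLOv2/v3-style parameterization (sigmoid offsets inside each grid cell, exponential scaling of anchor sizes). This is an assumption, since the section does not say which YOLO version the head targets. The function `decode_yolo` and the anchor values are hypothetical, with anchors given in grid-cell units.

```python
import torch


def decode_yolo(raw, anchors):
    """
    raw:     (B, A, H, W, 5 + C) raw head output
    anchors: (A, 2) anchor widths and heights in grid-cell units (assumed convention)
    Returns boxes as (cx, cy, w, h, objectness, class scores), normalized to [0, 1].
    """
    B, A, H, W, _ = raw.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).to(raw.dtype)  # (H, W, 2) cell indices as (x, y)
    scale = torch.tensor([W, H], dtype=raw.dtype)

    # Centers: sigmoid keeps the offset inside the predicting cell
    xy = (torch.sigmoid(raw[..., 0:2]) + grid) / scale
    # Sizes: anchors are rescaled by exp(prediction)
    wh = torch.exp(raw[..., 2:4]) * anchors.reshape(1, A, 1, 1, 2) / scale
    obj = torch.sigmoid(raw[..., 4:5])
    cls = torch.sigmoid(raw[..., 5:])
    return torch.cat([xy, wh, obj, cls], dim=-1)


# All-zero predictions: each box sits at its cell center with the anchor's size
raw = torch.zeros(1, 3, 4, 4, 7)  # 3 anchors, 4x4 grid, 2 classes
anchors = torch.tensor([[1.0, 1.0], [2.0, 2.0], [1.0, 3.0]])
boxes = decode_yolo(raw, anchors)

print(boxes.shape)             # torch.Size([1, 3, 4, 4, 7])
print(boxes[0, 0, 2, 1, :2])   # tensor([0.3750, 0.6250]): cell x=1, y=2 on a 4x4 grid
print(boxes[0, 1, 0, 0, 2:4])  # tensor([0.5000, 0.5000]): a 2x2-cell anchor on a 4x4 grid
```

A full detector would follow this with confidence thresholding and non-maximum suppression.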

5. Semantic Segmentation: U-Net

import torch
import torch.nn as nn
import torch.nn.functional as F

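class UNet(nn.Module):
    """U-Net for medical image segmentation"""

    def __init__(self, in_channels=1, num_classes=2, features=(64, 128, 256, 512)):
        super(UNet, self).__init__()

        self.encoders = nn.ModuleList()
        self.upconvs = nn.ModuleList()
        self.decoders = nn.ModuleList()
        self.pool = nn.MaxPool2d(2, 2)

        # Encoder path
        for feature in features:
            self.encoders.append(self._double_conv(in_channels, feature))
            in_channels = feature

        # Bottleneck
        self.bottleneck = self._double_conv(features[-1], features[-1] * 2)

        # Decoder path: upsample, concatenate the matching skip connection, convolve
        for feature in reversed(features):
            self.upconvs.append(nn.ConvTranspose2d(feature * 2, feature, kernel_size=2, stride=2))
            self.decoders.append(self._double_conv(feature * 2, feature))

        self.final_conv = nn.Conv2d(features[0], num_classes, kernel_size=1)

    @staticmethod
    def _double_conv(in_channels, out_channels):
        """Two rounds of 3x3 conv -> BatchNorm -> ReLU"""
        return nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        skip_connections = []

        # Encoder
        for encoder in self.encoders:
            x = encoder(x)
            skip_connections.append(x)
            x = self.pool(x)

        x = self.bottleneck(x)

        # Decoder (deepest skip connection first)
        for upconv, decoder, skip in zip(self.upconvs, self.decoders,
                                         reversed(skip_connections)):
            x = upconv(x)
            if x.shape[-2:] != skip.shape[-2:]:
                # Odd input sizes lose a pixel when pooling; resize to match the skip
                x = F.interpolate(x, size=skip.shape[-2:], mode='bilinear', align_corners=False)
            x = decoder(torch.cat([skip, x], dim=1))

        return self.final_conv(x)


model = UNet(in_channels=1, num_classes=2)
x = torch.randn(1, 1, 64, 64)
out = model(x)
print(f"U-Net output: {out.shape}")  # (1, 2, 64, 64): one score map per class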

6. Transfer Learning in Practice

Using torchvision.models

import torch
import torch.nn as nn
import torchvision.models as models
import torch.optim as optim
from tqdm import tqdm

# Load pretrained models (the `weights=` enum API requires torchvision >= 0.13)
model_resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model_efficientnet = models.efficientnet_b4(weights=models.EfficientNet_B4_Weights.DEFAULT)
model_vit = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)


def feature_extraction(num_classes, freeze=True):
    """Feature extraction: freeze the backbone and train only the classifier"""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

    if freeze:
        for param in model.parameters():
            param.requires_grad = False

    # Replace the classifier head
    in_features = model.fc.in_features
    model.fc = nn.Sequential(
        nn.Dropout(0.5),
        nn.Linear(in_features, 256),
        nn.ReLU(),
        nn.Linear(256, num_classes)
    )

    for param in model.fc.parameters():
        param.requires_grad = True

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable:,} / {total:,} ({trainable/total:.1%})")

    return model
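Once the new head has converged, a common next step is to unfreeze the backbone and fine-tune everything, giving the pretrained layers a much smaller learning rate than the head. The sketch below illustrates that pattern with a tiny stand-in model so that it runs without downloading weights. The model, the helper `set_trainable`, and the learning rates are all assumptions chosen for this illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Stand-in for a pretrained network: a "backbone" plus a freshly added head
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
head = nn.Linear(8, 5)
model = nn.Sequential(backbone, head)


def set_trainable(module, flag):
    for param in module.parameters():
        param.requires_grad = flag


# Fine-tuning stage: unfreeze the backbone, but update it far more gently than the head
set_trainable(backbone, True)
optimizer = optim.AdamW(
    [
        {"params": backbone.parameters(), "lr": 1e-5},  # pretrained layers: small steps
        {"params": head.parameters(), "lr": 1e-3},      # new head: larger steps
    ],
    weight_decay=1e-2,
)

# One training step on random data, just to show the mechanics
criterion = nn.CrossEntropyLoss()
images = torch.randn(4, 3, 32, 32)
labels = torch.randint(0, 5, (4,))

loss = criterion(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

group_lrs = [group["lr"] for group in optimizer.param_groups]
print(group_lrs, f"loss={loss.item():.4f}")
```

With a real torchvision ResNet, the first group would hold every parameter except `model.fc` and the second group `model.fc.parameters()`.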

Architecture Performance Comparison

Model	Year	Top-1 accuracy (ImageNet unless noted)	Parameters	FLOPs
LeNet-5	1998	~99% (MNIST)	60K	-
AlexNet	2012	56.5%	61M	724M
VGG-16	2014	71.6%	138M	15.5G
GoogLeNet	2014	68.7%	6.8M	1.5G
ResNet-50	2015	75.3%	25M	4.1G
DenseNet-121	2017	74.4%	8M	2.9G
MobileNetV2	2018	71.8%	3.4M	300M
EfficientNet-B0	2019	77.1%	5.3M	390M
ViT-B/16	2020	81.8%	86M	17.6G
ConvNeXt-T	2022	82.1%	28M	4.5G

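Summary

CNN architectures have come a long way:

LeNet (1998): the first practical CNN; established the basic conv-pool-classifier structure
AlexNet (2012): the deep learning revival; popularized ReLU and Dropout
VGGNet (2014): the power of stacked 3x3 convolutions; showed that depth matters
ResNet (2015): residual connections made networks with 100+ layers trainable
DenseNet (2017): dense connectivity maximizes feature reuse
MobileNet (2017): depthwise separable convolutions enabled mobile deployment
EfficientNet (2019): compound scaling delivered the best efficiency of its time
ViT (2020): a new paradigm that treats an image as a sequence of patches
ConvNeXt (2022): a modern CNN design inspired by Transformers

In practice, the recommended starting point is a pretrained torchvision model with transfer learning applied, so that you can adapt quickly to the target task.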

References

PyTorch Vision Models
ResNet paper: He et al., "Deep Residual Learning for Image Recognition" (arXiv:1512.03385)
EfficientNet paper: Tan and Le, "EfficientNet: Rethinking Model Scaling" (arXiv:1905.11946)
ViT paper: Dosovitskiy et al., "An Image is Worth 16x16 Words" (arXiv:2010.11929)
ConvNeXt paper: Liu et al., "A ConvNet for the 2020s" (arXiv:2201.03545)