Split View: PyTorch 완전 정복 가이드: Zero to Hero — 텐서부터 분산 학습까지

PyTorch 완전 정복 가이드: Zero to Hero — 텐서부터 분산 학습까지

들어가며
1. 환경 설정
- PyTorch 설치
- GPU 사용 가능 여부 확인
2. 텐서(Tensor) 기초
- 텐서 생성
- 텐서 속성 및 타입 변환
- 텐서 형태 변환
- 텐서 연산
- 브로드캐스팅
- 인덱싱과 슬라이싱
3. 자동 미분(Autograd)
- requires_grad와 연산 그래프
- 다차원 텐서에서의 역전파
- 그래디언트 제어
- 고계 미분
4. nn.Module — 신경망 구축의 기반
- Sequential, ModuleList, ModuleDict
5. 선형 회귀 구현
6. 다층 퍼셉트론(MLP) — MNIST 분류
7. 합성곱 신경망(CNN) — CIFAR-10 분류
8. 순환 신경망(RNN/LSTM) — 시계열 처리
9. Transformer 구현 — Multi-head Attention from Scratch
10. 데이터 로딩 — Dataset, DataLoader
11. 옵티마이저 — SGD, Adam, AdamW 비교
12. 학습률 스케줄러
13. 정규화 기법 — Dropout, BatchNorm, LayerNorm
14. 전이학습(Transfer Learning)
15. 모델 저장과 로딩
16. TorchScript와 모델 배포
17. 분산 학습(DDP) — DistributedDataParallel
- torchrun으로 실행
- DataParallel vs DistributedDataParallel
18. 고급 기법 모음
- 혼합 정밀도 학습 (Mixed Precision)
- 그래디언트 클리핑
- 재현성(Reproducibility) 설정
마치며
- 참고 자료

들어가며

딥러닝 프레임워크의 양대 산맥이었던 TensorFlow와 PyTorch 중, 연구자와 엔지니어 모두에게 사랑받는 프레임워크는 단연 PyTorch입니다. 2016년 Facebook AI Research(현 Meta AI)가 공개한 이후 PyTorch는 학술 논문 구현의 표준이 되었고, 현재는 산업 현장에서도 TensorFlow를 앞서는 점유율을 기록하고 있습니다.

이 가이드는 Python 기초 지식이 있는 독자를 대상으로, PyTorch를 처음 접하는 단계부터 분산 학습까지 체계적으로 다룹니다. 각 섹션에는 실제로 실행 가능한 코드 예제와 공식 문서 링크를 포함하여, 읽고 바로 실습할 수 있도록 구성했습니다.

공식 문서: https://pytorch.org/docs/stable/index.html 공식 튜토리얼: https://pytorch.org/tutorials/

1. 환경 설정

PyTorch 설치

PyTorch는 pip 또는 conda로 설치합니다. GPU를 사용하려면 CUDA 버전에 맞는 패키지를 선택해야 합니다.

pip으로 설치 (CUDA 12.1 기준):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

conda로 설치:

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

CPU 전용 설치:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

GPU 사용 가능 여부 확인

import torch

# PyTorch 버전 확인
print(f"PyTorch 버전: {torch.__version__}")

# CUDA 사용 가능 여부 확인
print(f"CUDA 사용 가능: {torch.cuda.is_available()}")

# GPU 개수 확인
if torch.cuda.is_available():
    print(f"GPU 개수: {torch.cuda.device_count()}")
    print(f"현재 GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU 메모리: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Apple Silicon (M1/M2/M3) MPS 확인
print(f"MPS 사용 가능: {torch.backends.mps.is_available()}")

# 사용할 디바이스 설정 (자동 선택)
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"사용 디바이스: {device}")

설치와 기기 확인이 완료됐다면 이제 PyTorch의 핵심 자료구조인 텐서를 살펴봅시다.

2. 텐서(Tensor) 기초

텐서는 PyTorch의 핵심 자료구조입니다. NumPy의 ndarray와 유사하지만 GPU에서 연산이 가능하고 자동 미분을 지원한다는 점에서 차별화됩니다.

공식 문서: https://pytorch.org/docs/stable/tensors.html

텐서 생성

import torch
import numpy as np

# 직접 데이터로부터 생성
t1 = torch.tensor([1, 2, 3, 4, 5])
print(f"1D 텐서: {t1}, shape: {t1.shape}, dtype: {t1.dtype}")

# 2D 텐서 (행렬)
t2 = torch.tensor([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
print(f"2D 텐서:\n{t2}, shape: {t2.shape}")

# 특수 텐서 생성
zeros = torch.zeros(3, 4)          # 모두 0
ones = torch.ones(2, 3)            # 모두 1
rand = torch.rand(3, 3)            # 0~1 균등분포
randn = torch.randn(3, 3)          # 표준정규분포
eye = torch.eye(4)                  # 단위행렬
arange = torch.arange(0, 10, 2)    # [0, 2, 4, 6, 8]
linspace = torch.linspace(0, 1, 5) # 균등 간격 5개

print(f"zeros:\n{zeros}")
print(f"randn:\n{randn}")

# 기존 텐서와 같은 크기로 생성
t3 = torch.zeros_like(t2)
t4 = torch.ones_like(t2)
t5 = torch.rand_like(t2)

# NumPy 배열로부터 생성 (메모리 공유)
np_arr = np.array([1.0, 2.0, 3.0])
t_from_np = torch.from_numpy(np_arr)
print(f"NumPy로부터: {t_from_np}")

# 텐서를 NumPy로 변환 (CPU에서만 가능)
np_from_t = t1.numpy()

텐서 속성 및 타입 변환

t = torch.rand(3, 4, 5)

# 기본 속성
print(f"shape: {t.shape}")       # torch.Size([3, 4, 5])
print(f"ndim: {t.ndim}")         # 3
print(f"dtype: {t.dtype}")       # torch.float32
print(f"device: {t.device}")     # cpu
print(f"numel: {t.numel()}")     # 60 (총 원소 수)

# 데이터 타입 변환
t_int = t.to(torch.int32)
t_long = t.long()           # torch.int64
t_float = t.float()         # torch.float32
t_double = t.double()       # torch.float64
t_half = t.half()           # torch.float16

# GPU로 이동
if torch.cuda.is_available():
    t_gpu = t.to("cuda")
    t_gpu2 = t.cuda()        # 동일한 결과
    t_back = t_gpu.cpu()     # 다시 CPU로

텐서 형태 변환

t = torch.arange(24)  # 0~23까지 1D 텐서

# reshape: 원소 수가 같으면 모든 형태 가능
t_2d = t.reshape(4, 6)
t_3d = t.reshape(2, 3, 4)
t_auto = t.reshape(6, -1)  # -1은 자동 계산 (6x4)

# view: reshape와 유사하지만 메모리 연속성 필요
t_view = t.view(3, 8)

# squeeze/unsqueeze: 차원 제거/추가
t = torch.zeros(1, 3, 1, 4)
print(f"원본 shape: {t.shape}")  # [1, 3, 1, 4]

t_sq = t.squeeze()       # 크기 1인 차원 제거 → [3, 4]
t_sq1 = t.squeeze(0)     # 0번 차원만 제거 → [3, 1, 4]
t_unsq = t_sq.unsqueeze(0)  # 0번 위치에 차원 추가 → [1, 3, 4]

# transpose/permute: 차원 순서 변경
t = torch.rand(2, 3, 4)
t_T = t.transpose(0, 1)    # [3, 2, 4]
t_perm = t.permute(2, 0, 1)  # [4, 2, 3]

# contiguous: permute 후 연속 메모리 보장
t_cont = t_perm.contiguous()

print(f"squeeze: {t_sq.shape}")
print(f"permute: {t_perm.shape}")

텐서 연산

a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([[5.0, 6.0], [7.0, 8.0]])

# 사칙연산 (원소별)
print(a + b)     # 또는 torch.add(a, b)
print(a - b)     # 또는 torch.sub(a, b)
print(a * b)     # 원소별 곱 (Hadamard product)
print(a / b)     # 원소별 나눗셈
print(a ** 2)    # 원소별 거듭제곱

# 행렬 곱
matmul = a @ b          # 또는 torch.matmul(a, b)
mm = torch.mm(a, b)     # 2D 전용

print(f"행렬 곱:\n{matmul}")

# 집계 연산
t = torch.rand(3, 4)
print(f"합계: {t.sum()}")
print(f"평균: {t.mean()}")
print(f"최대: {t.max()}")
print(f"최소: {t.min()}")
print(f"표준편차: {t.std()}")

# 축 지정 집계
print(f"행별 합: {t.sum(dim=0)}")  # 각 열의 합 (행 방향)
print(f"열별 합: {t.sum(dim=1)}")  # 각 행의 합 (열 방향)
print(f"keepdim:\n{t.sum(dim=1, keepdim=True)}")

# argmax/argmin
print(f"최대값 인덱스: {t.argmax()}")
print(f"행별 최대값 인덱스: {t.argmax(dim=1)}")

브로드캐스팅

NumPy와 동일한 브로드캐스팅 규칙을 따릅니다. 크기가 다른 텐서 간 연산 시 자동으로 확장됩니다.

# 브로드캐스팅 예시
a = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])   # shape: [2, 3]
b = torch.tensor([10, 20, 30])  # shape: [3]

# b가 [2, 3]으로 자동 확장되어 연산
print(a + b)
# tensor([[11, 22, 33],
#         [14, 25, 36]])

# 스칼라 연산도 브로드캐스팅
print(a * 2)   # 모든 원소에 2 곱하기
print(a + 100) # 모든 원소에 100 더하기

# 열 벡터 + 행 벡터
col = torch.tensor([[1], [2], [3]])  # shape: [3, 1]
row = torch.tensor([10, 20, 30])      # shape: [3]
print(col + row)  # shape: [3, 3] — 외적과 유사

인덱싱과 슬라이싱

t = torch.arange(24).reshape(2, 3, 4).float()

# 기본 인덱싱
print(t[0])        # 첫 번째 행렬 (shape: [3, 4])
print(t[0, 1])     # [3, 4] 행렬의 두 번째 행 (shape: [4])
print(t[0, 1, 2])  # 스칼라

# 슬라이싱
print(t[:, 1:, :2])  # 전체, 1번 이후, 처음 2개 열

# 고급 인덱싱 (Fancy indexing)
indices = torch.tensor([0, 2])
print(t[:, indices, :])  # 0번, 2번 행만 선택

# 조건부 인덱싱 (Boolean masking)
mask = t > 10
print(t[mask])  # 10보다 큰 원소만 추출 (1D 텐서 반환)

# where: 조건에 따라 두 텐서에서 선택
a = torch.tensor([1.0, 2.0, 3.0, 4.0])
b = torch.tensor([10.0, 20.0, 30.0, 40.0])
condition = a > 2
result = torch.where(condition, b, a)
print(result)  # tensor([ 1.,  2., 30., 40.])

3. 자동 미분(Autograd)

PyTorch의 핵심 기능 중 하나인 Autograd는 연산 그래프를 자동으로 구축하고, 역전파(backpropagation)를 통해 그래디언트를 계산합니다.

공식 문서: https://pytorch.org/docs/stable/autograd.html

requires_grad와 연산 그래프

import torch

# requires_grad=True로 텐서 생성 → 연산 추적 시작
x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)

# 연산 수행 → 연산 그래프 구축
z = x ** 2 + 2 * x * y + y ** 2  # (x + y)^2

print(f"z = {z}")  # z = 49.0

# 역전파 수행
z.backward()

# 그래디언트 확인
# dz/dx = 2x + 2y = 2*3 + 2*4 = 14
print(f"dz/dx = {x.grad}")  # 14.0

# dz/dy = 2x + 2y = 14
print(f"dz/dy = {y.grad}")  # 14.0

다차원 텐서에서의 역전파

# 벡터 함수의 그래디언트
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2  # [1, 4, 9]
z = y.sum()  # 스칼라로 축소

z.backward()
print(f"x.grad: {x.grad}")  # [2, 4, 6] (dy/dx = 2x)

# gradient 인자: 비스칼라 backward
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2  # [1, 4, 9]

# y.backward()는 에러 → y는 스칼라가 아님
# gradient 인자로 가중합 계산
grad_output = torch.tensor([1.0, 1.0, 1.0])  # 각 원소의 가중치
y.backward(gradient=grad_output)
print(f"x.grad: {x.grad}")  # [2, 4, 6]

그래디언트 제어

# 그래디언트 누적 문제 — 초기화 필요
x = torch.tensor(2.0, requires_grad=True)

for i in range(3):
    y = x ** 2
    y.backward()
    print(f"iteration {i}: x.grad = {x.grad}")
    # 매번 초기화하지 않으면 누적됨
    x.grad.zero_()  # in-place 초기화

# no_grad: 추론 시 그래디언트 비활성화 (메모리 절약)
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

with torch.no_grad():
    y = x ** 2  # 연산 그래프 미생성
    print(f"y.requires_grad: {y.requires_grad}")  # False

# detach: 연산 그래프에서 분리
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * 2
z = y.detach()  # 그래디언트 추적 분리
print(f"z.requires_grad: {z.requires_grad}")  # False

# 파라미터 일부 고정 (Transfer Learning 시 유용)
for param in model.parameters():
    param.requires_grad = False

고계 미분

# 2차 미분 예시
x = torch.tensor(3.0, requires_grad=True)
y = x ** 4

# 1차 미분: dy/dx = 4x^3
dy_dx = torch.autograd.grad(y, x, create_graph=True)[0]
print(f"1차 미분: {dy_dx}")  # 108

# 2차 미분: d2y/dx2 = 12x^2
d2y_dx2 = torch.autograd.grad(dy_dx, x)[0]
print(f"2차 미분: {d2y_dx2}")  # 108

4. nn.Module — 신경망 구축의 기반

torch.nn.Module은 모든 PyTorch 모델의 기반 클래스입니다. 레이어, 활성화 함수, 전체 모델 모두 이 클래스를 상속합니다.

공식 문서: https://pytorch.org/docs/stable/nn.html

import torch
import torch.nn as nn

# 간단한 모델 정의
class SimpleModel(nn.Module):
    def __init__(self, in_features, hidden_size, out_features):
        super().__init__()
        # 레이어 정의 (파라미터 자동 등록)
        self.fc1 = nn.Linear(in_features, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, out_features)
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, x):
        # 순전파 정의
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

# 모델 인스턴스 생성
model = SimpleModel(784, 256, 10)
print(model)

# 파라미터 수 확인
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"전체 파라미터: {total_params:,}")
print(f"학습 가능 파라미터: {trainable_params:,}")

# 파라미터 접근
for name, param in model.named_parameters():
    print(f"{name}: {param.shape}")

# 순전파 실행
x = torch.randn(32, 784)  # batch_size=32, features=784
output = model(x)
print(f"출력 shape: {output.shape}")  # [32, 10]

Sequential, ModuleList, ModuleDict

# Sequential: 순차적 레이어 구성
seq_model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

# ModuleList: 리스트로 레이어 관리
class ResidualBlock(nn.Module):
    def __init__(self, num_blocks, hidden_size):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(hidden_size, hidden_size)
            for _ in range(num_blocks)
        ])
        self.relu = nn.ReLU()

    def forward(self, x):
        for layer in self.layers:
            x = self.relu(layer(x)) + x  # 잔차 연결
        return x

# ModuleDict: 딕셔너리로 레이어 관리
class MultiTaskModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(784, 256)
        self.heads = nn.ModuleDict({
            'classification': nn.Linear(256, 10),
            'regression': nn.Linear(256, 1)
        })

    def forward(self, x, task='classification'):
        features = torch.relu(self.backbone(x))
        return self.heads[task](features)

5. 선형 회귀 구현

선형 회귀는 딥러닝의 가장 기본적인 모델입니다. 처음부터 구현하며 PyTorch의 학습 루프를 이해합니다.

import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

# 데이터 생성
torch.manual_seed(42)
n_samples = 200

# y = 3x + 2 + 노이즈
X = torch.linspace(-5, 5, n_samples).unsqueeze(1)  # [200, 1]
y_true = 3 * X + 2
y = y_true + torch.randn_like(y_true) * 0.5       # 노이즈 추가

# 모델 정의
class LinearRegression(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

model = LinearRegression()

# 손실 함수와 옵티마이저
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# 학습 루프
n_epochs = 1000
losses = []

for epoch in range(n_epochs):
    # 1. 순전파 (Forward Pass)
    y_pred = model(X)

    # 2. 손실 계산
    loss = criterion(y_pred, y)
    losses.append(loss.item())

    # 3. 그래디언트 초기화 (중요!)
    optimizer.zero_grad()

    # 4. 역전파 (Backward Pass)
    loss.backward()

    # 5. 파라미터 업데이트
    optimizer.step()

    if (epoch + 1) % 200 == 0:
        w = model.linear.weight.item()
        b = model.linear.bias.item()
        print(f"Epoch {epoch+1}: Loss={loss.item():.4f}, w={w:.4f}, b={b:.4f}")

# 결과 확인
print(f"\n학습된 가중치: {model.linear.weight.item():.4f} (정답: 3.0)")
print(f"학습된 편향: {model.linear.bias.item():.4f} (정답: 2.0)")

6. 다층 퍼셉트론(MLP) — MNIST 분류

MNIST 손글씨 숫자 데이터셋으로 완전한 분류 모델을 구축합니다.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# 하이퍼파라미터
BATCH_SIZE = 64
LEARNING_RATE = 0.001
N_EPOCHS = 10
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 데이터 전처리 및 로딩
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST 평균, 표준편차
])

train_dataset = datasets.MNIST(root='./data', train=True,
                                download=True, transform=transform)
test_dataset = datasets.MNIST(root='./data', train=False,
                               transform=transform)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
                          shuffle=True, num_workers=2)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                         shuffle=False, num_workers=2)

# MLP 모델 정의
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Flatten(),              # 28x28 → 784
            nn.Linear(784, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.network(x)

model = MLP().to(DEVICE)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

# 학습 함수
def train_epoch(model, loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0

    for batch_idx, (data, target) in enumerate(loader):
        data, target = data.to(device), target.to(device)

        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        pred = output.argmax(dim=1)
        correct += pred.eq(target).sum().item()
        total += target.size(0)

    return total_loss / len(loader), 100.0 * correct / total

# 평가 함수
def evaluate(model, loader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            total_loss += criterion(output, target).item()
            pred = output.argmax(dim=1)
            correct += pred.eq(target).sum().item()
            total += target.size(0)

    return total_loss / len(loader), 100.0 * correct / total

# 학습 실행
for epoch in range(N_EPOCHS):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, DEVICE)
    test_loss, test_acc = evaluate(model, test_loader, criterion, DEVICE)
    print(f"Epoch {epoch+1}/{N_EPOCHS} | "
          f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}% | "
          f"Test Loss: {test_loss:.4f}, Test Acc: {test_acc:.2f}%")

7. 합성곱 신경망(CNN) — CIFAR-10 분류

이미지 분류의 핵심인 CNN을 구현하고, CIFAR-10 데이터셋으로 학습합니다.

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# CIFAR-10 데이터 준비
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2023, 0.1994, 0.2010))
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2023, 0.1994, 0.2010))
])

train_data = datasets.CIFAR10('./data', train=True, download=True, transform=transform_train)
test_data = datasets.CIFAR10('./data', train=False, transform=transform_test)

train_loader = DataLoader(train_data, batch_size=128, shuffle=True, num_workers=4)
test_loader = DataLoader(test_data, batch_size=128, shuffle=False, num_workers=4)

CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']

# CNN 모델 정의 (VGG 스타일)
class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # 특징 추출부
        self.features = nn.Sequential(
            # Block 1: 3 → 64
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),      # 32x32 → 16x16
            nn.Dropout2d(0.1),

            # Block 2: 64 → 128
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),      # 16x16 → 8x8
            nn.Dropout2d(0.2),

            # Block 3: 128 → 256
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),      # 8x8 → 4x4
        )

        # 분류부
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 4 * 4, 1024),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(1024, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

model = CNN().to(DEVICE)
print(f"모델 파라미터 수: {sum(p.numel() for p in model.parameters()):,}")

8. 순환 신경망(RNN/LSTM) — 시계열 처리

시계열 데이터나 텍스트 처리에 적합한 RNN과 LSTM을 구현합니다.

import torch
import torch.nn as nn
import numpy as np

# LSTM 기반 시계열 예측 모델
class LSTMPredictor(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2,
                 output_size=1, dropout=0.2):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # LSTM 레이어
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,     # 입력: [batch, seq_len, features]
            dropout=dropout if num_layers > 1 else 0,
            bidirectional=False
        )

        # 출력 레이어
        self.fc = nn.Sequential(
            nn.Linear(hidden_size, 32),
            nn.ReLU(),
            nn.Linear(32, output_size)
        )

    def forward(self, x):
        # x shape: [batch_size, seq_len, input_size]
        batch_size = x.size(0)

        # 초기 hidden/cell state (0으로 초기화)
        h0 = torch.zeros(self.num_layers, batch_size,
                         self.hidden_size).to(x.device)
        c0 = torch.zeros(self.num_layers, batch_size,
                         self.hidden_size).to(x.device)

        # LSTM 순전파
        # out: [batch_size, seq_len, hidden_size]
        out, (hn, cn) = self.lstm(x, (h0, c0))

        # 마지막 시퀀스의 출력만 사용
        out = self.fc(out[:, -1, :])  # [batch_size, output_size]
        return out

# 사인파 데이터로 예시
t = np.linspace(0, 100, 1000)
data = np.sin(0.5 * t) + 0.1 * np.random.randn(1000)
data = torch.FloatTensor(data).unsqueeze(1)

# 시퀀스 데이터 생성 함수
def create_sequences(data, seq_len=50):
    X, y = [], []
    for i in range(len(data) - seq_len):
        X.append(data[i:i+seq_len])
        y.append(data[i+seq_len])
    return torch.stack(X), torch.stack(y)

X, y = create_sequences(data, seq_len=50)
print(f"X shape: {X.shape}")  # [950, 50, 1]
print(f"y shape: {y.shape}")  # [950, 1]

# GRU — LSTM보다 파라미터 적은 변형
class GRUPredictor(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, num_layers,
                          batch_first=True, dropout=0.2)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.gru(x)
        return self.fc(out[:, -1, :])

9. Transformer 구현 — Multi-head Attention from Scratch

Attention Is All You Need 논문의 핵심 구성 요소를 직접 구현합니다.

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # 헤드당 차원

        # Q, K, V, 출력 프로젝션
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout)
        self.scale = math.sqrt(self.d_k)

    def split_heads(self, x):
        # x: [batch, seq, d_model] → [batch, num_heads, seq, d_k]
        batch, seq, _ = x.shape
        x = x.view(batch, seq, self.num_heads, self.d_k)
        return x.transpose(1, 2)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # 1. Q, K, V 선형 변환 및 헤드 분리
        Q = self.split_heads(self.W_q(query))  # [B, H, Sq, dk]
        K = self.split_heads(self.W_k(key))    # [B, H, Sk, dk]
        V = self.split_heads(self.W_v(value))  # [B, H, Sk, dk]

        # 2. Scaled Dot-Product Attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        # scores: [B, H, Sq, Sk]

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # 3. Value와 가중합
        context = torch.matmul(attn_weights, V)  # [B, H, Sq, dk]

        # 4. 헤드 합치기
        context = context.transpose(1, 2).contiguous()
        context = context.view(batch_size, -1, self.d_model)

        # 5. 출력 프로젝션
        output = self.W_o(context)
        return output, attn_weights

class FeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.ff = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Pre-Norm 방식 (원 논문은 Post-Norm)
        attn_out, _ = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_out))

        ff_out = self.ff(x)
        x = self.norm2(x + self.dropout(ff_out))
        return x

class PositionalEncoding(nn.Module):
    def __init__(self, d_model=512, max_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

        # 위치 인코딩 계산
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # [1, max_len, d_model]
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

# 사용 예시
d_model = 512
encoder_layer = TransformerEncoderLayer(d_model=d_model, num_heads=8)
pos_enc = PositionalEncoding(d_model=d_model)

x = torch.randn(2, 10, d_model)  # [batch=2, seq=10, d_model=512]
x = pos_enc(x)
output = encoder_layer(x)
print(f"Transformer Encoder 출력: {output.shape}")  # [2, 10, 512]

10. 데이터 로딩 — Dataset, DataLoader

효율적인 데이터 파이프라인 구축은 학습 속도와 직결됩니다.

공식 튜토리얼: https://pytorch.org/tutorials/beginner/basics/intro.html

import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
from PIL import Image
import os

# 커스텀 Dataset 구현
class CustomImageDataset(Dataset):
    def __init__(self, csv_file, img_dir, transform=None):
        """
        csv_file: 이미지 경로와 레이블이 있는 CSV
        img_dir: 이미지 루트 디렉토리
        transform: torchvision transforms
        """
        self.annotations = pd.read_csv(csv_file)
        self.img_dir = img_dir
        self.transform = transform

    def __len__(self):
        # 데이터셋 크기 반환 (필수)
        return len(self.annotations)

    def __getitem__(self, idx):
        # 인덱스로 샘플 반환 (필수)
        img_path = os.path.join(self.img_dir, self.annotations.iloc[idx, 0])
        image = Image.open(img_path).convert('RGB')
        label = int(self.annotations.iloc[idx, 1])

        if self.transform:
            image = self.transform(image)

        return image, label

# 수치 데이터용 Dataset
class TabularDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.FloatTensor(X)
        self.y = torch.LongTensor(y)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# DataLoader 고급 활용
dataset = TabularDataset(
    X=np.random.randn(1000, 20),
    y=np.random.randint(0, 5, 1000)
)

# 기본 DataLoader
basic_loader = DataLoader(dataset, batch_size=32, shuffle=True)

# 고급 설정
advanced_loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,        # 병렬 데이터 로딩 (CPU 코어 수에 맞게)
    pin_memory=True,      # GPU 전송 가속 (CUDA 사용 시)
    drop_last=True,       # 마지막 불완전 배치 제거
    prefetch_factor=2,    # 미리 로드할 배치 수
    persistent_workers=True  # 워커 프로세스 재사용
)

# 배치 확인
for batch_X, batch_y in advanced_loader:
    print(f"배치 X: {batch_X.shape}")  # [64, 20]
    print(f"배치 y: {batch_y.shape}")  # [64]
    break

# WeightedRandomSampler: 클래스 불균형 처리
from torch.utils.data import WeightedRandomSampler

class_counts = [800, 150, 50]  # 클래스별 샘플 수
weights = 1.0 / torch.tensor(class_counts, dtype=torch.float)
# 각 샘플에 클래스 가중치 할당
sample_weights = weights[dataset.y]  # 각 샘플의 가중치

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(dataset),
    replacement=True
)

balanced_loader = DataLoader(dataset, batch_size=32, sampler=sampler)

11. 옵티마이저 — SGD, Adam, AdamW 비교

공식 문서: https://pytorch.org/docs/stable/optim.html

import torch.optim as optim

# 모델 예시
model = nn.Linear(100, 10)

# SGD (Stochastic Gradient Descent)
sgd = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,      # 이전 업데이트 방향 유지
    weight_decay=1e-4, # L2 정규화
    nesterov=True      # Nesterov momentum
)

# Adam: 적응형 학습률
adam = optim.Adam(
    model.parameters(),
    lr=0.001,
    betas=(0.9, 0.999),  # 1차, 2차 모멘트 감쇠율
    eps=1e-8,
    weight_decay=0
)

# AdamW: Adam + 올바른 Weight Decay
# 주의: Adam의 weight_decay는 L2 정규화와 다름
# AdamW가 Transformer 계열 모델에 권장됨
adamw = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    weight_decay=0.01   # 학습률과 독립적으로 적용
)

# RMSprop: 순환 신경망에 효과적
rmsprop = optim.RMSprop(
    model.parameters(),
    lr=0.01,
    alpha=0.99,
    momentum=0.0
)

# 파라미터 그룹별 다른 학습률 설정 (Transfer Learning에 유용)
optimizer = optim.Adam([
    {'params': model.features.parameters(), 'lr': 1e-4},  # 백본: 낮은 LR
    {'params': model.classifier.parameters(), 'lr': 1e-3} # 헤드: 높은 LR
], lr=1e-3)

# 옵티마이저 상태 저장/복원
checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'epoch': 10
}
torch.save(checkpoint, 'checkpoint.pt')

# 복원
ckpt = torch.load('checkpoint.pt')
model.load_state_dict(ckpt['model'])
optimizer.load_state_dict(ckpt['optimizer'])
start_epoch = ckpt['epoch']

12. 학습률 스케줄러

고정 학습률보다 스케줄러를 사용하면 대부분의 경우 성능이 향상됩니다.

import torch.optim as optim
from torch.optim.lr_scheduler import (
    StepLR, MultiStepLR, ExponentialLR,
    CosineAnnealingLR, OneCycleLR,
    ReduceLROnPlateau, CosineAnnealingWarmRestarts
)

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# StepLR: 매 step_size 에포크마다 gamma 배 감소
step_scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
# 0~29: lr=0.1, 30~59: lr=0.01, 60~89: lr=0.001

# MultiStepLR: 지정 에포크에서 감소
multi_scheduler = MultiStepLR(optimizer, milestones=[50, 100, 150], gamma=0.1)

# CosineAnnealingLR: 코사인 주기로 감소
cosine_scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# ReduceLROnPlateau: 검증 손실 개선 없을 때 감소 (가장 실용적)
plateau_scheduler = ReduceLROnPlateau(
    optimizer,
    mode='min',       # 최소화 목표 (loss)
    factor=0.5,       # 감소 비율
    patience=10,      # 개선 없는 에포크 수 허용
    min_lr=1e-7,
    verbose=True
)

# OneCycleLR: 빠른 수렴 (슈퍼 수렴)
one_cycle = OneCycleLR(
    optimizer,
    max_lr=0.01,
    steps_per_epoch=100,  # len(train_loader)
    epochs=30,
    pct_start=0.3,        # 전체의 30%를 warm-up에 사용
    anneal_strategy='cos'
)

# CosineAnnealingWarmRestarts: warm restart로 주기적 리셋
warm_restart = CosineAnnealingWarmRestarts(
    optimizer,
    T_0=10,    # 첫 번째 리셋까지 에포크 수
    T_mult=2,  # 리셋마다 주기 T_mult배 증가
    eta_min=1e-6
)

# 학습 루프에서 스케줄러 사용
for epoch in range(100):
    train_loss = 0.5  # 실제 학습 루프 결과

    # 대부분의 스케줄러: epoch 단위로 step
    cosine_scheduler.step()

    # ReduceLROnPlateau: 검증 지표를 인자로 전달
    plateau_scheduler.step(train_loss)

    # OneCycleLR: 배치 단위로 step
    # for batch in loader:
    #     ...
    #     one_cycle.step()

    print(f"Epoch {epoch+1}: LR = {optimizer.param_groups[0]['lr']:.6f}")

13. 정규화 기법 — Dropout, BatchNorm, LayerNorm

과적합을 방지하고 학습을 안정화하는 정규화 기법들을 정리합니다.

import torch.nn as nn

# Dropout: 학습 시 무작위로 뉴런 비활성화
class DropoutDemo(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(100, 50)
        self.dropout = nn.Dropout(p=0.5)  # 50% 비활성화
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)  # 학습 시만 활성, 평가 시 자동 비활성
        return self.fc2(x)

# model.train() → Dropout 활성화
# model.eval() → Dropout 비활성화

# BatchNorm1d: 미니배치 정규화 (FC 레이어 후)
bn_model = nn.Sequential(
    nn.Linear(100, 64),
    nn.BatchNorm1d(64),  # 배치 차원으로 정규화
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.BatchNorm1d(32),
    nn.ReLU(),
    nn.Linear(32, 10)
)

# BatchNorm2d: 2D feature map (CNN 레이어 후)
cnn_with_bn = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU()
)

# LayerNorm: 특징 차원으로 정규화 (Transformer에 주로 사용)
# BatchNorm과 달리 배치 크기에 독립적
transformer_norm = nn.Sequential(
    nn.Linear(512, 512),
    nn.LayerNorm(512),  # 마지막 차원(512)으로 정규화
    nn.ReLU()
)

# GroupNorm: BatchNorm과 LayerNorm의 절충안 (소규모 배치에 유용)
group_norm = nn.GroupNorm(
    num_groups=8,    # 채널을 8그룹으로 분할
    num_channels=64  # 총 채널 수
)

# InstanceNorm: 스타일 전이 등에 활용
instance_norm = nn.InstanceNorm2d(64)

# 정규화 방법 비교 요약:
# BatchNorm  : 배치 × 공간 정규화 → CNN에 효과적, 배치 크기 의존
# LayerNorm  : 특징 차원 정규화 → Transformer, RNN에 효과적
# GroupNorm  : 소규모 배치에서 BatchNorm 대안
# InstanceNorm: 스타일 전이, 이미지 생성에 활용

14. 전이학습(Transfer Learning)

ImageNet으로 사전학습된 모델을 활용하여 적은 데이터로도 높은 성능을 냅니다.

import torchvision.models as models
import torch.nn as nn

# 사전학습 모델 로딩
# weights 인자로 명시적 지정 권장 (최신 API)
resnet50 = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
efficientnet = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
vit = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)

print(resnet50)  # 구조 출력

# 방법 1: Feature Extractor (백본 동결)
# 백본 파라미터 고정
for param in resnet50.parameters():
    param.requires_grad = False

# 마지막 FC 레이어만 교체 (새 클래스 수에 맞게)
num_classes = 5
resnet50.fc = nn.Linear(resnet50.fc.in_features, num_classes)

# 마지막 레이어만 학습됨
trainable = sum(p.numel() for p in resnet50.parameters() if p.requires_grad)
print(f"학습 가능 파라미터: {trainable:,}")  # 약 2,050개만

# 방법 2: Fine-tuning (전체 또는 일부 레이어 학습)
resnet_ft = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
resnet_ft.fc = nn.Linear(resnet_ft.fc.in_features, num_classes)

# 레이어별 다른 학습률 (낮은 레이어 = 낮은 LR)
optimizer = torch.optim.AdamW([
    {'params': resnet_ft.layer1.parameters(), 'lr': 1e-5},
    {'params': resnet_ft.layer2.parameters(), 'lr': 1e-5},
    {'params': resnet_ft.layer3.parameters(), 'lr': 1e-4},
    {'params': resnet_ft.layer4.parameters(), 'lr': 1e-4},
    {'params': resnet_ft.fc.parameters(),     'lr': 1e-3},
], lr=1e-4, weight_decay=0.01)

# 방법 3: torchvision transforms로 데이터 전처리
from torchvision import transforms

# 사전학습 모델의 입력 정규화 값 (ImageNet 기준)
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

15. 모델 저장과 로딩

학습한 모델을 저장하고 재사용하는 방법을 알아봅니다.

import torch
import torch.nn as nn

model = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters())

# 방법 1: state_dict 저장 (권장)
# 모델 파라미터만 저장 (아키텍처 제외)
torch.save(model.state_dict(), 'model_weights.pt')

# 로딩
loaded_model = nn.Linear(10, 5)  # 동일한 아키텍처 필요
loaded_model.load_state_dict(torch.load('model_weights.pt',
                                         weights_only=True))
loaded_model.eval()

# 방법 2: 전체 모델 저장 (권장하지 않음 — 이식성 낮음)
torch.save(model, 'full_model.pt')
loaded_full = torch.load('full_model.pt', weights_only=False)

# 방법 3: 체크포인트 — 학습 재개를 위한 완전한 상태 저장
def save_checkpoint(model, optimizer, scheduler, epoch, loss, path):
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scheduler_state_dict': scheduler.state_dict() if scheduler else None,
        'loss': loss,
    }, path)
    print(f"체크포인트 저장: {path}")

def load_checkpoint(path, model, optimizer=None, scheduler=None):
    checkpoint = torch.load(path, map_location='cpu', weights_only=True)
    model.load_state_dict(checkpoint['model_state_dict'])

    if optimizer:
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    if scheduler and checkpoint['scheduler_state_dict']:
        scheduler.load_state_dict(checkpoint['scheduler_state_dict'])

    return checkpoint['epoch'], checkpoint['loss']

# 사용 예시
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
save_checkpoint(model, optimizer, scheduler, epoch=50, loss=0.25,
                path='checkpoint_ep50.pt')

start_epoch, prev_loss = load_checkpoint('checkpoint_ep50.pt',
                                          model, optimizer, scheduler)
print(f"재개: epoch={start_epoch}, loss={prev_loss:.4f}")

# GPU 모델을 CPU로 로딩
model_cpu = nn.Linear(10, 5)
model_cpu.load_state_dict(
    torch.load('model_weights.pt', map_location='cpu', weights_only=True)
)

16. TorchScript와 모델 배포

학습된 모델을 프로덕션 환경에 배포하는 방법을 다룹니다.

import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 5)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = SimpleNet()
model.eval()

# 방법 1: torch.jit.script — 전체 모델 컴파일
scripted_model = torch.jit.script(model)

# 저장 및 로딩
scripted_model.save('model_scripted.pt')
loaded_scripted = torch.jit.load('model_scripted.pt')

x = torch.randn(4, 10)
with torch.no_grad():
    out = loaded_scripted(x)
print(f"TorchScript 출력: {out.shape}")

# 방법 2: torch.jit.trace — 예제 입력으로 추적
example_input = torch.randn(1, 10)
traced_model = torch.jit.trace(model, example_input)
traced_model.save('model_traced.pt')

# 방법 3: ONNX 내보내기 (다른 프레임워크 호환)
import torch.onnx

dummy_input = torch.randn(1, 10)
torch.onnx.export(
    model,
    dummy_input,
    'model.onnx',
    export_params=True,
    opset_version=17,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)
print("ONNX 내보내기 완료")

# torch.compile (PyTorch 2.0+): 최신 컴파일 방식
# 기존 코드 변경 없이 적용 가능
compiled_model = torch.compile(model)
out = compiled_model(x)
print(f"torch.compile 출력: {out.shape}")

17. 분산 학습(DDP) — DistributedDataParallel

여러 GPU를 활용해 학습 속도를 크게 높이는 방법입니다.

공식 튜토리얼: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

# train_ddp.py — 단독 실행 스크립트로 작성
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from torchvision import datasets, transforms

def setup(rank, world_size):
    """프로세스 그룹 초기화"""
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # nccl: GPU 통신 백엔드 (권장)
    # gloo: CPU 또는 디버깅용
    dist.init_process_group(
        backend='nccl',
        rank=rank,
        world_size=world_size
    )

def cleanup():
    """프로세스 그룹 정리"""
    dist.destroy_process_group()

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.net(x.view(x.size(0), -1))

def train(rank, world_size, num_epochs=5):
    print(f"프로세스 {rank}/{world_size} 시작")
    setup(rank, world_size)

    # 각 프로세스에 GPU 할당
    torch.cuda.set_device(rank)
    device = torch.device(f'cuda:{rank}')

    # 모델 생성 및 DDP 래핑
    model = SimpleModel().to(device)
    ddp_model = DDP(model, device_ids=[rank])

    # 데이터셋과 분산 샘플러
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    dataset = datasets.MNIST('./data', train=True, download=True,
                              transform=transform)

    # DistributedSampler: 각 프로세스에 다른 데이터 할당
    sampler = DistributedSampler(
        dataset,
        num_replicas=world_size,
        rank=rank,
        shuffle=True
    )

    loader = DataLoader(
        dataset,
        batch_size=128,
        sampler=sampler,
        num_workers=4,
        pin_memory=True
    )

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(ddp_model.parameters(), lr=0.001)

    for epoch in range(num_epochs):
        # 매 에포크마다 샘플러 시드 업데이트 (데이터 섞기)
        sampler.set_epoch(epoch)

        ddp_model.train()
        total_loss = 0.0

        for batch_idx, (data, target) in enumerate(loader):
            data, target = data.to(device), target.to(device)

            optimizer.zero_grad()
            output = ddp_model(data)
            loss = criterion(output, target)
            loss.backward()  # 자동으로 그래디언트 동기화
            optimizer.step()

            total_loss += loss.item()

        # rank 0에서만 로그 출력
        if rank == 0:
            avg_loss = total_loss / len(loader)
            print(f"Epoch {epoch+1}: Average Loss = {avg_loss:.4f}")

    cleanup()

# 실행: torchrun --nproc_per_node=4 train_ddp.py
if __name__ == '__main__':
    import torch.multiprocessing as mp
    world_size = torch.cuda.device_count()
    mp.spawn(
        train,
        args=(world_size, 5),
        nprocs=world_size,
        join=True
    )

torchrun으로 실행

# 단일 노드 4 GPU 학습
torchrun --nproc_per_node=4 train_ddp.py

# 다중 노드 학습 (노드 0)
torchrun --nnodes=2 --nproc_per_node=4 \
         --node_rank=0 \
         --master_addr="192.168.1.100" \
         --master_port=12355 \
         train_ddp.py

DataParallel vs DistributedDataParallel

# DataParallel (DP): 단순하지만 비효율적
# - 모든 그래디언트가 GPU 0으로 집결 → 병목
# - 멀티 프로세스가 아닌 멀티 스레드 방식
model_dp = nn.DataParallel(model, device_ids=[0, 1, 2, 3])

# DistributedDataParallel (DDP): 권장 방식
# - 각 GPU가 독립적으로 그래디언트 계산
# - All-Reduce로 효율적 동기화
# - 단일 GPU에서도 DDP가 빠름 (Python GIL 회피)
model_ddp = DDP(model, device_ids=[rank])

18. 고급 기법 모음

혼합 정밀도 학습 (Mixed Precision)

from torch.cuda.amp import autocast, GradScaler

model = SimpleModel().to('cuda')
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler()  # FP16 손실 스케일링

for data, target in train_loader:
    data, target = data.to('cuda'), target.to('cuda')
    optimizer.zero_grad()

    # FP16으로 순전파
    with autocast():
        output = model(data)
        loss = criterion(output, target)

    # 스케일된 역전파
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

그래디언트 클리핑

# 폭발하는 그래디언트 방지
max_grad_norm = 1.0
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
optimizer.step()

재현성(Reproducibility) 설정

import random
import numpy as np

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # 완전 재현성을 위해 (성능 저하 있음)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)

마치며

이 가이드에서는 PyTorch의 핵심 개념부터 실용적인 분산 학습까지 체계적으로 살펴봤습니다. 다음은 학습 로드맵입니다.

기초 단계: 텐서 조작, Autograd, 단순 모델 구현
중급 단계: CNN, RNN, Transfer Learning, DataLoader 최적화
고급 단계: Transformer, 분산 학습(DDP), Mixed Precision
배포 단계: TorchScript, ONNX, torch.compile

PyTorch 생태계는 지속적으로 발전하고 있습니다. 최신 기능과 업데이트는 공식 문서와 PyTorch 블로그를 참조하세요.

참고 자료

PyTorch Complete Guide: Zero to Hero — From Tensors to Distributed Training

Introduction
1. Environment Setup
- Installing PyTorch
- Verifying GPU Availability
2. Tensor Basics
- Creating Tensors
- Tensor Attributes and Type Conversion
- Reshaping Tensors
- Tensor Operations
- Broadcasting
- Indexing and Slicing
3. Automatic Differentiation (Autograd)
- requires_grad and Computational Graph
- Gradients for Multi-dimensional Tensors
- Gradient Control
- Higher-order Gradients
4. nn.Module — The Foundation of Neural Networks
- Sequential, ModuleList, ModuleDict
5. Linear Regression from Scratch
6. Multi-Layer Perceptron (MLP) — MNIST Classification
7. Convolutional Neural Network (CNN) — CIFAR-10 Classification
8. Recurrent Neural Networks (RNN / LSTM) — Sequential Data
9. Transformer — Multi-head Attention from Scratch
10. Data Loading — Dataset and DataLoader
11. Optimizers — SGD, Adam, AdamW
12. Learning Rate Schedulers
13. Regularization — Dropout, BatchNorm, LayerNorm
14. Transfer Learning
15. Saving and Loading Models
16. TorchScript and Model Deployment
17. Distributed Training (DDP) — DistributedDataParallel
- Launching with torchrun
- DataParallel vs DistributedDataParallel
18. Advanced Techniques
- Mixed Precision Training
- Gradient Clipping
- Reproducibility
Conclusion
- References

Introduction

Of the two dominant deep learning frameworks — TensorFlow and PyTorch — PyTorch has become the preferred choice for researchers and engineers alike. Released by Facebook AI Research (now Meta AI) in 2016, PyTorch quickly became the standard for implementing academic papers and now surpasses TensorFlow in industrial adoption as well.

This guide targets readers with basic Python knowledge and walks through everything from first contact with PyTorch all the way to distributed training. Each section includes runnable code examples and links to the official documentation so you can read and practice simultaneously.

Official docs: https://pytorch.org/docs/stable/index.html Official tutorials: https://pytorch.org/tutorials/

1. Environment Setup

Installing PyTorch

PyTorch can be installed via pip or conda. To use a GPU, select the package matching your CUDA version.

pip install (CUDA 12.1):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

conda install:

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

CPU-only install:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

Verifying GPU Availability

import torch

# Check PyTorch version
print(f"PyTorch version: {torch.__version__}")

# Check CUDA availability
print(f"CUDA available: {torch.cuda.is_available()}")

# GPU count and info
if torch.cuda.is_available():
    print(f"GPU count: {torch.cuda.device_count()}")
    print(f"Current GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Apple Silicon (M1/M2/M3) MPS check
print(f"MPS available: {torch.backends.mps.is_available()}")

# Auto-select device
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Using device: {device}")

2. Tensor Basics

Tensors are the foundational data structure of PyTorch. They are similar to NumPy's ndarray but support GPU computation and automatic differentiation.

Official docs: https://pytorch.org/docs/stable/tensors.html

Creating Tensors

import torch
import numpy as np

# From data directly
t1 = torch.tensor([1, 2, 3, 4, 5])
print(f"1D tensor: {t1}, shape: {t1.shape}, dtype: {t1.dtype}")

# 2D tensor (matrix)
t2 = torch.tensor([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
print(f"2D tensor:\n{t2}, shape: {t2.shape}")

# Special tensor creation
zeros = torch.zeros(3, 4)           # all zeros
ones = torch.ones(2, 3)             # all ones
rand = torch.rand(3, 3)             # uniform [0, 1)
randn = torch.randn(3, 3)           # standard normal
eye = torch.eye(4)                   # identity matrix
arange = torch.arange(0, 10, 2)     # [0, 2, 4, 6, 8]
linspace = torch.linspace(0, 1, 5)  # 5 evenly spaced values

# Create with same shape as existing tensor
t3 = torch.zeros_like(t2)
t4 = torch.ones_like(t2)
t5 = torch.rand_like(t2)

# From NumPy array (shared memory)
np_arr = np.array([1.0, 2.0, 3.0])
t_from_np = torch.from_numpy(np_arr)

# Tensor to NumPy (CPU only)
np_from_t = t1.numpy()

Tensor Attributes and Type Conversion

t = torch.rand(3, 4, 5)

print(f"shape: {t.shape}")      # torch.Size([3, 4, 5])
print(f"ndim: {t.ndim}")        # 3
print(f"dtype: {t.dtype}")      # torch.float32
print(f"device: {t.device}")    # cpu
print(f"numel: {t.numel()}")    # 60 (total elements)

# Type conversion
t_int    = t.to(torch.int32)
t_long   = t.long()     # torch.int64
t_float  = t.float()    # torch.float32
t_double = t.double()   # torch.float64
t_half   = t.half()     # torch.float16

# Move to GPU
if torch.cuda.is_available():
    t_gpu  = t.to("cuda")
    t_back = t_gpu.cpu()  # back to CPU

Reshaping Tensors

t = torch.arange(24)  # 1D tensor 0..23

t_2d   = t.reshape(4, 6)
t_3d   = t.reshape(2, 3, 4)
t_auto = t.reshape(6, -1)   # -1 infers the size (6x4)

# view: like reshape but requires contiguous memory
t_view = t.view(3, 8)

# squeeze / unsqueeze
t = torch.zeros(1, 3, 1, 4)
t_sq    = t.squeeze()         # remove size-1 dims → [3, 4]
t_sq1   = t.squeeze(0)        # remove dim 0 only → [3, 1, 4]
t_unsq  = t_sq.unsqueeze(0)   # add dim at 0 → [1, 3, 4]

# transpose / permute
t = torch.rand(2, 3, 4)
t_T    = t.transpose(0, 1)      # [3, 2, 4]
t_perm = t.permute(2, 0, 1)     # [4, 2, 3]
t_cont = t_perm.contiguous()    # ensure contiguous memory

Tensor Operations

a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([[5.0, 6.0], [7.0, 8.0]])

# Element-wise arithmetic
print(a + b)   # torch.add(a, b)
print(a - b)   # torch.sub(a, b)
print(a * b)   # Hadamard product
print(a / b)
print(a ** 2)

# Matrix multiplication
matmul = a @ b          # or torch.matmul(a, b)
mm     = torch.mm(a, b) # 2D only

# Reduction
t = torch.rand(3, 4)
print(t.sum())
print(t.mean())
print(t.max())
print(t.std())
print(t.sum(dim=0))              # sum along rows
print(t.sum(dim=1, keepdim=True))

# argmax / argmin
print(t.argmax())
print(t.argmax(dim=1))

Broadcasting

a = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])   # shape: [2, 3]
b = torch.tensor([10, 20, 30])  # shape: [3]

# b is broadcast to [2, 3]
print(a + b)
# tensor([[11, 22, 33],
#         [14, 25, 36]])

# Column vector + row vector
col = torch.tensor([[1], [2], [3]])  # [3, 1]
row = torch.tensor([10, 20, 30])      # [3]
print(col + row)  # [3, 3] outer-sum

Indexing and Slicing

t = torch.arange(24).reshape(2, 3, 4).float()

print(t[0])          # first matrix [3, 4]
print(t[0, 1])       # second row [4]
print(t[0, 1, 2])    # scalar

print(t[:, 1:, :2])  # slicing

# Fancy indexing
indices = torch.tensor([0, 2])
print(t[:, indices, :])

# Boolean masking
mask   = t > 10
print(t[mask])       # 1D tensor of elements > 10

# torch.where
a = torch.tensor([1.0, 2.0, 3.0, 4.0])
b = torch.tensor([10.0, 20.0, 30.0, 40.0])
print(torch.where(a > 2, b, a))  # tensor([ 1.,  2., 30., 40.])

3. Automatic Differentiation (Autograd)

Autograd automatically builds a computational graph and computes gradients via backpropagation — the engine behind all neural network training.

Official docs: https://pytorch.org/docs/stable/autograd.html

requires_grad and Computational Graph

import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)

z = x ** 2 + 2 * x * y + y ** 2  # (x + y)^2 = 49

z.backward()

# dz/dx = 2x + 2y = 2*3 + 2*4 = 14
print(f"dz/dx = {x.grad}")  # 14.0
print(f"dz/dy = {y.grad}")  # 14.0

Gradients for Multi-dimensional Tensors

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2
z = y.sum()

z.backward()
print(f"x.grad: {x.grad}")  # [2, 4, 6]  (dz/dx = 2x)

# Non-scalar backward with gradient argument
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2
grad_output = torch.ones(3)
y.backward(gradient=grad_output)
print(f"x.grad: {x.grad}")  # [2, 4, 6]

Gradient Control

x = torch.tensor(2.0, requires_grad=True)

for i in range(3):
    y = x ** 2
    y.backward()
    print(f"iteration {i}: x.grad = {x.grad}")
    x.grad.zero_()  # IMPORTANT: reset gradient every step

# no_grad: disable gradient tracking for inference
with torch.no_grad():
    y = x ** 2
    print(f"y.requires_grad: {y.requires_grad}")  # False

# detach: separate tensor from the graph
x = torch.tensor([1.0, 2.0], requires_grad=True)
z = (x * 2).detach()
print(f"z.requires_grad: {z.requires_grad}")  # False

Higher-order Gradients

x = torch.tensor(3.0, requires_grad=True)
y = x ** 4

# First derivative: dy/dx = 4x^3
dy_dx = torch.autograd.grad(y, x, create_graph=True)[0]
print(f"First derivative: {dy_dx}")   # 108

# Second derivative: d2y/dx2 = 12x^2
d2y_dx2 = torch.autograd.grad(dy_dx, x)[0]
print(f"Second derivative: {d2y_dx2}")  # 108

4. nn.Module — The Foundation of Neural Networks

torch.nn.Module is the base class for all PyTorch models. Every layer, activation function, and complete model inherits from it.

Official docs: https://pytorch.org/docs/stable/nn.html

import torch
import torch.nn as nn

class SimpleModel(nn.Module):
    def __init__(self, in_features, hidden_size, out_features):
        super().__init__()
        # Layers are automatically registered as parameters
        self.fc1     = nn.Linear(in_features, hidden_size)
        self.relu    = nn.ReLU()
        self.fc2     = nn.Linear(hidden_size, out_features)
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

model = SimpleModel(784, 256, 10)
print(model)

# Count parameters
total     = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total params: {total:,}")
print(f"Trainable params: {trainable:,}")

# Iterate named parameters
for name, param in model.named_parameters():
    print(f"{name}: {param.shape}")

# Forward pass
x      = torch.randn(32, 784)
output = model(x)
print(f"Output shape: {output.shape}")  # [32, 10]

Sequential, ModuleList, ModuleDict

# Sequential: stack layers in order
seq_model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

# ModuleList: manage layers as a list
class ResidualNet(nn.Module):
    def __init__(self, num_blocks, hidden_size):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(hidden_size, hidden_size)
            for _ in range(num_blocks)
        ])
        self.relu = nn.ReLU()

    def forward(self, x):
        for layer in self.layers:
            x = self.relu(layer(x)) + x  # residual connection
        return x

# ModuleDict: manage layers as a dictionary
class MultiTaskModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(784, 256)
        self.heads    = nn.ModuleDict({
            'classification': nn.Linear(256, 10),
            'regression':     nn.Linear(256, 1)
        })

    def forward(self, x, task='classification'):
        features = torch.relu(self.backbone(x))
        return self.heads[task](features)

5. Linear Regression from Scratch

Linear regression is the simplest deep learning model. Building it from scratch solidifies understanding of the training loop.

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)
n_samples = 200

# Generate synthetic data: y = 3x + 2 + noise
X      = torch.linspace(-5, 5, n_samples).unsqueeze(1)
y_true = 3 * X + 2
y      = y_true + torch.randn_like(y_true) * 0.5

class LinearRegression(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

model     = LinearRegression()
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

n_epochs = 1000
losses   = []

for epoch in range(n_epochs):
    # 1. Forward pass
    y_pred = model(X)

    # 2. Compute loss
    loss = criterion(y_pred, y)
    losses.append(loss.item())

    # 3. Zero gradients (critical!)
    optimizer.zero_grad()

    # 4. Backward pass
    loss.backward()

    # 5. Update parameters
    optimizer.step()

    if (epoch + 1) % 200 == 0:
        w = model.linear.weight.item()
        b = model.linear.bias.item()
        print(f"Epoch {epoch+1}: Loss={loss.item():.4f}, w={w:.4f}, b={b:.4f}")

print(f"\nLearned weight: {model.linear.weight.item():.4f} (true: 3.0)")
print(f"Learned bias:   {model.linear.bias.item():.4f}   (true: 2.0)")

6. Multi-Layer Perceptron (MLP) — MNIST Classification

Building a complete classification model on the MNIST handwritten digit dataset.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

BATCH_SIZE    = 64
LEARNING_RATE = 0.001
N_EPOCHS      = 10
DEVICE        = torch.device("cuda" if torch.cuda.is_available() else "cpu")

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_dataset = datasets.MNIST('./data', train=True,  download=True, transform=transform)
test_dataset  = datasets.MNIST('./data', train=False,                transform=transform)

train_loader  = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True,  num_workers=2)
test_loader   = DataLoader(test_dataset,  batch_size=BATCH_SIZE, shuffle=False, num_workers=2)

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Flatten(),
            nn.Linear(784, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.network(x)

model     = MLP().to(DEVICE)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

def train_epoch(model, loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct    = 0
    total      = 0

    for data, target in loader:
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss   = criterion(output, target)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        pred        = output.argmax(dim=1)
        correct    += pred.eq(target).sum().item()
        total      += target.size(0)

    return total_loss / len(loader), 100.0 * correct / total

def evaluate(model, loader, criterion, device):
    model.eval()
    total_loss = 0
    correct    = 0
    total      = 0

    with torch.no_grad():
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            output      = model(data)
            total_loss += criterion(output, target).item()
            pred        = output.argmax(dim=1)
            correct    += pred.eq(target).sum().item()
            total      += target.size(0)

    return total_loss / len(loader), 100.0 * correct / total

for epoch in range(N_EPOCHS):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, DEVICE)
    test_loss,  test_acc  = evaluate(model, test_loader, criterion, DEVICE)
    print(f"Epoch {epoch+1}/{N_EPOCHS} | "
          f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}% | "
          f"Test Loss: {test_loss:.4f}, Test Acc: {test_acc:.2f}%")

7. Convolutional Neural Network (CNN) — CIFAR-10 Classification

Implementing a VGG-style CNN for image classification on CIFAR-10.

import torch
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2023, 0.1994, 0.2010))
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2023, 0.1994, 0.2010))
])

train_data = datasets.CIFAR10('./data', train=True,  download=True, transform=transform_train)
test_data  = datasets.CIFAR10('./data', train=False, transform=transform_test)

train_loader = DataLoader(train_data, batch_size=128, shuffle=True,  num_workers=4)
test_loader  = DataLoader(test_data,  batch_size=128, shuffle=False, num_workers=4)

CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']

class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 3 → 64, 32x32 → 16x16
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Dropout2d(0.1),

            # Block 2: 64 → 128, 16x16 → 8x8
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Dropout2d(0.2),

            # Block 3: 128 → 256, 8x8 → 4x4
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
        )

        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 4 * 4, 1024),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(1024, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

model = CNN().to(DEVICE)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

8. Recurrent Neural Networks (RNN / LSTM) — Sequential Data

RNNs and LSTMs excel at time-series data and natural language processing tasks.

import torch
import torch.nn as nn
import numpy as np

class LSTMPredictor(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2,
                 output_size=1, dropout=0.2):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers  = num_layers

        self.lstm = nn.LSTM(
            input_size  = input_size,
            hidden_size = hidden_size,
            num_layers  = num_layers,
            batch_first = True,   # input: [batch, seq_len, features]
            dropout     = dropout if num_layers > 1 else 0,
            bidirectional = False
        )

        self.fc = nn.Sequential(
            nn.Linear(hidden_size, 32),
            nn.ReLU(),
            nn.Linear(32, output_size)
        )

    def forward(self, x):
        batch_size = x.size(0)

        h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(x.device)
        c0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(x.device)

        # out: [batch_size, seq_len, hidden_size]
        out, (hn, cn) = self.lstm(x, (h0, c0))

        # Use only the last time step
        out = self.fc(out[:, -1, :])
        return out

# Generate sine wave dataset
t    = np.linspace(0, 100, 1000)
data = np.sin(0.5 * t) + 0.1 * np.random.randn(1000)
data = torch.FloatTensor(data).unsqueeze(1)

def create_sequences(data, seq_len=50):
    X, y = [], []
    for i in range(len(data) - seq_len):
        X.append(data[i:i+seq_len])
        y.append(data[i+seq_len])
    return torch.stack(X), torch.stack(y)

X, y = create_sequences(data, seq_len=50)
print(f"X shape: {X.shape}")  # [950, 50, 1]
print(f"y shape: {y.shape}")  # [950, 1]

# GRU — fewer parameters than LSTM
class GRUPredictor(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, num_layers,
                          batch_first=True, dropout=0.2)
        self.fc  = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.gru(x)
        return self.fc(out[:, -1, :])

9. Transformer — Multi-head Attention from Scratch

Implementing the key components of the "Attention Is All You Need" paper.

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0

        self.d_model   = d_model
        self.num_heads = num_heads
        self.d_k       = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout)
        self.scale   = math.sqrt(self.d_k)

    def split_heads(self, x):
        # [batch, seq, d_model] → [batch, num_heads, seq, d_k]
        batch, seq, _ = x.shape
        x = x.view(batch, seq, self.num_heads, self.d_k)
        return x.transpose(1, 2)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        Q = self.split_heads(self.W_q(query))
        K = self.split_heads(self.W_k(key))
        V = self.split_heads(self.W_v(value))

        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        context = torch.matmul(attn_weights, V)
        context = context.transpose(1, 2).contiguous()
        context = context.view(batch_size, -1, self.d_model)

        output = self.W_o(context)
        return output, attn_weights

class FeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.ff        = FeedForward(d_model, d_ff, dropout)
        self.norm1     = nn.LayerNorm(d_model)
        self.norm2     = nn.LayerNorm(d_model)
        self.dropout   = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        attn_out, _ = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_out))
        ff_out = self.ff(x)
        x = self.norm2(x + self.dropout(ff_out))
        return x

class PositionalEncoding(nn.Module):
    def __init__(self, d_model=512, max_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

        pe       = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

# Example usage
d_model      = 512
encoder_layer = TransformerEncoderLayer(d_model=d_model, num_heads=8)
pos_enc       = PositionalEncoding(d_model=d_model)

x      = torch.randn(2, 10, d_model)
x      = pos_enc(x)
output = encoder_layer(x)
print(f"Transformer Encoder output: {output.shape}")  # [2, 10, 512]

10. Data Loading — Dataset and DataLoader

An efficient data pipeline is directly tied to training speed and flexibility.

Official tutorial: https://pytorch.org/tutorials/beginner/basics/intro.html

import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
from PIL import Image
import os

# Custom image Dataset
class CustomImageDataset(Dataset):
    def __init__(self, csv_file, img_dir, transform=None):
        self.annotations = pd.read_csv(csv_file)
        self.img_dir     = img_dir
        self.transform   = transform

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.annotations.iloc[idx, 0])
        image    = Image.open(img_path).convert('RGB')
        label    = int(self.annotations.iloc[idx, 1])
        if self.transform:
            image = self.transform(image)
        return image, label

# Tabular Dataset
class TabularDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.FloatTensor(X)
        self.y = torch.LongTensor(y)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

dataset = TabularDataset(
    X=np.random.randn(1000, 20),
    y=np.random.randint(0, 5, 1000)
)

# Advanced DataLoader settings
advanced_loader = DataLoader(
    dataset,
    batch_size       = 64,
    shuffle          = True,
    num_workers      = 4,       # parallel data loading
    pin_memory       = True,    # faster GPU transfer
    drop_last        = True,    # drop incomplete last batch
    prefetch_factor  = 2,
    persistent_workers = True
)

for batch_X, batch_y in advanced_loader:
    print(f"batch X: {batch_X.shape}")  # [64, 20]
    print(f"batch y: {batch_y.shape}")  # [64]
    break

# WeightedRandomSampler: handle class imbalance
from torch.utils.data import WeightedRandomSampler

class_counts   = [800, 150, 50]
weights        = 1.0 / torch.tensor(class_counts, dtype=torch.float)
sample_weights = weights[dataset.y]

sampler = WeightedRandomSampler(
    weights     = sample_weights,
    num_samples = len(dataset),
    replacement = True
)

balanced_loader = DataLoader(dataset, batch_size=32, sampler=sampler)

11. Optimizers — SGD, Adam, AdamW

Official docs: https://pytorch.org/docs/stable/optim.html

import torch.optim as optim

model = nn.Linear(100, 10)

# SGD with momentum and weight decay
sgd = optim.SGD(
    model.parameters(),
    lr           = 0.01,
    momentum     = 0.9,
    weight_decay = 1e-4,
    nesterov     = True
)

# Adam: adaptive learning rates
adam = optim.Adam(
    model.parameters(),
    lr           = 0.001,
    betas        = (0.9, 0.999),
    eps          = 1e-8,
    weight_decay = 0
)

# AdamW: correct decoupled weight decay (recommended for Transformers)
adamw = optim.AdamW(
    model.parameters(),
    lr           = 1e-3,
    betas        = (0.9, 0.999),
    weight_decay = 0.01
)

# Per-parameter learning rates (useful for Transfer Learning)
optimizer = optim.Adam([
    {'params': model.features.parameters(),   'lr': 1e-4},
    {'params': model.classifier.parameters(), 'lr': 1e-3},
], lr=1e-3)

# Save and restore optimizer state
checkpoint = {
    'model':     model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'epoch':     10
}
torch.save(checkpoint, 'checkpoint.pt')

ckpt = torch.load('checkpoint.pt')
model.load_state_dict(ckpt['model'])
optimizer.load_state_dict(ckpt['optimizer'])

12. Learning Rate Schedulers

Using a scheduler almost always improves final performance compared to a fixed learning rate.

from torch.optim.lr_scheduler import (
    StepLR, CosineAnnealingLR, OneCycleLR, ReduceLROnPlateau,
    CosineAnnealingWarmRestarts
)

model     = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# StepLR: multiply LR by gamma every step_size epochs
step_scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# CosineAnnealingLR: cosine decay
cosine_scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# ReduceLROnPlateau: reduce when metric stops improving
plateau_scheduler = ReduceLROnPlateau(
    optimizer,
    mode      = 'min',
    factor    = 0.5,
    patience  = 10,
    min_lr    = 1e-7,
    verbose   = True
)

# OneCycleLR: super-convergence
one_cycle = OneCycleLR(
    optimizer,
    max_lr           = 0.01,
    steps_per_epoch  = 100,
    epochs           = 30,
    pct_start        = 0.3,
    anneal_strategy  = 'cos'
)

# CosineAnnealingWarmRestarts: periodic restarts
warm_restart = CosineAnnealingWarmRestarts(
    optimizer,
    T_0     = 10,
    T_mult  = 2,
    eta_min = 1e-6
)

# Usage in training loop
for epoch in range(100):
    train_loss = 0.5  # from actual training

    cosine_scheduler.step()
    plateau_scheduler.step(train_loss)  # pass metric

    print(f"Epoch {epoch+1}: LR = {optimizer.param_groups[0]['lr']:.6f}")

13. Regularization — Dropout, BatchNorm, LayerNorm

Regularization prevents overfitting and stabilizes training.

import torch.nn as nn

# Dropout: randomly zero out neurons during training
class DropoutDemo(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1     = nn.Linear(100, 50)
        self.dropout = nn.Dropout(p=0.5)
        self.fc2     = nn.Linear(50, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)   # active during train(), inactive during eval()
        return self.fc2(x)

# BatchNorm1d: normalize over the batch dimension (for FC layers)
bn_model = nn.Sequential(
    nn.Linear(100, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.BatchNorm1d(32),
    nn.ReLU(),
    nn.Linear(32, 10)
)

# BatchNorm2d: for 2D feature maps (after Conv layers)
cnn_bn = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
)

# LayerNorm: normalize over feature dimension (preferred for Transformers)
transformer_norm = nn.Sequential(
    nn.Linear(512, 512),
    nn.LayerNorm(512),
    nn.ReLU()
)

# GroupNorm: a middle ground between BatchNorm and LayerNorm
group_norm = nn.GroupNorm(num_groups=8, num_channels=64)

# InstanceNorm: used in style transfer
instance_norm = nn.InstanceNorm2d(64)

# Summary:
# BatchNorm   → CNN, batch-level statistics, depends on batch size
# LayerNorm   → Transformers / RNNs, feature-level statistics
# GroupNorm   → small batches where BatchNorm is unstable
# InstanceNorm → style transfer, image generation

14. Transfer Learning

Leveraging ImageNet-pretrained models to achieve high performance with limited data.

import torchvision.models as models
import torch.nn as nn

# Load pretrained models
resnet50    = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
efficientnet = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
vit         = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)

# Strategy 1: Feature Extractor (freeze backbone)
for param in resnet50.parameters():
    param.requires_grad = False

num_classes  = 5
resnet50.fc  = nn.Linear(resnet50.fc.in_features, num_classes)

trainable = sum(p.numel() for p in resnet50.parameters() if p.requires_grad)
print(f"Trainable params: {trainable:,}")  # ~2,050

# Strategy 2: Fine-tuning with layer-wise learning rates
resnet_ft    = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
resnet_ft.fc = nn.Linear(resnet_ft.fc.in_features, num_classes)

optimizer = torch.optim.AdamW([
    {'params': resnet_ft.layer1.parameters(), 'lr': 1e-5},
    {'params': resnet_ft.layer2.parameters(), 'lr': 1e-5},
    {'params': resnet_ft.layer3.parameters(), 'lr': 1e-4},
    {'params': resnet_ft.layer4.parameters(), 'lr': 1e-4},
    {'params': resnet_ft.fc.parameters(),     'lr': 1e-3},
], lr=1e-4, weight_decay=0.01)

# ImageNet normalization for preprocessing
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std =[0.229, 0.224, 0.225])
])

15. Saving and Loading Models

import torch
import torch.nn as nn

model     = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters())

# Option 1: state_dict (recommended)
torch.save(model.state_dict(), 'model_weights.pt')

loaded_model = nn.Linear(10, 5)
loaded_model.load_state_dict(torch.load('model_weights.pt', weights_only=True))
loaded_model.eval()

# Option 2: full model (not recommended — low portability)
torch.save(model, 'full_model.pt')

# Option 3: checkpoint — save full training state
def save_checkpoint(model, optimizer, scheduler, epoch, loss, path):
    torch.save({
        'epoch':                epoch,
        'model_state_dict':     model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scheduler_state_dict': scheduler.state_dict() if scheduler else None,
        'loss':                 loss,
    }, path)

def load_checkpoint(path, model, optimizer=None, scheduler=None):
    ckpt = torch.load(path, map_location='cpu', weights_only=True)
    model.load_state_dict(ckpt['model_state_dict'])
    if optimizer:
        optimizer.load_state_dict(ckpt['optimizer_state_dict'])
    if scheduler and ckpt['scheduler_state_dict']:
        scheduler.load_state_dict(ckpt['scheduler_state_dict'])
    return ckpt['epoch'], ckpt['loss']

# Load GPU model onto CPU
model_cpu = nn.Linear(10, 5)
model_cpu.load_state_dict(
    torch.load('model_weights.pt', map_location='cpu', weights_only=True)
)

16. TorchScript and Model Deployment

Deploying trained models to production environments.

import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 5)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = SimpleNet()
model.eval()

# Option 1: torch.jit.script — compile entire model
scripted_model = torch.jit.script(model)
scripted_model.save('model_scripted.pt')
loaded_scripted = torch.jit.load('model_scripted.pt')

x = torch.randn(4, 10)
with torch.no_grad():
    out = loaded_scripted(x)
print(f"TorchScript output: {out.shape}")

# Option 2: torch.jit.trace — trace with example input
example_input = torch.randn(1, 10)
traced_model  = torch.jit.trace(model, example_input)
traced_model.save('model_traced.pt')

# Option 3: ONNX export (cross-framework compatibility)
dummy_input = torch.randn(1, 10)
torch.onnx.export(
    model,
    dummy_input,
    'model.onnx',
    export_params  = True,
    opset_version  = 17,
    input_names    = ['input'],
    output_names   = ['output'],
    dynamic_axes   = {
        'input':  {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)
print("ONNX export complete")

# Option 4: torch.compile (PyTorch 2.0+)
compiled_model = torch.compile(model)
out = compiled_model(x)
print(f"torch.compile output: {out.shape}")

17. Distributed Training (DDP) — DistributedDataParallel

Using multiple GPUs to dramatically accelerate training.

Official tutorial: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

# train_ddp.py — run as a standalone script
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from torchvision import datasets, transforms

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group(
        backend    = 'nccl',
        rank       = rank,
        world_size = world_size
    )

def cleanup():
    dist.destroy_process_group()

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.net(x.view(x.size(0), -1))

def train(rank, world_size, num_epochs=5):
    print(f"Process {rank}/{world_size} starting")
    setup(rank, world_size)

    torch.cuda.set_device(rank)
    device = torch.device(f'cuda:{rank}')

    # Wrap model with DDP
    model     = SimpleModel().to(device)
    ddp_model = DDP(model, device_ids=[rank])

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)

    # DistributedSampler ensures each process sees a unique data shard
    sampler = DistributedSampler(
        dataset,
        num_replicas = world_size,
        rank         = rank,
        shuffle      = True
    )

    loader = DataLoader(
        dataset,
        batch_size   = 128,
        sampler      = sampler,
        num_workers  = 4,
        pin_memory   = True
    )

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(ddp_model.parameters(), lr=0.001)

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # shuffle differently each epoch
        ddp_model.train()
        total_loss = 0.0

        for data, target in loader:
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = ddp_model(data)
            loss   = criterion(output, target)
            loss.backward()   # gradients are automatically all-reduced
            optimizer.step()
            total_loss += loss.item()

        if rank == 0:
            avg_loss = total_loss / len(loader)
            print(f"Epoch {epoch+1}: avg loss = {avg_loss:.4f}")

    cleanup()

if __name__ == '__main__':
    import torch.multiprocessing as mp
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size, 5), nprocs=world_size, join=True)

Launching with torchrun

# Single node, 4 GPUs
torchrun --nproc_per_node=4 train_ddp.py

# Multi-node (node 0 of 2)
torchrun --nnodes=2 --nproc_per_node=4 \
         --node_rank=0 \
         --master_addr="192.168.1.100" \
         --master_port=12355 \
         train_ddp.py

DataParallel vs DistributedDataParallel

# DataParallel (DP): simple but inefficient
# - all gradients funnel through GPU 0 → bottleneck
# - multi-thread, not multi-process
model_dp = nn.DataParallel(model, device_ids=[0, 1, 2, 3])

# DistributedDataParallel (DDP): recommended
# - each GPU computes gradients independently
# - efficient all-reduce synchronization
# - faster than DP even on a single GPU (avoids Python GIL)
model_ddp = DDP(model, device_ids=[rank])

18. Advanced Techniques

Mixed Precision Training

from torch.cuda.amp import autocast, GradScaler

model     = SimpleModel().to('cuda')
optimizer = torch.optim.Adam(model.parameters())
scaler    = GradScaler()

for data, target in train_loader:
    data, target = data.to('cuda'), target.to('cuda')
    optimizer.zero_grad()

    # Forward pass in FP16
    with autocast():
        output = model(data)
        loss   = criterion(output, target)

    # Scaled backward pass
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

Gradient Clipping

# Prevent exploding gradients
max_grad_norm = 1.0
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
optimizer.step()

Reproducibility

import random
import numpy as np

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark     = False

set_seed(42)

Conclusion

This guide covered the core concepts of PyTorch from the ground up, all the way to distributed training in production. Here is a recommended learning roadmap:

Foundations: tensor operations, autograd, simple model implementations
Intermediate: CNN, RNN, Transfer Learning, DataLoader optimization
Advanced: Transformer, DDP, Mixed Precision Training
Deployment: TorchScript, ONNX, torch.compile

The PyTorch ecosystem is continuously evolving. Check the official documentation and PyTorch blog for the latest features and updates.

PyTorch 완전 정복 가이드: Zero to Hero — 텐서부터 분산 학습까지

들어가며

1. 환경 설정

PyTorch 설치

GPU 사용 가능 여부 확인

2. 텐서(Tensor) 기초

텐서 생성

텐서 속성 및 타입 변환

텐서 형태 변환

텐서 연산

브로드캐스팅

인덱싱과 슬라이싱

3. 자동 미분(Autograd)

requires_grad와 연산 그래프

다차원 텐서에서의 역전파

그래디언트 제어

고계 미분

4. nn.Module — 신경망 구축의 기반

Sequential, ModuleList, ModuleDict

5. 선형 회귀 구현

6. 다층 퍼셉트론(MLP) — MNIST 분류

7. 합성곱 신경망(CNN) — CIFAR-10 분류

8. 순환 신경망(RNN/LSTM) — 시계열 처리

9. Transformer 구현 — Multi-head Attention from Scratch

10. 데이터 로딩 — Dataset, DataLoader

11. 옵티마이저 — SGD, Adam, AdamW 비교

12. 학습률 스케줄러

13. 정규화 기법 — Dropout, BatchNorm, LayerNorm

14. 전이학습(Transfer Learning)

15. 모델 저장과 로딩

16. TorchScript와 모델 배포

17. 분산 학습(DDP) — DistributedDataParallel

torchrun으로 실행

DataParallel vs DistributedDataParallel

18. 고급 기법 모음

혼합 정밀도 학습 (Mixed Precision)

그래디언트 클리핑

재현성(Reproducibility) 설정

마치며

참고 자료

PyTorch Complete Guide: Zero to Hero — From Tensors to Distributed Training

Introduction

1. Environment Setup

Installing PyTorch

Verifying GPU Availability

2. Tensor Basics

Creating Tensors

Tensor Attributes and Type Conversion

Reshaping Tensors

Tensor Operations

Broadcasting

Indexing and Slicing

3. Automatic Differentiation (Autograd)

requires_grad and Computational Graph

Gradients for Multi-dimensional Tensors

Gradient Control

Higher-order Gradients

4. nn.Module — The Foundation of Neural Networks

Sequential, ModuleList, ModuleDict

5. Linear Regression from Scratch

6. Multi-Layer Perceptron (MLP) — MNIST Classification

7. Convolutional Neural Network (CNN) — CIFAR-10 Classification

8. Recurrent Neural Networks (RNN / LSTM) — Sequential Data

9. Transformer — Multi-head Attention from Scratch

10. Data Loading — Dataset and DataLoader

11. Optimizers — SGD, Adam, AdamW

12. Learning Rate Schedulers

13. Regularization — Dropout, BatchNorm, LayerNorm

14. Transfer Learning

15. Saving and Loading Models

16. TorchScript and Model Deployment

17. Distributed Training (DDP) — DistributedDataParallel

Launching with torchrun

DataParallel vs DistributedDataParallel

18. Advanced Techniques

Mixed Precision Training

Gradient Clipping

Reproducibility

Conclusion

References