Chaos and Order


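Introduction
1. Environment Setup
- Installing PyTorch
- Checking GPU Availability
2. Tensor Basics
- Creating Tensors
- Tensor Attributes and Type Conversion
- Reshaping Tensors
- Tensor Operations
- Broadcasting
- Indexing and Slicing
3. Automatic Differentiation (Autograd)
- requires_grad and the Computation Graph
- Backpropagation on Multidimensional Tensors
- Controlling Gradients
- Higher-Order Derivatives
4. nn.Module: The Foundation for Building Neural Networks
- Sequential, ModuleList, ModuleDict
5. Implementing Linear Regression
6. Multilayer Perceptron (MLP): MNIST Classification
7. Convolutional Neural Network (CNN): CIFAR-10 Classification
8. Recurrent Neural Networks (RNN/LSTM): Time-Series Processing
9. Transformer Implementation: Multi-head Attention from Scratch
10. Data Loading: Dataset, DataLoader
11. Optimizers: Comparing SGD, Adam, and AdamW
12. Learning Rate Schedulers
13. Regularization and Normalization Techniques: Dropout, BatchNorm, LayerNorm
14. Transfer Learning
15. Saving and Loading Models
16. TorchScript and Model Deployment
17. Distributed Training (DDP): DistributedDataParallel
- Running with torchrun
- DataParallel vs DistributedDataParallel
18. Advanced Techniques
- Mixed Precision Training
- Gradient Clipping
- Reproducibility Settings
Conclusion
- References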

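Introduction

Of the two dominant deep learning frameworks, TensorFlow and PyTorch, PyTorch is the one most widely embraced by both researchers and engineers. Since Facebook AI Research (now Meta AI) released it in 2016, PyTorch has become the de facto standard for implementing academic papers, and it is now widely reported to have overtaken TensorFlow in industry adoption as well.

This guide is for readers with basic Python knowledge and works systematically from first contact with PyTorch through distributed training. Each section includes runnable code examples and links to the official documentation, so you can read and practice right away.

Official documentation: https://pytorch.org/docs/stable/index.html
Official tutorials: https://pytorch.org/tutorials/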

1. Environment Setup

Installing PyTorch

PyTorch is installed with pip or conda. To use a GPU, choose the package that matches your CUDA version.

Install with pip (CUDA 12.1):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Install with conda:

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

CPU-only install:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

Checking GPU Availability

import torch

# Check the PyTorch version
print(f"PyTorch version: {torch.__version__}")

# Check whether CUDA is available
print(f"CUDA available: {torch.cuda.is_available()}")

# Check GPU count, name, and memory
if torch.cuda.is_available():
    print(f"GPU count: {torch.cuda.device_count()}")
    print(f"Current GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Check for MPS on Apple Silicon (M1/M2/M3)
print(f"MPS available: {torch.backends.mps.is_available()}")

# Select the device automatically
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Using device: {device}")

With installation and device checks done, let's look at the tensor, PyTorch's core data structure.

2. Tensor Basics

The tensor is PyTorch's core data structure. It resembles NumPy's ndarray, but it can run computations on a GPU and supports automatic differentiation.

Official documentation: https://pytorch.org/docs/stable/tensors.html

Creating Tensors

import torch
import numpy as np

# Create directly from data
t1 = torch.tensor([1, 2, 3, 4, 5])
print(f"1D tensor: {t1}, shape: {t1.shape}, dtype: {t1.dtype}")

# 2D tensor (matrix)
t2 = torch.tensor([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
print(f"2D tensor:\n{t2}, shape: {t2.shape}")

# Special tensors
zeros = torch.zeros(3, 4)          # all zeros
ones = torch.ones(2, 3)            # all ones
rand = torch.rand(3, 3)            # uniform on [0, 1)
randn = torch.randn(3, 3)          # standard normal
eye = torch.eye(4)                  # identity matrix
arange = torch.arange(0, 10, 2)    # [0, 2, 4, 6, 8]
linspace = torch.linspace(0, 1, 5) # 5 evenly spaced values

print(f"zeros:\n{zeros}")
print(f"randn:\n{randn}")

# Create with the same shape as an existing tensor
t3 = torch.zeros_like(t2)
t4 = torch.ones_like(t2)
t5 = torch.rand_like(t2)

# Create from a NumPy array (shares memory)
np_arr = np.array([1.0, 2.0, 3.0])
t_from_np = torch.from_numpy(np_arr)
print(f"From NumPy: {t_from_np}")

# Convert a tensor to NumPy (CPU tensors only)
np_from_t = t1.numpy()
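
The memory-sharing behavior noted above is easy to check directly. The following is a minimal sketch, assuming a CPU tensor; the names `shared` and `copied` are illustrative and not part of the original listing.

```python
import numpy as np
import torch

np_arr = np.array([1.0, 2.0, 3.0])
shared = torch.from_numpy(np_arr)   # shares the NumPy buffer
copied = torch.tensor(np_arr)       # always copies the data

np_arr[0] = 100.0                   # mutate the NumPy array in place

print(shared)  # reflects the change made through NumPy
print(copied)  # still holds the original values
```

If independent data is needed, `torch.tensor(...)` or `.clone()` avoids surprises from this aliasing.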

Tensor Attributes and Type Conversion

t = torch.rand(3, 4, 5)

# Basic attributes
print(f"shape: {t.shape}")       # torch.Size([3, 4, 5])
print(f"ndim: {t.ndim}")         # 3
print(f"dtype: {t.dtype}")       # torch.float32
print(f"device: {t.device}")     # cpu
print(f"numel: {t.numel()}")     # 60 (total number of elements)

# Data type conversion
t_int = t.to(torch.int32)
t_long = t.long()           # torch.int64
t_float = t.float()         # torch.float32
t_double = t.double()       # torch.float64
t_half = t.half()           # torch.float16

# Move to the GPU
if torch.cuda.is_available():
    t_gpu = t.to("cuda")
    t_gpu2 = t.cuda()        # same result
    t_back = t_gpu.cpu()     # back to the CPU

Reshaping Tensors

t = torch.arange(24)  # 1D tensor holding 0 through 23

# reshape: any shape with the same number of elements is allowed
t_2d = t.reshape(4, 6)
t_3d = t.reshape(2, 3, 4)
t_auto = t.reshape(6, -1)  # -1 is inferred automatically (6x4)

# view: like reshape, but never copies, so the memory layout must be compatible
t_view = t.view(3, 8)

# squeeze/unsqueeze: remove/add dimensions
t = torch.zeros(1, 3, 1, 4)
print(f"Original shape: {t.shape}")  # [1, 3, 1, 4]

t_sq = t.squeeze()       # remove all size-1 dimensions → [3, 4]
t_sq1 = t.squeeze(0)     # remove only dimension 0 → [3, 1, 4]
t_unsq = t_sq.unsqueeze(0)  # add a dimension at position 0 → [1, 3, 4]

# transpose/permute: reorder dimensions
t = torch.rand(2, 3, 4)
t_T = t.transpose(0, 1)    # [3, 2, 4]
t_perm = t.permute(2, 0, 1)  # [4, 2, 3]

# contiguous: guarantee contiguous memory after permute
t_cont = t_perm.contiguous()

print(f"squeeze: {t_sq.shape}")
print(f"permute: {t_perm.shape}")
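
The difference between view and reshape shows up once a tensor is no longer contiguous. A small sketch, using an illustrative 2x3 tensor rather than the shapes above:

```python
import torch

t = torch.arange(6).reshape(2, 3)
t_T = t.transpose(0, 1)            # shape [3, 2], a non-contiguous view

print(t_T.is_contiguous())         # False

try:
    t_T.view(6)                    # view requires compatible strides
    view_failed = False
except RuntimeError:
    view_failed = True

flat_a = t_T.reshape(6)            # reshape copies when it has to
flat_b = t_T.contiguous().view(6)  # explicit copy, then view
print(view_failed, flat_a.tolist())
```

In practice `reshape` is the safer default; `view` is useful when a silent copy would be a bug.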

Tensor Operations

a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([[5.0, 6.0], [7.0, 8.0]])

# Arithmetic (element-wise)
print(a + b)     # or torch.add(a, b)
print(a - b)     # or torch.sub(a, b)
print(a * b)     # element-wise product (Hadamard product)
print(a / b)     # element-wise division
print(a ** 2)    # element-wise power

# Matrix multiplication
matmul = a @ b          # or torch.matmul(a, b)
mm = torch.mm(a, b)     # 2D only

print(f"Matrix product:\n{matmul}")

# Reductions
t = torch.rand(3, 4)
print(f"Sum: {t.sum()}")
print(f"Mean: {t.mean()}")
print(f"Max: {t.max()}")
print(f"Min: {t.min()}")
print(f"Std: {t.std()}")

# Reductions along an axis
print(f"Column sums: {t.sum(dim=0)}")  # collapses the row axis → shape [4]
print(f"Row sums: {t.sum(dim=1)}")     # collapses the column axis → shape [3]
print(f"keepdim:\n{t.sum(dim=1, keepdim=True)}")

# argmax/argmin
print(f"Index of the max (flattened): {t.argmax()}")
print(f"Index of the max in each row: {t.argmax(dim=1)}")

Broadcasting

PyTorch follows the same broadcasting rules as NumPy. When tensors of different shapes are combined, the smaller one is expanded automatically.

# Broadcasting example
a = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])   # shape: [2, 3]
b = torch.tensor([10, 20, 30])  # shape: [3]

# b is expanded to [2, 3] before the addition
print(a + b)
# tensor([[11, 22, 33],
#         [14, 25, 36]])

# Scalar operations broadcast too
print(a * 2)   # multiply every element by 2
print(a + 100) # add 100 to every element

# Column vector + row vector
col = torch.tensor([[1], [2], [3]])  # shape: [3, 1]
row = torch.tensor([10, 20, 30])      # shape: [3]
print(col + row)  # shape: [3, 3]; every pairwise sum, like an outer product with +
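
The rule behind these examples is that shapes are aligned from the trailing dimension, and each pair of sizes must either match or contain a 1. The sketch below checks that rule with `torch.broadcast_shapes`; the mismatch case at the end is an added illustration, not part of the original examples.

```python
import torch

# Shapes are compared right to left; sizes must be equal or 1.
print(torch.broadcast_shapes((2, 3), (3,)))   # torch.Size([2, 3])
print(torch.broadcast_shapes((3, 1), (3,)))   # torch.Size([3, 3])

col = torch.tensor([[1], [2], [3]])   # shape [3, 1]
row = torch.tensor([10, 20, 30])      # shape [3]
outer_sum = col + row                 # out[i, j] = col[i] + row[j]
print(outer_sum)

# Incompatible shapes raise an error instead of expanding silently.
try:
    torch.ones(2, 3) + torch.ones(2)
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True
print(mismatch_raised)
```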

Indexing and Slicing

t = torch.arange(24).reshape(2, 3, 4).float()

# Basic indexing
print(t[0])        # first matrix (shape: [3, 4])
print(t[0, 1])     # second row of that [3, 4] matrix (shape: [4])
print(t[0, 1, 2])  # scalar

# Slicing
print(t[:, 1:, :2])  # all matrices, rows from index 1 on, first 2 columns

# Advanced indexing (fancy indexing)
indices = torch.tensor([0, 2])
print(t[:, indices, :])  # select only rows 0 and 2

# Conditional indexing (Boolean masking)
mask = t > 10
print(t[mask])  # only elements greater than 10 (returns a 1D tensor)

# where: choose between two tensors based on a condition
a = torch.tensor([1.0, 2.0, 3.0, 4.0])
b = torch.tensor([10.0, 20.0, 30.0, 40.0])
condition = a > 2
result = torch.where(condition, b, a)
print(result)  # tensor([ 1.,  2., 30., 40.])

3. Automatic Differentiation (Autograd)

Autograd, one of PyTorch's core features, builds the computation graph automatically and computes gradients through backpropagation.

Official documentation: https://pytorch.org/docs/stable/autograd.html

requires_grad and the Computation Graph

import torch

# Creating tensors with requires_grad=True starts operation tracking
x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)

# Running operations builds the computation graph
z = x ** 2 + 2 * x * y + y ** 2  # (x + y)^2

print(f"z = {z}")  # z = 49.0

# Run backpropagation
z.backward()

# Inspect the gradients
# dz/dx = 2x + 2y = 2*3 + 2*4 = 14
print(f"dz/dx = {x.grad}")  # 14.0

# dz/dy = 2x + 2y = 14
print(f"dz/dy = {y.grad}")  # 14.0

Backpropagation on Multidimensional Tensors

# Gradient of a vector function
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2  # [1, 4, 9]
z = y.sum()  # reduce to a scalar

z.backward()
print(f"x.grad: {x.grad}")  # [2, 4, 6] (dz/dx = 2x)

# The gradient argument: backward on a non-scalar
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2  # [1, 4, 9]

# y.backward() alone raises an error because y is not a scalar
# The gradient argument supplies the weights of a weighted sum (vector-Jacobian product)
grad_output = torch.tensor([1.0, 1.0, 1.0])  # weight for each element
y.backward(gradient=grad_output)
print(f"x.grad: {x.grad}")  # [2, 4, 6]
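
With all-ones weights, the gradient argument is indistinguishable from calling `sum()` first. Non-uniform weights make the vector-Jacobian product visible. A minimal sketch, where the weight vector `v` is an arbitrary illustrative choice:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2                              # Jacobian is diag(2x)

v = torch.tensor([1.0, 0.5, 0.0])       # illustrative weights
y.backward(gradient=v)                  # computes v^T J = v * 2x
print(x.grad)

# Equivalent scalar formulation: differentiate the weighted sum
x2 = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
(v * x2 ** 2).sum().backward()
print(x2.grad)
```

Both paths produce the same gradient, which is why `backward(gradient=v)` can be read as "backpropagate the scalar `(v * y).sum()`".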

Controlling Gradients

# Gradients accumulate across backward calls, so they must be reset
x = torch.tensor(2.0, requires_grad=True)

for i in range(3):
    y = x ** 2
    y.backward()
    print(f"iteration {i}: x.grad = {x.grad}")
    # Without this reset, the gradient keeps accumulating
    x.grad.zero_()  # in-place reset

# no_grad: disable gradient tracking for inference (saves memory)
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

with torch.no_grad():
    y = x ** 2  # no computation graph is built
    print(f"y.requires_grad: {y.requires_grad}")  # False

# detach: separate a tensor from the computation graph
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * 2
z = y.detach()  # stops gradient tracking
print(f"z.requires_grad: {z.requires_grad}")  # False

# Freezing parameters (useful for transfer learning)
model = torch.nn.Linear(4, 2)  # placeholder model for illustration
for param in model.parameters():
    param.requires_grad = False
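Higher-Order Derivatives

# Second derivative example
x = torch.tensor(3.0, requires_grad=True)
y = x ** 4

# First derivative: dy/dx = 4x^3
dy_dx = torch.autograd.grad(y, x, create_graph=True)[0]
print(f"First derivative: {dy_dx}")  # 4 * 27 = 108

# Second derivative: d2y/dx2 = 12x^2
d2y_dx2 = torch.autograd.grad(dy_dx, x)[0]
print(f"Second derivative: {d2y_dx2}")  # 12 * 9 = 108 (equal to the first at x = 3 by coincidence)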

4. nn.Module: The Foundation for Building Neural Networks

torch.nn.Module is the base class for every PyTorch model. Layers, activation functions, and entire models all inherit from it.

Official documentation: https://pytorch.org/docs/stable/nn.html

import torch
import torch.nn as nn

# Define a simple model
class SimpleModel(nn.Module):
    def __init__(self, in_features, hidden_size, out_features):
        super().__init__()
        # Define layers (their parameters are registered automatically)
        self.fc1 = nn.Linear(in_features, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, out_features)
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, x):
        # Define the forward pass
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

# Create a model instance
model = SimpleModel(784, 256, 10)
print(model)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

# Access parameters by name
for name, param in model.named_parameters():
    print(f"{name}: {param.shape}")

# Run a forward pass
x = torch.randn(32, 784)  # batch_size=32, features=784
output = model(x)
print(f"Output shape: {output.shape}")  # [32, 10]

Sequential, ModuleList, ModuleDict

# Sequential: layers applied in order
seq_model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

# ModuleList: manage layers as a list
class ResidualBlock(nn.Module):
    def __init__(self, num_blocks, hidden_size):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(hidden_size, hidden_size)
            for _ in range(num_blocks)
        ])
        self.relu = nn.ReLU()

    def forward(self, x):
        for layer in self.layers:
            x = self.relu(layer(x)) + x  # residual connection
        return x

# ModuleDict: manage layers as a dictionary
class MultiTaskModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(784, 256)
        self.heads = nn.ModuleDict({
            'classification': nn.Linear(256, 10),
            'regression': nn.Linear(256, 1)
        })

    def forward(self, x, task='classification'):
        features = torch.relu(self.backbone(x))
        return self.heads[task](features)
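
The reason to use nn.ModuleList rather than a plain Python list is parameter registration. The sketch below uses two hypothetical classes, `PlainList` and `Registered`, to show what happens to `parameters()` in each case.

```python
import torch.nn as nn

class PlainList(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain Python list is invisible to nn.Module
        self.layers = [nn.Linear(4, 4) for _ in range(3)]

class Registered(nn.Module):
    def __init__(self):
        super().__init__()
        # ModuleList registers each layer as a submodule
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])

n_plain = sum(p.numel() for p in PlainList().parameters())
n_registered = sum(p.numel() for p in Registered().parameters())
print(n_plain, n_registered)  # 0 60
```

Layers held in a plain list would be skipped by the optimizer, by `.to(device)`, and by `state_dict()`, which is usually not what is intended.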

5. Implementing Linear Regression

Linear regression is one of the simplest models to start with. Implementing it from scratch shows how the PyTorch training loop works.

import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

# Generate data
torch.manual_seed(42)
n_samples = 200

# y = 3x + 2 + noise
X = torch.linspace(-5, 5, n_samples).unsqueeze(1)  # [200, 1]
y_true = 3 * X + 2
y = y_true + torch.randn_like(y_true) * 0.5       # add noise

# Define the model
class LinearRegression(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

model = LinearRegression()

# Loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training loop
n_epochs = 1000
losses = []

for epoch in range(n_epochs):
    # 1. Forward pass
    y_pred = model(X)

    # 2. Compute the loss
    loss = criterion(y_pred, y)
    losses.append(loss.item())

    # 3. Reset gradients (important!)
    optimizer.zero_grad()

    # 4. Backward pass
    loss.backward()

    # 5. Update parameters
    optimizer.step()

    if (epoch + 1) % 200 == 0:
        w = model.linear.weight.item()
        b = model.linear.bias.item()
        print(f"Epoch {epoch+1}: Loss={loss.item():.4f}, w={w:.4f}, b={b:.4f}")

# Check the result
print(f"\nLearned weight: {model.linear.weight.item():.4f} (target: 3.0)")
print(f"Learned bias: {model.linear.bias.item():.4f} (target: 2.0)")
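
For this particular problem the optimum is also available in closed form, which gives a useful sanity check on the training loop. The sketch below regenerates the same kind of data and solves the least-squares problem with `torch.linalg.lstsq`; it is an added cross-check under that assumption, not part of the original training code.

```python
import torch

torch.manual_seed(42)
X = torch.linspace(-5, 5, 200).unsqueeze(1)          # [200, 1]
y = 3 * X + 2 + torch.randn_like(X) * 0.5            # y = 3x + 2 + noise

# Design matrix with a bias column: each row is [x, 1]
A = torch.cat([X, torch.ones_like(X)], dim=1)        # [200, 2]
solution = torch.linalg.lstsq(A, y).solution         # [2, 1]
w_ls, b_ls = solution[0].item(), solution[1].item()
print(f"least squares: w={w_ls:.4f}, b={b_ls:.4f}")
```

The values found by SGD should approach this solution as training proceeds; a large gap usually points to a learning-rate or data-shape problem.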

6. Multilayer Perceptron (MLP): MNIST Classification

This section builds a complete classification model on the MNIST handwritten digit dataset.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Hyperparameters
BATCH_SIZE = 64
LEARNING_RATE = 0.001
N_EPOCHS = 10
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Data preprocessing and loading
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean and standard deviation
])

train_dataset = datasets.MNIST(root='./data', train=True,
                                download=True, transform=transform)
test_dataset = datasets.MNIST(root='./data', train=False,
                               transform=transform)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
                          shuffle=True, num_workers=2)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                         shuffle=False, num_workers=2)

# Define the MLP model
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Flatten(),              # 28x28 → 784
            nn.Linear(784, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.network(x)

model = MLP().to(DEVICE)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

# Training function
def train_epoch(model, loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0

    for batch_idx, (data, target) in enumerate(loader):
        data, target = data.to(device), target.to(device)

        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        pred = output.argmax(dim=1)
        correct += pred.eq(target).sum().item()
        total += target.size(0)

    return total_loss / len(loader), 100.0 * correct / total

# Evaluation function
def evaluate(model, loader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            total_loss += criterion(output, target).item()
            pred = output.argmax(dim=1)
            correct += pred.eq(target).sum().item()
            total += target.size(0)

    return total_loss / len(loader), 100.0 * correct / total

# Run training
for epoch in range(N_EPOCHS):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, DEVICE)
    test_loss, test_acc = evaluate(model, test_loader, criterion, DEVICE)
    print(f"Epoch {epoch+1}/{N_EPOCHS} | "
          f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}% | "
          f"Test Loss: {test_loss:.4f}, Test Acc: {test_acc:.2f}%")

7. Convolutional Neural Network (CNN): CIFAR-10 Classification

This section implements a CNN, the workhorse of image classification, and trains it on the CIFAR-10 dataset.

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Prepare the CIFAR-10 data
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2023, 0.1994, 0.2010))
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2023, 0.1994, 0.2010))
])

train_data = datasets.CIFAR10('./data', train=True, download=True, transform=transform_train)
test_data = datasets.CIFAR10('./data', train=False, transform=transform_test)

train_loader = DataLoader(train_data, batch_size=128, shuffle=True, num_workers=4)
test_loader = DataLoader(test_data, batch_size=128, shuffle=False, num_workers=4)

CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']

# Define the CNN model (VGG style)
class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Feature extractor
        self.features = nn.Sequential(
            # Block 1: 3 → 64
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),      # 32x32 → 16x16
            nn.Dropout2d(0.1),

            # Block 2: 64 → 128
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),      # 16x16 → 8x8
            nn.Dropout2d(0.2),

            # Block 3: 128 → 256
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),      # 8x8 → 4x4
        )

        # Classifier
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 4 * 4, 1024),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(1024, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = CNN().to(DEVICE)
print(f"Model parameter count: {sum(p.numel() for p in model.parameters()):,}")
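
The `256 * 4 * 4` input size of the first Linear layer follows from the standard output-size formula for convolution and pooling, out = floor((in + 2 * padding - kernel) / stride) + 1. The helper `conv2d_out` below is a hypothetical name used only for this sketch, and it assumes dilation 1.

```python
def conv2d_out(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1

size = 32  # CIFAR-10 images are 32x32
for block in range(3):
    size = conv2d_out(size, kernel_size=3, padding=1)   # 3x3 conv, padding 1: size unchanged
    size = conv2d_out(size, kernel_size=3, padding=1)
    size = conv2d_out(size, kernel_size=2, stride=2)    # 2x2 max-pool: size halved
    print(f"after block {block + 1}: {size}x{size}")

flatten_features = 256 * size * size   # channels of the last block times spatial area
print(flatten_features)  # 4096
```

If the input resolution or the number of pooling stages changes, the first Linear layer has to be resized accordingly.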

8. Recurrent Neural Networks (RNN/LSTM): Time-Series Processing

This section implements an RNN and an LSTM, which are well suited to time-series data and text.

import torch
import torch.nn as nn
import numpy as np

# LSTM-based time-series prediction model
class LSTMPredictor(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2,
                 output_size=1, dropout=0.2):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # LSTM layer
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,     # input: [batch, seq_len, features]
            dropout=dropout if num_layers > 1 else 0,
            bidirectional=False
        )

        # Output layer
        self.fc = nn.Sequential(
            nn.Linear(hidden_size, 32),
            nn.ReLU(),
            nn.Linear(32, output_size)
        )

    def forward(self, x):
        # x shape: [batch_size, seq_len, input_size]
        batch_size = x.size(0)

        # Initial hidden/cell state (initialized to zeros)
        h0 = torch.zeros(self.num_layers, batch_size,
                         self.hidden_size).to(x.device)
        c0 = torch.zeros(self.num_layers, batch_size,
                         self.hidden_size).to(x.device)

        # LSTM forward pass
        # out: [batch_size, seq_len, hidden_size]
        out, (hn, cn) = self.lstm(x, (h0, c0))

        # Use only the output at the last time step
        out = self.fc(out[:, -1, :])  # [batch_size, output_size]
        return out

# Example with noisy sine-wave data
t = np.linspace(0, 100, 1000)
data = np.sin(0.5 * t) + 0.1 * np.random.randn(1000)
data = torch.FloatTensor(data).unsqueeze(1)

# Function that builds sequence data
def create_sequences(data, seq_len=50):
    X, y = [], []
    for i in range(len(data) - seq_len):
        X.append(data[i:i+seq_len])
        y.append(data[i+seq_len])
    return torch.stack(X), torch.stack(y)

X, y = create_sequences(data, seq_len=50)
print(f"X shape: {X.shape}")  # [950, 50, 1]
print(f"y shape: {y.shape}")  # [950, 1]

# GRU: a variant with fewer parameters than LSTM
class GRUPredictor(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, num_layers,
                          batch_first=True, dropout=0.2)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.gru(x)
        return self.fc(out[:, -1, :])
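
The claim that a GRU has fewer parameters than an LSTM can be made concrete by counting. The sketch below counts one recurrent layer, assuming PyTorch's parameter layout (one input weight block, one hidden weight block, and two bias vectors per gate); `rnn_layer_params` is a hypothetical helper written for this illustration.

```python
def rnn_layer_params(input_size, hidden_size, num_gates):
    """Parameters in a single recurrent layer under the assumed layout."""
    per_gate = (hidden_size * input_size      # input-to-hidden weights
                + hidden_size * hidden_size   # hidden-to-hidden weights
                + 2 * hidden_size)            # two bias vectors
    return num_gates * per_gate

# LSTM uses 4 gate blocks, GRU uses 3
lstm_params = rnn_layer_params(1, 64, num_gates=4)
gru_params = rnn_layer_params(1, 64, num_gates=3)
print(lstm_params, gru_params)   # 17152 12864
```

For the first layer of the models above (input_size=1, hidden_size=64), the GRU needs three quarters of the LSTM's parameters.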

9. Transformer Implementation: Multi-head Attention from Scratch

This section implements the core building blocks of the paper "Attention Is All You Need" from scratch.

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # dimension per head

        # Q, K, V, and output projections
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout)
        self.scale = math.sqrt(self.d_k)

    def split_heads(self, x):
        # x: [batch, seq, d_model] → [batch, num_heads, seq, d_k]
        batch, seq, _ = x.shape
        x = x.view(batch, seq, self.num_heads, self.d_k)
        return x.transpose(1, 2)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # 1. Linear projections of Q, K, V and split into heads
        Q = self.split_heads(self.W_q(query))  # [B, H, Sq, dk]
        K = self.split_heads(self.W_k(key))    # [B, H, Sk, dk]
        V = self.split_heads(self.W_v(value))  # [B, H, Sk, dk]

        # 2. Scaled Dot-Product Attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        # scores: [B, H, Sq, Sk]

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # 3. Weighted sum of the values
        context = torch.matmul(attn_weights, V)  # [B, H, Sq, dk]

        # 4. Merge the heads
        context = context.transpose(1, 2).contiguous()
        context = context.view(batch_size, -1, self.d_model)

        # 5. Output projection
        output = self.W_o(context)
        return output, attn_weights

class FeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.ff = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Post-Norm, as in the original paper (Pre-Norm would normalize before each sublayer)
        attn_out, _ = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_out))

        ff_out = self.ff(x)
        x = self.norm2(x + self.dropout(ff_out))
        return x

class PositionalEncoding(nn.Module):
    def __init__(self, d_model=512, max_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

        # Compute the positional encoding
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # [1, max_len, d_model]
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

# Usage example
d_model = 512
encoder_layer = TransformerEncoderLayer(d_model=d_model, num_heads=8)
pos_enc = PositionalEncoding(d_model=d_model)

x = torch.randn(2, 10, d_model)  # [batch=2, seq=10, d_model=512]
x = pos_enc(x)
output = encoder_layer(x)
print(f"Transformer Encoder output: {output.shape}")  # [2, 10, 512]
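
The `mask` argument in the attention code treats 0 as "blocked" and fills those scores with negative infinity. A causal (look-ahead) mask is one common instance of that convention. The standalone sketch below builds one with `torch.tril`; the tiny shapes and random inputs are illustrative assumptions.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_k = 4, 8
Q = torch.randn(1, 1, seq_len, d_k)   # [batch, heads, seq, d_k]
K = torch.randn(1, 1, seq_len, d_k)

# 1 = may attend, 0 = blocked; position i only sees positions <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
scores = scores.masked_fill(causal_mask == 0, float('-inf'))
attn = F.softmax(scores, dim=-1)

print(attn[0, 0])   # upper triangle is exactly zero
```

The [seq, seq] mask broadcasts over the batch and head dimensions, so the same tensor can be passed to every head.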

10. Data Loading: Dataset, DataLoader

An efficient data pipeline directly affects training speed.

Official tutorial: https://pytorch.org/tutorials/beginner/basics/intro.html

import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
from PIL import Image
import os

# Custom Dataset implementation
class CustomImageDataset(Dataset):
    def __init__(self, csv_file, img_dir, transform=None):
        """
        csv_file: CSV with image paths and labels
        img_dir: root directory of the images
        transform: torchvision transforms
        """
        self.annotations = pd.read_csv(csv_file)
        self.img_dir = img_dir
        self.transform = transform

    def __len__(self):
        # Return the dataset size (required)
        return len(self.annotations)

    def __getitem__(self, idx):
        # Return the sample at the given index (required)
        img_path = os.path.join(self.img_dir, self.annotations.iloc[idx, 0])
        image = Image.open(img_path).convert('RGB')
        label = int(self.annotations.iloc[idx, 1])

        if self.transform:
            image = self.transform(image)

        return image, label

# Dataset for numeric (tabular) data
class TabularDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.FloatTensor(X)
        self.y = torch.LongTensor(y)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Advanced DataLoader usage
dataset = TabularDataset(
    X=np.random.randn(1000, 20),
    y=np.random.randint(0, 5, 1000)
)

# Basic DataLoader
basic_loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Advanced settings
advanced_loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,        # parallel loading (tune to the number of CPU cores)
    pin_memory=True,      # faster host-to-GPU transfer (when using CUDA)
    drop_last=True,       # drop the final incomplete batch
    prefetch_factor=2,    # batches prefetched per worker
    persistent_workers=True  # keep worker processes alive between epochs
)

# Inspect one batch
for batch_X, batch_y in advanced_loader:
    print(f"Batch X: {batch_X.shape}")  # [64, 20]
    print(f"Batch y: {batch_y.shape}")  # [64]
    break

# WeightedRandomSampler: handling class imbalance
from torch.utils.data import WeightedRandomSampler

# Samples per class, counted from the labels (5 classes in this dataset)
class_counts = torch.bincount(dataset.y)
weights = 1.0 / class_counts.float()
# Assign each sample the weight of its class
sample_weights = weights[dataset.y]

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(dataset),
    replacement=True
)

balanced_loader = DataLoader(dataset, batch_size=32, sampler=sampler)
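
Whether inverse-frequency weights actually balance the classes can be checked by drawing from the sampler and counting labels. The sketch below uses a toy label vector with an 800/150/50 split; the labels and the number of draws are assumptions made for the illustration.

```python
import torch
from torch.utils.data import WeightedRandomSampler

torch.manual_seed(0)
# Imbalanced toy labels: 800 of class 0, 150 of class 1, 50 of class 2
labels = torch.repeat_interleave(torch.arange(3), torch.tensor([800, 150, 50]))

class_counts = torch.bincount(labels)                 # tensor([800, 150, 50])
sample_weights = (1.0 / class_counts.float())[labels] # one weight per sample

sampler = WeightedRandomSampler(sample_weights, num_samples=3000,
                                replacement=True)
drawn = labels[torch.tensor(list(sampler))]
drawn_counts = torch.bincount(drawn, minlength=3)
print(drawn_counts)   # each class is drawn roughly 1000 times
```

Because sampling is with replacement, minority-class samples are repeated many times per epoch, so this pairs best with data augmentation.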

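11. Optimizers: Comparing SGD, Adam, and AdamW

Official documentation: https://pytorch.org/docs/stable/optim.html

from collections import OrderedDict

import torch
import torch.nn as nn
import torch.optim as optim

# Example model
model = nn.Linear(100, 10)

# SGD (Stochastic Gradient Descent)
sgd = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,      # keeps part of the previous update direction
    weight_decay=1e-4, # L2 regularization
    nesterov=True      # Nesterov momentum
)

# Adam: adaptive learning rates
adam = optim.Adam(
    model.parameters(),
    lr=0.001,
    betas=(0.9, 0.999),  # decay rates for the 1st and 2nd moment estimates
    eps=1e-8,
    weight_decay=0
)

# AdamW: Adam with decoupled weight decay
# Note: Adam's weight_decay adds an L2 term to the gradient, so the decay is
# rescaled by the adaptive step; AdamW instead shrinks the weights directly.
# AdamW is commonly recommended for Transformer-style models.
adamw = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    weight_decay=0.01   # applied to the weights, separate from the gradient update
)

# RMSprop: often used with recurrent networks
rmsprop = optim.RMSprop(
    model.parameters(),
    lr=0.01,
    alpha=0.99,
    momentum=0.0
)

# Different learning rates per parameter group (useful for transfer learning)
# A model with `features` and `classifier` submodules is assumed here;
# this small stand-in replaces the single Linear layer above.
model = nn.Sequential(OrderedDict([
    ('features', nn.Linear(100, 50)),
    ('classifier', nn.Linear(50, 10)),
]))
optimizer = optim.Adam([
    {'params': model.features.parameters(), 'lr': 1e-4},  # backbone: lower LR
    {'params': model.classifier.parameters(), 'lr': 1e-3} # head: higher LR
], lr=1e-3)

# Save the model and optimizer state
checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'epoch': 10
}
torch.save(checkpoint, 'checkpoint.pt')

# Restore
ckpt = torch.load('checkpoint.pt')
model.load_state_dict(ckpt['model'])
optimizer.load_state_dict(ckpt['optimizer'])
start_epoch = ckpt['epoch']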
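
The practical difference between Adam's weight_decay and AdamW can be seen on a single scalar weight. The sketch below is a simplified model of the very first update step only, where the bias-corrected moments reduce to m_hat = grad and v_hat = grad²; `adam_first_step` is a hypothetical helper and the numbers are arbitrary.

```python
import math

def adam_first_step(w, grad, lr=0.1, eps=1e-8):
    """First Adam update for one scalar weight: the step is ~lr * sign(grad)."""
    return w - lr * grad / (math.sqrt(grad ** 2) + eps)

w, grad, lr, wd = 10.0, 0.5, 0.1, 0.01

# L2 style (Adam with weight_decay): the decay term is folded into the gradient,
# then normalized away by the adaptive scaling
w_l2 = adam_first_step(w, grad + wd * w, lr)

# Decoupled style (AdamW): shrink the weight directly, then take the Adam step
w_decoupled = adam_first_step(w * (1 - lr * wd), grad, lr)

print(w_l2, w_decoupled)   # about 9.9 and 9.89
```

In this toy case the L2 term changes nothing beyond the sign of the step, while the decoupled version still shrinks the weight by `lr * wd * w`. Later steps are less extreme, but the same rescaling effect is the motivation for AdamW.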

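12. Learning Rate Schedulers

A scheduler often improves results over a fixed learning rate, although the best schedule depends on the model and data.

import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import (
    StepLR, MultiStepLR, ExponentialLR,
    CosineAnnealingLR, OneCycleLR,
    ReduceLROnPlateau, CosineAnnealingWarmRestarts
)

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Note: the schedulers below share one optimizer purely to show their constructors.
# In real training, attach a single scheduler (or a deliberate chain) per optimizer.

# StepLR: multiply the LR by gamma every step_size epochs
step_scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
# epochs 0-29: lr=0.1, 30-59: lr=0.01, 60-89: lr=0.001

# MultiStepLR: decay at the listed epochs
multi_scheduler = MultiStepLR(optimizer, milestones=[50, 100, 150], gamma=0.1)

# CosineAnnealingLR: decay along a cosine curve
cosine_scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# ReduceLROnPlateau: decay when the validation loss stops improving (a practical default)
plateau_scheduler = ReduceLROnPlateau(
    optimizer,
    mode='min',       # the monitored value (loss) should decrease
    factor=0.5,       # multiplicative decay factor
    patience=10,      # epochs without improvement to tolerate
    min_lr=1e-7       # the verbose flag is deprecated in recent releases, so it is omitted
)

# OneCycleLR: fast convergence (super-convergence)
one_cycle = OneCycleLR(
    optimizer,
    max_lr=0.01,
    steps_per_epoch=100,  # len(train_loader)
    epochs=30,
    pct_start=0.3,        # spend 30% of the run on warm-up
    anneal_strategy='cos'
)

# CosineAnnealingWarmRestarts: periodic resets with warm restarts
warm_restart = CosineAnnealingWarmRestarts(
    optimizer,
    T_0=10,    # epochs until the first restart
    T_mult=2,  # the period grows by T_mult after each restart
    eta_min=1e-6
)

# Using schedulers in the training loop
for epoch in range(100):
    # ... train for one epoch here, calling optimizer.step() for each batch ...
    val_loss = 0.5  # placeholder for the real validation loss

    # Most schedulers: step once per epoch, after the optimizer updates
    cosine_scheduler.step()

    # ReduceLROnPlateau: pass the monitored validation metric
    plateau_scheduler.step(val_loss)

    # OneCycleLR: step once per batch
    # for batch in loader:
    #     ...
    #     one_cycle.step()

    print(f"Epoch {epoch+1}: LR = {optimizer.param_groups[0]['lr']:.6f}")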
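
The shape of the two most common schedules can be written as closed-form functions of the epoch, which helps when choosing step_size or T_max before training. The sketch below is plain Python, not the library implementation; `step_lr` and `cosine_lr` are hypothetical helpers, and the cosine form is assumed to hold for epochs between 0 and T_max.

```python
import math

def step_lr(base_lr, epoch, step_size=30, gamma=0.1):
    """StepLR-style schedule: multiply by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

def cosine_lr(base_lr, epoch, t_max=100, eta_min=1e-6):
    """Cosine annealing from base_lr down to eta_min over t_max epochs."""
    cos_term = (1 + math.cos(math.pi * epoch / t_max)) / 2
    return eta_min + (base_lr - eta_min) * cos_term

for epoch in (0, 29, 30, 60):
    print(epoch, step_lr(0.1, epoch))

for epoch in (0, 50, 100):
    print(epoch, cosine_lr(0.1, epoch))
```

Plotting these functions over the planned number of epochs is a cheap way to confirm that the learning rate ends where it is expected to.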

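13. Regularization and Normalization Techniques: Dropout, BatchNorm, LayerNorm

This section summarizes techniques that reduce overfitting and stabilize training.

import torch
import torch.nn as nn

# Dropout: randomly deactivates neurons during training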
class DropoutDemo(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(100, 50)
        self.dropout = nn.Dropout(p=0.5)  # drops 50% of activations
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)  # active only in training; a no-op in eval mode
        return self.fc2(x)

# model.train() → Dropout enabled
# model.eval() → Dropout disabled

# BatchNorm1d: mini-batch normalization (after FC layers)
bn_model = nn.Sequential(
    nn.Linear(100, 64),
    nn.BatchNorm1d(64),  # normalizes each feature across the batch dimension
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.BatchNorm1d(32),
    nn.ReLU(),
    nn.Linear(32, 10)
)

# BatchNorm2d: for 2D feature maps (after CNN layers)
cnn_with_bn = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU()
)

# LayerNorm: 특징 차원으로 정규화 (Transformer에 주로 사용)
# BatchNorm과 달리 배치 크기에 독립적
transformer_norm = nn.Sequential(
    nn.Linear(512, 512),
    nn.LayerNorm(512),  # 마지막 차원(512)으로 정규화
    nn.ReLU()
)

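# GroupNorm: a middle ground between BatchNorm and LayerNorm (useful for small batches)
group_norm = nn.GroupNorm(
    num_groups=8,    # split the channels into 8 groups
    num_channels=64  # total number of channels
)

# InstanceNorm: used for style transfer and similar tasks
instance_norm = nn.InstanceNorm2d(64)

# Summary of normalization methods:
# BatchNorm   : normalizes over batch × spatial dims → effective for CNNs, depends on batch size
# LayerNorm   : normalizes over the feature dim → effective for Transformers and RNNs
# GroupNorm   : alternative to BatchNorm when batches are small
# InstanceNorm: style transfer, image generation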
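
The differences summarized above can be checked numerically. The following is a minimal sketch on a random batch (the shapes and the shift and scale applied to `x` are arbitrary assumptions): Dropout behaves differently in train and eval mode, BatchNorm1d normalizes each feature column across the batch, and LayerNorm normalizes each sample across its own features.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 8) * 3.0 + 5.0  # batch of 16 samples, 8 features

# Dropout: random in train mode, identity in eval mode
drop = nn.Dropout(p=0.5)
drop.train()
y_train = drop(x)  # some entries zeroed, survivors scaled by 1 / (1 - p)
drop.eval()
y_eval = drop(x)   # unchanged

# BatchNorm1d (train mode): each feature column is normalized across the batch
bn = nn.BatchNorm1d(8)
bn.train()
bn_out = bn(x)
print(bn_out.mean(dim=0))  # close to 0 for every feature

# LayerNorm: each sample (row) is normalized across its own features
ln = nn.LayerNorm(8)
ln_out = ln(x)
print(ln_out.mean(dim=1))  # close to 0 for every sample
```

Because LayerNorm statistics are computed per sample, its output for one row does not change when the rest of the batch changes, which is why it suits variable batch sizes.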

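14. Transfer Learning

Starting from a model pretrained on ImageNet makes it possible to reach high accuracy even with little data.

import torch
import torchvision.models as models
import torch.nn as nn

# Load pretrained models
# Passing the weights argument explicitly is recommended (current API)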
resnet50 = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
efficientnet = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
vit = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)

print(resnet50)  # print the architecture

# Method 1: feature extractor (frozen backbone)
# Freeze every backbone parameter
for param in resnet50.parameters():
    param.requires_grad = False

# Replace only the final FC layer (sized for the new number of classes)
num_classes = 5
resnet50.fc = nn.Linear(resnet50.fc.in_features, num_classes)

# Only the new final layer is trained
trainable = sum(p.numel() for p in resnet50.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")  # 10,245 (2048 × 5 weights + 5 biases)

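# Method 2: fine-tuning (train all or some of the layers)
resnet_ft = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
resnet_ft.fc = nn.Linear(resnet_ft.fc.in_features, num_classes)

# Per-layer learning rates (earlier layers get a lower LR)
# Note: parameters not listed below (conv1, bn1) are not updated by this optimizer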
optimizer = torch.optim.AdamW([
    {'params': resnet_ft.layer1.parameters(), 'lr': 1e-5},
    {'params': resnet_ft.layer2.parameters(), 'lr': 1e-5},
    {'params': resnet_ft.layer3.parameters(), 'lr': 1e-4},
    {'params': resnet_ft.layer4.parameters(), 'lr': 1e-4},
    {'params': resnet_ft.fc.parameters(),     'lr': 1e-3},
], lr=1e-4, weight_decay=0.01)

# Preprocessing: torchvision transforms (needed for both methods above)
from torchvision import transforms

# Input normalization statistics expected by the pretrained models (ImageNet)
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

15. Saving and Loading Models

This section covers how to save a trained model and load it again for reuse.

import torch
import torch.nn as nn

model = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters())

# Method 1: save the state_dict (recommended)
# Stores only the model parameters, not the architecture
torch.save(model.state_dict(), 'model_weights.pt')

# Loading
loaded_model = nn.Linear(10, 5)  # the same architecture must be rebuilt first
loaded_model.load_state_dict(torch.load('model_weights.pt',
                                         weights_only=True))
loaded_model.eval()

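# Method 2: save the entire model (not recommended; poor portability)
torch.save(model, 'full_model.pt')
# weights_only=False unpickles arbitrary objects, so use it only with trusted files
loaded_full = torch.load('full_model.pt', weights_only=False)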

# Method 3: checkpoint, saving the full state needed to resume training
def save_checkpoint(model, optimizer, scheduler, epoch, loss, path):
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scheduler_state_dict': scheduler.state_dict() if scheduler else None,
        'loss': loss,
    }, path)
    print(f"Checkpoint saved: {path}")

def load_checkpoint(path, model, optimizer=None, scheduler=None):
    checkpoint = torch.load(path, map_location='cpu', weights_only=True)
    model.load_state_dict(checkpoint['model_state_dict'])

    if optimizer:
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    if scheduler and checkpoint['scheduler_state_dict']:
        scheduler.load_state_dict(checkpoint['scheduler_state_dict'])

    return checkpoint['epoch'], checkpoint['loss']

# Usage example
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
save_checkpoint(model, optimizer, scheduler, epoch=50, loss=0.25,
                path='checkpoint_ep50.pt')

start_epoch, prev_loss = load_checkpoint('checkpoint_ep50.pt',
                                          model, optimizer, scheduler)
print(f"Resuming: epoch={start_epoch}, loss={prev_loss:.4f}")

# Load a model that was saved on GPU onto the CPU
model_cpu = nn.Linear(10, 5)
model_cpu.load_state_dict(
    torch.load('model_weights.pt', map_location='cpu', weights_only=True)
)
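
A quick way to confirm that a state_dict round trip is lossless is to compare parameters and outputs before and after. The sketch below assumes a temporary directory so that it leaves no files behind; the variable names are illustrative.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 5)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_weights.pt")
    torch.save(model.state_dict(), path)

    restored = nn.Linear(10, 5)  # same architecture, different random init
    state = torch.load(path, map_location="cpu", weights_only=True)
    restored.load_state_dict(state)  # strict=True by default: keys must match
    restored.eval()

# Every tensor should be identical after the round trip
params_match = all(
    torch.equal(p, q)
    for p, q in zip(model.state_dict().values(), restored.state_dict().values())
)

x = torch.randn(3, 10)
with torch.no_grad():
    outputs_match = torch.allclose(model(x), restored(x))

print(params_match, outputs_match)
```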

16. TorchScript and Model Deployment

This section covers ways to deploy a trained model to a production environment.

import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 5)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = SimpleNet()
model.eval()

# Method 1: torch.jit.script, which compiles the whole model (control flow included)
scripted_model = torch.jit.script(model)

# Save and load
scripted_model.save('model_scripted.pt')
loaded_scripted = torch.jit.load('model_scripted.pt')

x = torch.randn(4, 10)
with torch.no_grad():
    out = loaded_scripted(x)
print(f"TorchScript output: {out.shape}")

# Method 2: torch.jit.trace, which records the operations run on an example input
example_input = torch.randn(1, 10)
traced_model = torch.jit.trace(model, example_input)
traced_model.save('model_traced.pt')

# Method 3: ONNX export (interoperability with other frameworks and runtimes)
import torch.onnx

dummy_input = torch.randn(1, 10)
torch.onnx.export(
    model,
    dummy_input,
    'model.onnx',
    export_params=True,
    opset_version=17,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)
print("ONNX export complete")

# torch.compile (PyTorch 2.0+): the newer compilation path
# Can be applied without changing the existing model code
compiled_model = torch.compile(model)
out = compiled_model(x)
print(f"torch.compile output: {out.shape}")
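
The practical difference between `torch.jit.trace` and `torch.jit.script` shows up when `forward` contains data-dependent control flow. The sketch below uses a hypothetical `SignGate` module (not part of the guide's models) to illustrate it: tracing records only the branch taken for the example input, while scripting keeps the `if`.

```python
import warnings

import torch
import torch.nn as nn

class SignGate(nn.Module):  # hypothetical module with a data-dependent branch
    def forward(self, x):
        if x.sum() > 0:
            return x * 2
        return -x

module = SignGate().eval()
positive = torch.ones(3)
negative = -torch.ones(3)

# Tracing records only the branch taken for the example input
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the TracerWarning about the branch
    traced = torch.jit.trace(module, positive)

# Scripting compiles the Python source, so the `if` survives
scripted = torch.jit.script(module)

traced_out = traced(negative)      # follows the recorded `x * 2` path
scripted_out = scripted(negative)  # follows the `-x` path, like eager mode
print(traced_out, scripted_out)
```

For a model such as `SimpleNet` above, which has no branching, both methods give the same result; the distinction matters for models with loops or conditionals on tensor values.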

17. Distributed Training (DDP): DistributedDataParallel

DDP uses multiple GPUs to speed up training substantially.

Official tutorial: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

# train_ddp.py: written as a standalone script
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from torchvision import datasets, transforms

def setup(rank, world_size):
    """Initialize the process group."""
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # nccl: GPU communication backend (recommended)
    # gloo: for CPU training or debugging
    dist.init_process_group(
        backend='nccl',
        rank=rank,
        world_size=world_size
    )

def cleanup():
    """Tear down the process group."""
    dist.destroy_process_group()

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.net(x.view(x.size(0), -1))

def train(rank, world_size, num_epochs=5):
    print(f"Process {rank}/{world_size} starting")
    setup(rank, world_size)

    # Assign one GPU to each process
    torch.cuda.set_device(rank)
    device = torch.device(f'cuda:{rank}')

    # Create the model and wrap it in DDP
    model = SimpleModel().to(device)
    ddp_model = DDP(model, device_ids=[rank])

    # Dataset and distributed sampler
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    dataset = datasets.MNIST('./data', train=True, download=True,
                              transform=transform)

    # DistributedSampler: gives each process a different shard of the data
    sampler = DistributedSampler(
        dataset,
        num_replicas=world_size,
        rank=rank,
        shuffle=True
    )

    loader = DataLoader(
        dataset,
        batch_size=128,
        sampler=sampler,
        num_workers=4,
        pin_memory=True
    )

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(ddp_model.parameters(), lr=0.001)

    for epoch in range(num_epochs):
        # Update the sampler's epoch so every epoch uses a different shuffle
        sampler.set_epoch(epoch)

        ddp_model.train()
        total_loss = 0.0

        for batch_idx, (data, target) in enumerate(loader):
            data, target = data.to(device), target.to(device)

            optimizer.zero_grad()
            output = ddp_model(data)
            loss = criterion(output, target)
            loss.backward()  # gradients are synchronized across processes automatically
            optimizer.step()

            total_loss += loss.item()

        # Log from rank 0 only (this is rank 0's local average, not a global one)
        if rank == 0:
            avg_loss = total_loss / len(loader)
            print(f"Epoch {epoch+1}: Average Loss = {avg_loss:.4f}")

    cleanup()

# Run with: python train_ddp.py (mp.spawn starts one process per visible GPU)
if __name__ == '__main__':
    import torch.multiprocessing as mp
    world_size = torch.cuda.device_count()
    mp.spawn(
        train,
        args=(world_size, 5),
        nprocs=world_size,
        join=True
    )
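
How `DistributedSampler` splits a dataset can be inspected without GPUs or a process group by passing `num_replicas` and `rank` explicitly. This is a minimal sketch on a ten-element toy dataset, which is an assumption made for readability.

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(10))

# Passing num_replicas and rank explicitly avoids needing a process group
shards = [
    list(DistributedSampler(dataset, num_replicas=2, rank=r, shuffle=False))
    for r in range(2)
]
print(shards)  # [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]

# With shuffle=True the permutation is seeded by (seed + epoch),
# so the order changes only when set_epoch is called with a new value
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True, seed=0)
sampler.set_epoch(0)
first = list(sampler)
repeat = list(sampler)   # same epoch, same order
sampler.set_epoch(1)
second = list(sampler)   # new epoch, new permutation
print(first, repeat, second)
```

This is why the training loop calls `sampler.set_epoch(epoch)`: without it, every epoch would iterate over the data in the same order.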

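Running with torchrun

torchrun launches the worker processes itself and passes RANK, LOCAL_RANK, and WORLD_SIZE as environment variables, so a script started this way should read those variables instead of calling mp.spawn.

# Single node, 4 GPUs
torchrun --nproc_per_node=4 train_ddp.py

# Multi-node training (node 0)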
torchrun --nnodes=2 --nproc_per_node=4 \
         --node_rank=0 \
         --master_addr="192.168.1.100" \
         --master_port=12355 \
         train_ddp.py
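
A torchrun-compatible entry point might look like the sketch below. It is not the guide's `train_ddp.py`: the helper names `setup_from_env` and `main` are hypothetical, a single `nn.Linear` stands in for `SimpleModel`, and the `setdefault` fallbacks (including port 29500) are assumptions added only so the sketch also runs as a single process with plain `python`.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Fallbacks (assumption): let the sketch run as one process without torchrun
os.environ.setdefault("RANK", "0")
os.environ.setdefault("LOCAL_RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

def setup_from_env():
    """Initialize the process group from the variables torchrun exports."""
    rank = int(os.environ["RANK"])              # global rank across all nodes
    local_rank = int(os.environ["LOCAL_RANK"])  # index of this process on its node
    world_size = int(os.environ["WORLD_SIZE"])

    use_cuda = torch.cuda.is_available()
    # Default env:// init reads MASTER_ADDR / MASTER_PORT, RANK, and WORLD_SIZE
    dist.init_process_group(backend="nccl" if use_cuda else "gloo")
    if use_cuda:
        torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}" if use_cuda else "cpu")
    return rank, local_rank, world_size, device

def main():
    rank, local_rank, world_size, device = setup_from_env()
    model = nn.Linear(784, 10).to(device)  # stand-in for SimpleModel
    ddp_model = DDP(model, device_ids=[local_rank] if device.type == "cuda" else None)
    # ... build the DistributedSampler / DataLoader and run the training loop here ...
    if rank == 0:
        print(f"world_size={world_size}, device={device}")
    dist.destroy_process_group()
    return world_size

if __name__ == "__main__":
    main()
```

Note that the device index comes from `LOCAL_RANK`, not `RANK`: on the second node of a multi-node job, global ranks start at 4 while the local GPU indices still start at 0. A script used for multi-node runs should also avoid hard-coding `MASTER_ADDR` to `localhost`, since that would override the address passed on the command line.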

DataParallel vs DistributedDataParallel

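# DataParallel (DP): simple but inefficient
# - All gradients are gathered on GPU 0 → bottleneck
# - Single process with multiple threads, not multiple processes
model_dp = nn.DataParallel(model, device_ids=[0, 1, 2, 3])

# DistributedDataParallel (DDP): the recommended approach
# - Each GPU computes gradients in its own process
# - Efficient synchronization via all-reduce
# - Faster than DP even on a single machine with multiple GPUs (avoids the Python GIL)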
model_ddp = DDP(model, device_ids=[rank])

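18. Advanced Techniques

Mixed Precision Training

# Recent PyTorch releases expose the same tools as torch.amp.autocast('cuda')
# and torch.amp.GradScaler('cuda'); the torch.cuda.amp names are deprecated there
from torch.cuda.amp import autocast, GradScaler

model = SimpleModel().to('cuda')
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler()  # loss scaling to prevent FP16 gradient underflow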

for data, target in train_loader:
    data, target = data.to('cuda'), target.to('cuda')
    optimizer.zero_grad()

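    # Forward pass under autocast (eligible ops run in FP16)
    with autocast():
        output = model(data)
        loss = criterion(output, target)

    # Backward pass on the scaled loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # unscales gradients; skips the step if they contain inf/NaN
    scaler.update()         # adjusts the scale factor for the next iteration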

Gradient Clipping

# Prevent exploding gradients
max_grad_norm = 1.0
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
optimizer.step()
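
When clipping is combined with the mixed-precision loop from the previous subsection, the gradients are still multiplied by the loss scale after `backward()`, so they need to be unscaled before the norm is measured. The sketch below shows one training step; the toy model, the random data, and the CPU fallback (AMP disabled when no GPU is present) are assumptions made so that it runs anywhere. It uses the `torch.cuda.amp.GradScaler` name to match the guide, which newer releases flag as deprecated in favor of `torch.amp.GradScaler`.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # AMP is switched off on CPU so the sketch still runs

torch.manual_seed(0)
model = nn.Linear(20, 2).to(device)  # hypothetical toy model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
max_grad_norm = 1.0

data = torch.randn(32, 20, device=device)
target = torch.randint(0, 2, (32,), device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device.type, enabled=use_amp):
    loss = criterion(model(data), target)

scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # bring gradients back to their true scale first
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
scaler.step(optimizer)      # skipped automatically if gradients contain inf/NaN
scaler.update()

print(f"gradient norm before clipping: {total_norm.item():.4f}")
```

`clip_grad_norm_` returns the total norm measured before clipping, which is a convenient value to log when diagnosing unstable training.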

Reproducibility Settings

import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Make cuDNN deterministic (slower; some other ops may still be nondeterministic)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
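
Whether seeding works as intended can be checked by drawing from each generator twice with the same seed. This is a minimal sketch; the `draw` helper is hypothetical, and the cuDNN flags are omitted because they do not affect CPU random number generation.

```python
import random

import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable

def draw():
    """Draw one value from each random number generator."""
    return random.random(), float(np.random.rand()), torch.rand(3)

set_seed(42)
first = draw()
set_seed(42)
second = draw()

same = (
    first[0] == second[0]
    and first[1] == second[1]
    and torch.equal(first[2], second[2])
)
print(same)  # True
```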

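Closing Thoughts

This guide has walked through PyTorch systematically, from its core concepts to practical distributed training. A suggested learning roadmap follows.

Beginner: tensor manipulation, Autograd, simple model implementations
Intermediate: CNN, RNN, Transfer Learning, DataLoader optimization
Advanced: Transformer, distributed training (DDP), Mixed Precision
Deployment: TorchScript, ONNX, torch.compile

The PyTorch ecosystem keeps evolving. For the latest features and updates, see the official documentation and the PyTorch blog.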