- Authors
- Youngju Kim (@fjvbn20031)
- Introduction
- 1. Environment Setup
- 2. Tensor Basics
- 3. Automatic Differentiation (Autograd)
- 4. nn.Module — The Foundation of Neural Networks
- 5. Linear Regression from Scratch
- 6. Multi-Layer Perceptron (MLP) — MNIST Classification
- 7. Convolutional Neural Network (CNN) — CIFAR-10 Classification
- 8. Recurrent Neural Networks (RNN / LSTM) — Sequential Data
- 9. Transformer — Multi-head Attention from Scratch
- 10. Data Loading — Dataset and DataLoader
- 11. Optimizers — SGD, Adam, AdamW
- 12. Learning Rate Schedulers
- 13. Regularization — Dropout, BatchNorm, LayerNorm
- 14. Transfer Learning
- 15. Saving and Loading Models
- 16. TorchScript and Model Deployment
- 17. Distributed Training (DDP) — DistributedDataParallel
- 18. Advanced Techniques
- Conclusion
Introduction
Of the two dominant deep learning frameworks — TensorFlow and PyTorch — PyTorch has become the preferred choice for researchers and engineers alike. Released by Facebook AI Research (now Meta AI) in 2016, PyTorch quickly became the standard for implementing academic papers and has seen steadily growing industrial adoption as well.
This guide targets readers with basic Python knowledge and walks through everything from first contact with PyTorch all the way to distributed training. Each section includes runnable code examples and links to the official documentation so you can read and practice simultaneously.
Official docs: https://pytorch.org/docs/stable/index.html
Official tutorials: https://pytorch.org/tutorials/
1. Environment Setup
Installing PyTorch
PyTorch can be installed via pip or conda. To use a GPU, select the package matching your CUDA version.
pip install (CUDA 12.1):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
conda install:
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
CPU-only install:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
Verifying GPU Availability
import torch
# Check PyTorch version
print(f"PyTorch version: {torch.__version__}")
# Check CUDA availability
print(f"CUDA available: {torch.cuda.is_available()}")
# GPU count and info
if torch.cuda.is_available():
    print(f"GPU count: {torch.cuda.device_count()}")
    print(f"Current GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
# Apple Silicon (M1/M2/M3) MPS check
print(f"MPS available: {torch.backends.mps.is_available()}")
# Auto-select device
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
print(f"Using device: {device}")
2. Tensor Basics
Tensors are the foundational data structure of PyTorch. They are similar to NumPy's ndarray but support GPU computation and automatic differentiation.
Official docs: https://pytorch.org/docs/stable/tensors.html
Creating Tensors
import torch
import numpy as np
# From data directly
t1 = torch.tensor([1, 2, 3, 4, 5])
print(f"1D tensor: {t1}, shape: {t1.shape}, dtype: {t1.dtype}")
# 2D tensor (matrix)
t2 = torch.tensor([[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0]])
print(f"2D tensor:\n{t2}, shape: {t2.shape}")
# Special tensor creation
zeros = torch.zeros(3, 4) # all zeros
ones = torch.ones(2, 3) # all ones
rand = torch.rand(3, 3) # uniform [0, 1)
randn = torch.randn(3, 3) # standard normal
eye = torch.eye(4) # identity matrix
arange = torch.arange(0, 10, 2) # [0, 2, 4, 6, 8]
linspace = torch.linspace(0, 1, 5) # 5 evenly spaced values
# Create with same shape as existing tensor
t3 = torch.zeros_like(t2)
t4 = torch.ones_like(t2)
t5 = torch.rand_like(t2)
# From NumPy array (shared memory)
np_arr = np.array([1.0, 2.0, 3.0])
t_from_np = torch.from_numpy(np_arr)
# Tensor to NumPy (CPU only)
np_from_t = t1.numpy()
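The shared-memory behavior of `from_numpy` is worth verifying directly — mutating the NumPy array changes the tensor, while `torch.tensor()` makes an independent copy. A minimal check:

```python
import numpy as np
import torch

np_arr = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(np_arr)   # shares memory with np_arr

np_arr[0] = 100.0              # mutate the NumPy side
print(t[0].item())             # the tensor sees the change → 100.0

t_copy = torch.tensor(np_arr)  # torch.tensor() copies the data instead
np_arr[0] = -1.0
print(t_copy[0].item())        # the copy is unaffected → 100.0
```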
Tensor Attributes and Type Conversion
t = torch.rand(3, 4, 5)
print(f"shape: {t.shape}") # torch.Size([3, 4, 5])
print(f"ndim: {t.ndim}") # 3
print(f"dtype: {t.dtype}") # torch.float32
print(f"device: {t.device}") # cpu
print(f"numel: {t.numel()}") # 60 (total elements)
# Type conversion
t_int = t.to(torch.int32)
t_long = t.long() # torch.int64
t_float = t.float() # torch.float32
t_double = t.double() # torch.float64
t_half = t.half() # torch.float16
# Move to GPU
if torch.cuda.is_available():
    t_gpu = t.to("cuda")
    t_back = t_gpu.cpu()  # back to CPU
Reshaping Tensors
t = torch.arange(24) # 1D tensor 0..23
t_2d = t.reshape(4, 6)
t_3d = t.reshape(2, 3, 4)
t_auto = t.reshape(6, -1) # -1 infers the size (6x4)
# view: like reshape but requires contiguous memory
t_view = t.view(3, 8)
# squeeze / unsqueeze
t = torch.zeros(1, 3, 1, 4)
t_sq = t.squeeze() # remove size-1 dims → [3, 4]
t_sq1 = t.squeeze(0) # remove dim 0 only → [3, 1, 4]
t_unsq = t_sq.unsqueeze(0) # add dim at 0 → [1, 3, 4]
# transpose / permute
t = torch.rand(2, 3, 4)
t_T = t.transpose(0, 1) # [3, 2, 4]
t_perm = t.permute(2, 0, 1) # [4, 2, 3]
t_cont = t_perm.contiguous() # ensure contiguous memory
Tensor Operations
a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([[5.0, 6.0], [7.0, 8.0]])
# Element-wise arithmetic
print(a + b) # torch.add(a, b)
print(a - b) # torch.sub(a, b)
print(a * b) # Hadamard product
print(a / b)
print(a ** 2)
# Matrix multiplication
matmul = a @ b # or torch.matmul(a, b)
mm = torch.mm(a, b) # 2D only
# Reduction
t = torch.rand(3, 4)
print(t.sum())
print(t.mean())
print(t.max())
print(t.std())
print(t.sum(dim=0)) # reduce over dim 0 → column sums, shape [4]
print(t.sum(dim=1, keepdim=True)) # row sums, shape [3, 1]
# argmax / argmin
print(t.argmax())
print(t.argmax(dim=1))
Broadcasting
a = torch.tensor([[1, 2, 3],
[4, 5, 6]]) # shape: [2, 3]
b = torch.tensor([10, 20, 30]) # shape: [3]
# b is broadcast to [2, 3]
print(a + b)
# tensor([[11, 22, 33],
# [14, 25, 36]])
# Column vector + row vector
col = torch.tensor([[1], [2], [3]]) # [3, 1]
row = torch.tensor([10, 20, 30]) # [3]
print(col + row) # [3, 3] outer-sum
Indexing and Slicing
t = torch.arange(24).reshape(2, 3, 4).float()
print(t[0]) # first matrix [3, 4]
print(t[0, 1]) # second row [4]
print(t[0, 1, 2]) # scalar
print(t[:, 1:, :2]) # slicing
# Fancy indexing
indices = torch.tensor([0, 2])
print(t[:, indices, :])
# Boolean masking
mask = t > 10
print(t[mask]) # 1D tensor of elements > 10
# torch.where
a = torch.tensor([1.0, 2.0, 3.0, 4.0])
b = torch.tensor([10.0, 20.0, 30.0, 40.0])
print(torch.where(a > 2, b, a)) # tensor([ 1., 2., 30., 40.])
3. Automatic Differentiation (Autograd)
Autograd automatically builds a computational graph and computes gradients via backpropagation — the engine behind all neural network training.
Official docs: https://pytorch.org/docs/stable/autograd.html
requires_grad and Computational Graph
import torch
x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)
z = x ** 2 + 2 * x * y + y ** 2 # (x + y)^2 = 49
z.backward()
# dz/dx = 2x + 2y = 2*3 + 2*4 = 14
print(f"dz/dx = {x.grad}") # 14.0
print(f"dz/dy = {y.grad}") # 14.0
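A useful habit when learning autograd is to cross-check its gradients against central finite differences. A minimal sketch with a hand-picked function (f and its inputs are illustrative, not from the examples above):

```python
import torch

def f(x):
    return (x ** 3 + 2 * x).sum()

x = torch.tensor([1.0, 2.0], requires_grad=True)
f(x).backward()
analytic = x.grad.clone()            # df/dx = 3x^2 + 2 → [5., 14.]

# Central finite differences: (f(x+eps) - f(x-eps)) / (2*eps)
eps = 1e-4
numeric = torch.zeros_like(x)
with torch.no_grad():
    for i in range(x.numel()):
        xp = x.clone(); xp[i] += eps
        xm = x.clone(); xm[i] -= eps
        numeric[i] = (f(xp) - f(xm)) / (2 * eps)

print(torch.allclose(analytic, numeric, atol=1e-2))  # True
```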
Gradients for Multi-dimensional Tensors
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2
z = y.sum()
z.backward()
print(f"x.grad: {x.grad}") # [2, 4, 6] (dz/dx = 2x)
# Non-scalar backward with gradient argument
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2
grad_output = torch.ones(3)
y.backward(gradient=grad_output)
print(f"x.grad: {x.grad}") # [2, 4, 6]
Gradient Control
x = torch.tensor(2.0, requires_grad=True)
for i in range(3):
    y = x ** 2
    y.backward()
    print(f"iteration {i}: x.grad = {x.grad}")
    x.grad.zero_()  # IMPORTANT: reset gradient every step
# no_grad: disable gradient tracking for inference
with torch.no_grad():
    y = x ** 2
    print(f"y.requires_grad: {y.requires_grad}")  # False
# detach: separate tensor from the graph
x = torch.tensor([1.0, 2.0], requires_grad=True)
z = (x * 2).detach()
print(f"z.requires_grad: {z.requires_grad}") # False
Higher-order Gradients
x = torch.tensor(3.0, requires_grad=True)
y = x ** 4
# First derivative: dy/dx = 4x^3
dy_dx = torch.autograd.grad(y, x, create_graph=True)[0]
print(f"First derivative: {dy_dx}") # 108
# Second derivative: d2y/dx2 = 12x^2
d2y_dx2 = torch.autograd.grad(dy_dx, x)[0]
print(f"Second derivative: {d2y_dx2}") # 108
4. nn.Module — The Foundation of Neural Networks
torch.nn.Module is the base class for all PyTorch models. Every layer, activation function, and complete model inherits from it.
Official docs: https://pytorch.org/docs/stable/nn.html
import torch
import torch.nn as nn
class SimpleModel(nn.Module):
    def __init__(self, in_features, hidden_size, out_features):
        super().__init__()
        # Layers are automatically registered as parameters
        self.fc1 = nn.Linear(in_features, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, out_features)
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x
model = SimpleModel(784, 256, 10)
print(model)
# Count parameters
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total params: {total:,}")
print(f"Trainable params: {trainable:,}")
# Iterate named parameters
for name, param in model.named_parameters():
    print(f"{name}: {param.shape}")
# Forward pass
x = torch.randn(32, 784)
output = model(x)
print(f"Output shape: {output.shape}") # [32, 10]
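The `train()`/`eval()` switch matters because layers like Dropout behave differently in each mode. A quick standalone demonstration (the layer and input here are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()                  # training mode: roughly half the values are zeroed,
out_train = drop(x)           # and survivors are scaled by 1/(1-p) = 2
print((out_train == 0).float().mean().item())  # ≈ 0.5

drop.eval()                   # eval mode: Dropout becomes the identity
out_eval = drop(x)
print(torch.equal(out_eval, x))  # True
```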
Sequential, ModuleList, ModuleDict
# Sequential: stack layers in order
seq_model = nn.Sequential(
nn.Linear(784, 256),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(256, 128),
nn.ReLU(),
nn.Linear(128, 10)
)
# ModuleList: manage layers as a list
class ResidualNet(nn.Module):
    def __init__(self, num_blocks, hidden_size):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(hidden_size, hidden_size)
            for _ in range(num_blocks)
        ])
        self.relu = nn.ReLU()

    def forward(self, x):
        for layer in self.layers:
            x = self.relu(layer(x)) + x  # residual connection
        return x
# ModuleDict: manage layers as a dictionary
class MultiTaskModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(784, 256)
        self.heads = nn.ModuleDict({
            'classification': nn.Linear(256, 10),
            'regression': nn.Linear(256, 1)
        })

    def forward(self, x, task='classification'):
        features = torch.relu(self.backbone(x))
        return self.heads[task](features)
5. Linear Regression from Scratch
Linear regression is the simplest deep learning model. Building it from scratch solidifies understanding of the training loop.
import torch
import torch.nn as nn
import torch.optim as optim
torch.manual_seed(42)
n_samples = 200
# Generate synthetic data: y = 3x + 2 + noise
X = torch.linspace(-5, 5, n_samples).unsqueeze(1)
y_true = 3 * X + 2
y = y_true + torch.randn_like(y_true) * 0.5
class LinearRegression(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)
model = LinearRegression()
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
n_epochs = 1000
losses = []
for epoch in range(n_epochs):
    # 1. Forward pass
    y_pred = model(X)
    # 2. Compute loss
    loss = criterion(y_pred, y)
    losses.append(loss.item())
    # 3. Zero gradients (critical!)
    optimizer.zero_grad()
    # 4. Backward pass
    loss.backward()
    # 5. Update parameters
    optimizer.step()
    if (epoch + 1) % 200 == 0:
        w = model.linear.weight.item()
        b = model.linear.bias.item()
        print(f"Epoch {epoch+1}: Loss={loss.item():.4f}, w={w:.4f}, b={b:.4f}")
print(f"\nLearned weight: {model.linear.weight.item():.4f} (true: 3.0)")
print(f"Learned bias: {model.linear.bias.item():.4f} (true: 2.0)")
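For plain linear regression there is also a closed-form answer, which makes a nice sanity check on gradient descent: solve ordinary least squares directly with `torch.linalg.lstsq` on the same synthetic data and compare. A sketch (the bias-column construction is the standard OLS trick, not from the code above):

```python
import torch

torch.manual_seed(42)
X = torch.linspace(-5, 5, 200).unsqueeze(1)
y = 3 * X + 2 + torch.randn_like(X) * 0.5

# Closed-form OLS: append a column of ones for the bias, solve min ||A @ wb - y||
A = torch.cat([X, torch.ones_like(X)], dim=1)   # [200, 2]
wb = torch.linalg.lstsq(A, y).solution          # [2, 1]
w, b = wb[0].item(), wb[1].item()
print(f"w ≈ {w:.3f}, b ≈ {b:.3f}")  # should land close to the true 3.0 and 2.0
```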
6. Multi-Layer Perceptron (MLP) — MNIST Classification
Building a complete classification model on the MNIST handwritten digit dataset.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
BATCH_SIZE = 64
LEARNING_RATE = 0.001
N_EPOCHS = 10
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST('./data', train=False, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2)
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Flatten(),
            nn.Linear(784, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.network(x)
model = MLP().to(DEVICE)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
def train_epoch(model, loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    for data, target in loader:
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        pred = output.argmax(dim=1)
        correct += pred.eq(target).sum().item()
        total += target.size(0)
    return total_loss / len(loader), 100.0 * correct / total

def evaluate(model, loader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0
    with torch.no_grad():
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            total_loss += criterion(output, target).item()
            pred = output.argmax(dim=1)
            correct += pred.eq(target).sum().item()
            total += target.size(0)
    return total_loss / len(loader), 100.0 * correct / total
for epoch in range(N_EPOCHS):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, DEVICE)
    test_loss, test_acc = evaluate(model, test_loader, criterion, DEVICE)
    print(f"Epoch {epoch+1}/{N_EPOCHS} | "
          f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}% | "
          f"Test Loss: {test_loss:.4f}, Test Acc: {test_acc:.2f}%")
7. Convolutional Neural Network (CNN) — CIFAR-10 Classification
Implementing a VGG-style CNN for image classification on CIFAR-10.
import torch
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
transform_train = transforms.Compose([
transforms.RandomCrop(32, padding=4),
transforms.RandomHorizontalFlip(),
transforms.ColorJitter(brightness=0.2, contrast=0.2),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465),
(0.2023, 0.1994, 0.2010))
])
transform_test = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465),
(0.2023, 0.1994, 0.2010))
])
train_data = datasets.CIFAR10('./data', train=True, download=True, transform=transform_train)
test_data = datasets.CIFAR10('./data', train=False, transform=transform_test)
train_loader = DataLoader(train_data, batch_size=128, shuffle=True, num_workers=4)
test_loader = DataLoader(test_data, batch_size=128, shuffle=False, num_workers=4)
CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
'dog', 'frog', 'horse', 'ship', 'truck']
class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 3 → 64, 32x32 → 16x16
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Dropout2d(0.1),
            # Block 2: 64 → 128, 16x16 → 8x8
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Dropout2d(0.2),
            # Block 3: 128 → 256, 8x8 → 4x4
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 4 * 4, 1024),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(1024, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = CNN().to(DEVICE)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
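Hard-coding the flattened size (`256 * 4 * 4` above) is brittle: change one pooling layer and it silently breaks. A common alternative is to run one dummy input through the conv stack and read the size off. A minimal sketch with a smaller stand-in extractor (the two-block `features` here is illustrative, not the CNN above):

```python
import torch
import torch.nn as nn

# A small stand-in feature extractor: 32x32 input, two pooling stages → 8x8
features = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

# Discover the flattened size with one dummy forward pass
with torch.no_grad():
    n_flat = features(torch.zeros(1, 3, 32, 32)).numel()
print(n_flat)  # 32 * 8 * 8 = 2048

classifier = nn.Linear(n_flat, 10)   # no magic numbers in the classifier
```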
8. Recurrent Neural Networks (RNN / LSTM) — Sequential Data
RNNs and LSTMs excel at time-series data and natural language processing tasks.
import torch
import torch.nn as nn
import numpy as np
class LSTMPredictor(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2,
                 output_size=1, dropout=0.2):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,  # input: [batch, seq_len, features]
            dropout=dropout if num_layers > 1 else 0,
            bidirectional=False
        )
        self.fc = nn.Sequential(
            nn.Linear(hidden_size, 32),
            nn.ReLU(),
            nn.Linear(32, output_size)
        )

    def forward(self, x):
        batch_size = x.size(0)
        h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(x.device)
        c0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(x.device)
        # out: [batch_size, seq_len, hidden_size]
        out, (hn, cn) = self.lstm(x, (h0, c0))
        # Use only the last time step
        out = self.fc(out[:, -1, :])
        return out
# Generate sine wave dataset
t = np.linspace(0, 100, 1000)
data = np.sin(0.5 * t) + 0.1 * np.random.randn(1000)
data = torch.FloatTensor(data).unsqueeze(1)
def create_sequences(data, seq_len=50):
    X, y = [], []
    for i in range(len(data) - seq_len):
        X.append(data[i:i+seq_len])
        y.append(data[i+seq_len])
    return torch.stack(X), torch.stack(y)
X, y = create_sequences(data, seq_len=50)
print(f"X shape: {X.shape}") # [950, 50, 1]
print(f"y shape: {y.shape}") # [950, 1]
# GRU — fewer parameters than LSTM
class GRUPredictor(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, num_layers,
                          batch_first=True, dropout=0.2)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.gru(x)
        return self.fc(out[:, -1, :])
9. Transformer — Multi-head Attention from Scratch
Implementing the key components of the "Attention Is All You Need" paper.
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        self.scale = math.sqrt(self.d_k)

    def split_heads(self, x):
        # [batch, seq, d_model] → [batch, num_heads, seq, d_k]
        batch, seq, _ = x.shape
        x = x.view(batch, seq, self.num_heads, self.d_k)
        return x.transpose(1, 2)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        Q = self.split_heads(self.W_q(query))
        K = self.split_heads(self.W_k(key))
        V = self.split_heads(self.W_v(value))
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)
        context = torch.matmul(attn_weights, V)
        context = context.transpose(1, 2).contiguous()
        context = context.view(batch_size, -1, self.d_model)
        output = self.W_o(context)
        return output, attn_weights
class FeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.ff = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        attn_out, _ = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_out))
        ff_out = self.ff(x)
        x = self.norm2(x + self.dropout(ff_out))
        return x

class PositionalEncoding(nn.Module):
    def __init__(self, d_model=512, max_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)
# Example usage
d_model = 512
encoder_layer = TransformerEncoderLayer(d_model=d_model, num_heads=8)
pos_enc = PositionalEncoding(d_model=d_model)
x = torch.randn(2, 10, d_model)
x = pos_enc(x)
output = encoder_layer(x)
print(f"Transformer Encoder output: {output.shape}") # [2, 10, 512]
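For decoder-style (autoregressive) attention, the `mask` argument is typically a causal mask built with `torch.tril`. The sketch below uses the same `masked_fill(mask == 0, -inf)` convention as the `MultiHeadAttention` class above, applied to a standalone score matrix for clarity:

```python
import torch

# Causal (look-ahead) mask: position i may attend only to positions <= i
seq_len = 5
mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

scores = torch.randn(seq_len, seq_len)
scores = scores.masked_fill(~mask, float('-inf'))  # block future positions
attn = torch.softmax(scores, dim=-1)               # masked slots get weight 0

print(attn[0])  # first position can only attend to itself → [1., 0., 0., 0., 0.]
```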
10. Data Loading — Dataset and DataLoader
An efficient data pipeline is directly tied to training speed and flexibility.
Official tutorial: https://pytorch.org/tutorials/beginner/basics/intro.html
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
from PIL import Image
import os
# Custom image Dataset
class CustomImageDataset(Dataset):
    def __init__(self, csv_file, img_dir, transform=None):
        self.annotations = pd.read_csv(csv_file)
        self.img_dir = img_dir
        self.transform = transform

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.annotations.iloc[idx, 0])
        image = Image.open(img_path).convert('RGB')
        label = int(self.annotations.iloc[idx, 1])
        if self.transform:
            image = self.transform(image)
        return image, label

# Tabular Dataset
class TabularDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.FloatTensor(X)
        self.y = torch.LongTensor(y)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
dataset = TabularDataset(
X=np.random.randn(1000, 20),
y=np.random.randint(0, 5, 1000)
)
# Advanced DataLoader settings
advanced_loader = DataLoader(
dataset,
batch_size = 64,
shuffle = True,
num_workers = 4, # parallel data loading
pin_memory = True, # faster GPU transfer
drop_last = True, # drop incomplete last batch
prefetch_factor = 2,
persistent_workers = True
)
for batch_X, batch_y in advanced_loader:
    print(f"batch X: {batch_X.shape}")  # [64, 20]
    print(f"batch y: {batch_y.shape}")  # [64]
    break
# WeightedRandomSampler: handle class imbalance
from torch.utils.data import WeightedRandomSampler
class_counts = [400, 250, 200, 100, 50]  # one count per class (the dataset above has 5 classes)
weights = 1.0 / torch.tensor(class_counts, dtype=torch.float)
sample_weights = weights[dataset.y]
sampler = WeightedRandomSampler(
weights = sample_weights,
num_samples = len(dataset),
replacement = True
)
balanced_loader = DataLoader(dataset, batch_size=32, sampler=sampler)
11. Optimizers — SGD, Adam, AdamW
Official docs: https://pytorch.org/docs/stable/optim.html
import torch
import torch.nn as nn
import torch.optim as optim
model = nn.Linear(100, 10)
# SGD with momentum and weight decay
sgd = optim.SGD(
model.parameters(),
lr = 0.01,
momentum = 0.9,
weight_decay = 1e-4,
nesterov = True
)
# Adam: adaptive learning rates
adam = optim.Adam(
model.parameters(),
lr = 0.001,
betas = (0.9, 0.999),
eps = 1e-8,
weight_decay = 0
)
# AdamW: correct decoupled weight decay (recommended for Transformers)
adamw = optim.AdamW(
model.parameters(),
lr = 1e-3,
betas = (0.9, 0.999),
weight_decay = 0.01
)
# Per-parameter learning rates (useful for transfer learning)
# NOTE: this assumes a model with `features` and `classifier` submodules,
# e.g. the CNN from section 7 — a plain nn.Linear has neither.
cnn_model = CNN()  # the CNN class from section 7
optimizer = optim.Adam([
    {'params': cnn_model.features.parameters(), 'lr': 1e-4},
    {'params': cnn_model.classifier.parameters(), 'lr': 1e-3},
], lr=1e-3)
# Save and restore optimizer state
checkpoint = {
'model': model.state_dict(),
'optimizer': optimizer.state_dict(),
'epoch': 10
}
torch.save(checkpoint, 'checkpoint.pt')
ckpt = torch.load('checkpoint.pt', weights_only=True)
model.load_state_dict(ckpt['model'])
optimizer.load_state_dict(ckpt['optimizer'])
12. Learning Rate Schedulers
Using a scheduler often improves final performance noticeably compared to a fixed learning rate.
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import (
    StepLR, CosineAnnealingLR, OneCycleLR, ReduceLROnPlateau,
    CosineAnnealingWarmRestarts
)
model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)
# StepLR: multiply LR by gamma every step_size epochs
step_scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
# CosineAnnealingLR: cosine decay
cosine_scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)
# ReduceLROnPlateau: reduce when metric stops improving
plateau_scheduler = ReduceLROnPlateau(
    optimizer,
    mode='min',
    factor=0.5,
    patience=10,
    min_lr=1e-7
)  # note: the old `verbose` argument is deprecated in recent PyTorch releases
# OneCycleLR: super-convergence
one_cycle = OneCycleLR(
optimizer,
max_lr = 0.01,
steps_per_epoch = 100,
epochs = 30,
pct_start = 0.3,
anneal_strategy = 'cos'
)
# CosineAnnealingWarmRestarts: periodic restarts
warm_restart = CosineAnnealingWarmRestarts(
optimizer,
T_0 = 10,
T_mult = 2,
eta_min = 1e-6
)
# Usage in training loop — in practice attach ONE scheduler per optimizer;
# both calls are shown here only to illustrate the two step() signatures
for epoch in range(100):
    train_loss = 0.5  # from actual training
    cosine_scheduler.step()              # epoch-based schedulers: step once per epoch
    plateau_scheduler.step(train_loss)   # ReduceLROnPlateau needs the monitored metric
    print(f"Epoch {epoch+1}: LR = {optimizer.param_groups[0]['lr']:.6f}")
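To see a schedule's actual effect, it helps to record the learning rate each epoch and check the decay points. A minimal sketch with StepLR (step_size=30, gamma=0.1, matching the settings above):

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import StepLR

model = nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = StepLR(opt, step_size=30, gamma=0.1)

lrs = []
for epoch in range(60):
    # ... training would happen here ...
    lrs.append(opt.param_groups[0]['lr'])  # record BEFORE stepping the scheduler
    sched.step()

# Epochs 0-29 run at 0.1; from epoch 30 on the LR is multiplied by 0.1 → 0.01
print(lrs[0], lrs[29], lrs[30], lrs[59])
```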
13. Regularization — Dropout, BatchNorm, LayerNorm
Regularization prevents overfitting and stabilizes training.
import torch
import torch.nn as nn
# Dropout: randomly zero out neurons during training
class DropoutDemo(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(100, 50)
        self.dropout = nn.Dropout(p=0.5)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)  # active during train(), inactive during eval()
        return self.fc2(x)
# BatchNorm1d: normalize over the batch dimension (for FC layers)
bn_model = nn.Sequential(
nn.Linear(100, 64),
nn.BatchNorm1d(64),
nn.ReLU(),
nn.Linear(64, 32),
nn.BatchNorm1d(32),
nn.ReLU(),
nn.Linear(32, 10)
)
# BatchNorm2d: for 2D feature maps (after Conv layers)
cnn_bn = nn.Sequential(
nn.Conv2d(3, 32, 3, padding=1),
nn.BatchNorm2d(32),
nn.ReLU(),
)
# LayerNorm: normalize over feature dimension (preferred for Transformers)
transformer_norm = nn.Sequential(
nn.Linear(512, 512),
nn.LayerNorm(512),
nn.ReLU()
)
# GroupNorm: a middle ground between BatchNorm and LayerNorm
group_norm = nn.GroupNorm(num_groups=8, num_channels=64)
# InstanceNorm: used in style transfer
instance_norm = nn.InstanceNorm2d(64)
# Summary:
# BatchNorm → CNN, batch-level statistics, depends on batch size
# LayerNorm → Transformers / RNNs, feature-level statistics
# GroupNorm → small batches where BatchNorm is unstable
# InstanceNorm → style transfer, image generation
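The key property behind the summary above can be checked directly: LayerNorm normalizes each sample over its feature dimension, so every row of the output has (approximately) zero mean and unit standard deviation regardless of batch size. A minimal sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 512) * 3 + 5     # features with arbitrary scale and shift

ln = nn.LayerNorm(512)
out = ln(x)

# Each sample is normalized independently over its 512 features
print(out.mean(dim=1))   # ≈ 0 for every row
print(out.std(dim=1))    # ≈ 1 for every row
```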
14. Transfer Learning
Leveraging ImageNet-pretrained models to achieve high performance with limited data.
import torch
import torchvision.models as models
import torch.nn as nn
# Load pretrained models
resnet50 = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
efficientnet = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
vit = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
# Strategy 1: Feature Extractor (freeze backbone)
for param in resnet50.parameters():
    param.requires_grad = False
num_classes = 5
resnet50.fc = nn.Linear(resnet50.fc.in_features, num_classes)
trainable = sum(p.numel() for p in resnet50.parameters() if p.requires_grad)
print(f"Trainable params: {trainable:,}")  # 10,245 (2048 * 5 weights + 5 biases)
# Strategy 2: Fine-tuning with layer-wise learning rates
resnet_ft = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
resnet_ft.fc = nn.Linear(resnet_ft.fc.in_features, num_classes)
optimizer = torch.optim.AdamW([
{'params': resnet_ft.layer1.parameters(), 'lr': 1e-5},
{'params': resnet_ft.layer2.parameters(), 'lr': 1e-5},
{'params': resnet_ft.layer3.parameters(), 'lr': 1e-4},
{'params': resnet_ft.layer4.parameters(), 'lr': 1e-4},
{'params': resnet_ft.fc.parameters(), 'lr': 1e-3},
], lr=1e-4, weight_decay=0.01)
# ImageNet normalization for preprocessing
from torchvision import transforms
preprocess = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std =[0.229, 0.224, 0.225])
])
15. Saving and Loading Models
import torch
import torch.nn as nn
model = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters())
# Option 1: state_dict (recommended)
torch.save(model.state_dict(), 'model_weights.pt')
loaded_model = nn.Linear(10, 5)
loaded_model.load_state_dict(torch.load('model_weights.pt', weights_only=True))
loaded_model.eval()
# Option 2: full model (not recommended — low portability)
torch.save(model, 'full_model.pt')
# Option 3: checkpoint — save full training state
def save_checkpoint(model, optimizer, scheduler, epoch, loss, path):
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scheduler_state_dict': scheduler.state_dict() if scheduler else None,
        'loss': loss,
    }, path)

def load_checkpoint(path, model, optimizer=None, scheduler=None):
    ckpt = torch.load(path, map_location='cpu', weights_only=True)
    model.load_state_dict(ckpt['model_state_dict'])
    if optimizer:
        optimizer.load_state_dict(ckpt['optimizer_state_dict'])
    if scheduler and ckpt['scheduler_state_dict']:
        scheduler.load_state_dict(ckpt['scheduler_state_dict'])
    return ckpt['epoch'], ckpt['loss']
# Load GPU model onto CPU
model_cpu = nn.Linear(10, 5)
model_cpu.load_state_dict(
torch.load('model_weights.pt', map_location='cpu', weights_only=True)
)
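A save/load round trip should reproduce the model exactly; this can be verified without touching the filesystem by serializing through an in-memory buffer (the `io.BytesIO` trick is an assumption for the demo, not something the file-based code above requires):

```python
import io
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 5)

# Save and reload the state_dict through an in-memory buffer
buf = io.BytesIO()
torch.save(model.state_dict(), buf)
buf.seek(0)

restored = nn.Linear(10, 5)
restored.load_state_dict(torch.load(buf, weights_only=True))

# The restored model produces bit-identical outputs
x = torch.randn(3, 10)
with torch.no_grad():
    print(torch.equal(model(x), restored(x)))  # True
```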
16. TorchScript and Model Deployment
Deploying trained models to production environments.
import torch
import torch.nn as nn
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 5)

    def forward(self, x):
        return torch.relu(self.fc(x))
model = SimpleNet()
model.eval()
# Option 1: torch.jit.script — compile entire model
scripted_model = torch.jit.script(model)
scripted_model.save('model_scripted.pt')
loaded_scripted = torch.jit.load('model_scripted.pt')
x = torch.randn(4, 10)
with torch.no_grad():
    out = loaded_scripted(x)
print(f"TorchScript output: {out.shape}")
# Option 2: torch.jit.trace — trace with example input
example_input = torch.randn(1, 10)
traced_model = torch.jit.trace(model, example_input)
traced_model.save('model_traced.pt')
# Option 3: ONNX export (cross-framework compatibility)
dummy_input = torch.randn(1, 10)
torch.onnx.export(
model,
dummy_input,
'model.onnx',
export_params = True,
opset_version = 17,
input_names = ['input'],
output_names = ['output'],
dynamic_axes = {
'input': {0: 'batch_size'},
'output': {0: 'batch_size'}
}
)
print("ONNX export complete")
# Option 4: torch.compile (PyTorch 2.0+)
compiled_model = torch.compile(model)
out = compiled_model(x)
print(f"torch.compile output: {out.shape}")
17. Distributed Training (DDP) — DistributedDataParallel
Using multiple GPUs to dramatically accelerate training.
Official tutorial: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
# train_ddp.py — run as a standalone script
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from torchvision import datasets, transforms
def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group(
        backend='nccl',
        rank=rank,
        world_size=world_size
    )

def cleanup():
    dist.destroy_process_group()
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.net(x.view(x.size(0), -1))
def train(rank, world_size, num_epochs=5):
    print(f"Process {rank}/{world_size} starting")
    setup(rank, world_size)
    torch.cuda.set_device(rank)
    device = torch.device(f'cuda:{rank}')
    # Wrap model with DDP
    model = SimpleModel().to(device)
    ddp_model = DDP(model, device_ids=[rank])
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
    # DistributedSampler ensures each process sees a unique data shard
    sampler = DistributedSampler(
        dataset,
        num_replicas=world_size,
        rank=rank,
        shuffle=True
    )
    loader = DataLoader(
        dataset,
        batch_size=128,
        sampler=sampler,
        num_workers=4,
        pin_memory=True
    )
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(ddp_model.parameters(), lr=0.001)
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # shuffle differently each epoch
        ddp_model.train()
        total_loss = 0.0
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = ddp_model(data)
            loss = criterion(output, target)
            loss.backward()  # gradients are automatically all-reduced
            optimizer.step()
            total_loss += loss.item()
        if rank == 0:
            avg_loss = total_loss / len(loader)
            print(f"Epoch {epoch+1}: avg loss = {avg_loss:.4f}")
    cleanup()
if __name__ == '__main__':
    import torch.multiprocessing as mp
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size, 5), nprocs=world_size, join=True)
Launching with torchrun
# Single node, 4 GPUs
torchrun --nproc_per_node=4 train_ddp.py
# Multi-node (node 0 of 2)
torchrun --nnodes=2 --nproc_per_node=4 \
--node_rank=0 \
--master_addr="192.168.1.100" \
--master_port=12355 \
train_ddp.py
DataParallel vs DistributedDataParallel
# DataParallel (DP): simple but inefficient
# - all gradients funnel through GPU 0 → bottleneck
# - multi-thread, not multi-process
model_dp = nn.DataParallel(model, device_ids=[0, 1, 2, 3])
# DistributedDataParallel (DDP): recommended
# - each GPU computes gradients independently
# - efficient all-reduce synchronization
# - faster than DP even on a single machine (multi-process, so it avoids the Python GIL)
model_ddp = DDP(model, device_ids=[rank])
18. Advanced Techniques
Mixed Precision Training
import torch
import torch.nn as nn
from torch.amp import autocast, GradScaler  # torch.cuda.amp is deprecated in recent releases

model = SimpleModel().to('cuda')
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler('cuda')

for data, target in train_loader:
    data, target = data.to('cuda'), target.to('cuda')
    optimizer.zero_grad()
    # Forward pass in reduced precision (FP16/BF16 where safe)
    with autocast('cuda'):
        output = model(data)
        loss = criterion(output, target)
    # Scale the loss to avoid FP16 gradient underflow, then unscale and step
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
Gradient Clipping
# Prevent exploding gradients
max_grad_norm = 1.0
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
optimizer.step()
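`clip_grad_norm_` returns the pre-clip global norm and rescales gradients in place so their global norm is at most `max_norm`. Both behaviors can be checked on a toy model with deliberately large gradients (the setup below is illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
# Manufacture large gradients with a big input
out = model(torch.ones(8, 4) * 100).sum()
out.backward()

# Returns the total gradient norm BEFORE clipping
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(total_norm > 1.0)   # True: the unclipped norm was large

# After clipping, the global gradient norm is at most max_norm
post = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
print(post)               # ≈ 1.0
```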
Reproducibility
import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
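The payoff of seeding is that random operations replay identically, which is the basis of reproducible experiments. A quick check:

```python
import torch

torch.manual_seed(123)
a = torch.rand(3)

torch.manual_seed(123)  # re-seed with the same value
b = torch.rand(3)

print(torch.equal(a, b))  # True: same seed, same random sequence
```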
Conclusion
This guide covered the core concepts of PyTorch from the ground up, all the way to distributed training in production. Here is a recommended learning roadmap:
- Foundations: tensor operations, autograd, simple model implementations
- Intermediate: CNN, RNN, Transfer Learning, DataLoader optimization
- Advanced: Transformer, DDP, Mixed Precision Training
- Deployment: TorchScript, ONNX, torch.compile
The PyTorch ecosystem is continuously evolving. Check the official documentation and PyTorch blog for the latest features and updates.