Introduction
Of the two dominant deep learning frameworks, TensorFlow and PyTorch, **PyTorch** has become the preferred choice for many researchers and engineers. Released by Facebook AI Research (now Meta AI) in 2016, PyTorch quickly became the de facto standard for implementing academic papers and is now widely used in industry as well.
This guide targets readers with basic Python knowledge and walks through **everything from first contact with PyTorch all the way to distributed training**. Each section includes runnable code examples and links to the official documentation so you can read and practice simultaneously.
> Official docs: [https://pytorch.org/docs/stable/index.html](https://pytorch.org/docs/stable/index.html)
> Official tutorials: [https://pytorch.org/tutorials/](https://pytorch.org/tutorials/)
1. Environment Setup
Installing PyTorch
PyTorch can be installed via pip or conda. To use a GPU, select the package matching your CUDA version.
**pip install (CUDA 12.1):**
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
**conda install:**
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
**CPU-only install:**
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
Verifying GPU Availability
import torch
Check PyTorch version
print(f"PyTorch version: {torch.__version__}")
Check CUDA availability
print(f"CUDA available: {torch.cuda.is_available()}")
GPU count and info
if torch.cuda.is_available():
    print(f"GPU count: {torch.cuda.device_count()}")
    print(f"Current GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
Apple Silicon (M1/M2/M3) MPS check
print(f"MPS available: {torch.backends.mps.is_available()}")
Auto-select device
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
print(f"Using device: {device}")
2. Tensor Basics
Tensors are the foundational data structure of PyTorch. They are similar to NumPy's ndarray but support **GPU computation** and **automatic differentiation**.
> Official docs: [https://pytorch.org/docs/stable/tensors.html](https://pytorch.org/docs/stable/tensors.html)
Creating Tensors
From data directly
t1 = torch.tensor([1, 2, 3, 4, 5])
print(f"1D tensor: {t1}, shape: {t1.shape}, dtype: {t1.dtype}")
2D tensor (matrix)
t2 = torch.tensor([[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0]])
print(f"2D tensor:\n{t2}, shape: {t2.shape}")
Special tensor creation
zeros = torch.zeros(3, 4) # all zeros
ones = torch.ones(2, 3) # all ones
rand = torch.rand(3, 3) # uniform [0, 1)
randn = torch.randn(3, 3) # standard normal
eye = torch.eye(4) # identity matrix
arange = torch.arange(0, 10, 2) # [0, 2, 4, 6, 8]
linspace = torch.linspace(0, 1, 5) # 5 evenly spaced values
Create with same shape as existing tensor
t3 = torch.zeros_like(t2)
t4 = torch.ones_like(t2)
t5 = torch.rand_like(t2)
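From NumPy array (shared memory)
import numpy as np
np_arr = np.array([1.0, 2.0, 3.0])
t_from_np = torch.from_numpy(np_arr) # shares memory with np_arr and keeps its float64 dtype
Tensor to NumPy (CPU only; the array also shares memory with the tensor)
np_from_t = t1.numpy()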
Tensor Attributes and Type Conversion
t = torch.rand(3, 4, 5)
print(f"shape: {t.shape}") # torch.Size([3, 4, 5])
print(f"ndim: {t.ndim}") # 3
print(f"dtype: {t.dtype}") # torch.float32
print(f"device: {t.device}") # cpu
print(f"numel: {t.numel()}") # 60 (total elements)
Type conversion
t_int = t.to(torch.int32)
t_long = t.long() # torch.int64
t_float = t.float() # torch.float32
t_double = t.double() # torch.float64
t_half = t.half() # torch.float16
Move to GPU
if torch.cuda.is_available():
t_gpu = t.to("cuda")
t_back = t_gpu.cpu() # back to CPU
Reshaping Tensors
t = torch.arange(24) # 1D tensor 0..23
t_2d = t.reshape(4, 6)
t_3d = t.reshape(2, 3, 4)
t_auto = t.reshape(6, -1) # -1 infers the size (6x4)
view: like reshape but requires contiguous memory
t_view = t.view(3, 8)
squeeze / unsqueeze
t = torch.zeros(1, 3, 1, 4)
t_sq = t.squeeze() # remove size-1 dims → [3, 4]
t_sq1 = t.squeeze(0) # remove dim 0 only → [3, 1, 4]
t_unsq = t_sq.unsqueeze(0) # add dim at 0 → [1, 3, 4]
transpose / permute
t = torch.rand(2, 3, 4)
t_T = t.transpose(0, 1) # [3, 2, 4]
t_perm = t.permute(2, 0, 1) # [4, 2, 3]
t_cont = t_perm.contiguous() # ensure contiguous memory
Tensor Operations
a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([[5.0, 6.0], [7.0, 8.0]])
Element-wise arithmetic
print(a + b) # torch.add(a, b)
print(a - b) # torch.sub(a, b)
print(a * b) # Hadamard product
print(a / b)
print(a ** 2)
Matrix multiplication
matmul = a @ b # or torch.matmul(a, b)
mm = torch.mm(a, b) # 2D only
Reduction
t = torch.rand(3, 4)
print(t.sum())
print(t.mean())
print(t.max())
print(t.std())
print(t.sum(dim=0)) # collapses dim 0 (rows): one sum per column, shape [4]
print(t.sum(dim=1, keepdim=True))
argmax / argmin
print(t.argmax())
print(t.argmax(dim=1))
Broadcasting
a = torch.tensor([[1, 2, 3],
[4, 5, 6]]) # shape: [2, 3]
b = torch.tensor([10, 20, 30]) # shape: [3]
b is broadcast to [2, 3]
print(a + b)
tensor([[11, 22, 33],
[14, 25, 36]])
Column vector + row vector
col = torch.tensor([[1], [2], [3]]) # [3, 1]
row = torch.tensor([10, 20, 30]) # [3]
print(col + row) # [3, 3] outer-sum
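The broadcasting rule is simple enough to write out by hand. The following is a minimal pure-Python sketch, not PyTorch's implementation; the helper name `broadcast_shape` is hypothetical. It aligns the two shapes from the trailing dimension and treats a size of 1 (or a missing dimension) as stretchable.

```python
def broadcast_shape(a, b):
    """Return the shape that two shapes broadcast to, or raise ValueError."""
    result = []
    # Walk both shapes from the trailing dimension; missing dims count as 1.
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"cannot broadcast {a} with {b}")
        result.append(db if da == 1 else da)
    return tuple(reversed(result))

print(broadcast_shape((2, 3), (3,)))  # (2, 3)
print(broadcast_shape((3, 1), (3,)))  # (3, 3)
```

The two calls mirror the examples above: a `[3]` row stretches across a `[2, 3]` matrix, and a `[3, 1]` column combined with a `[3]` row yields a `[3, 3]` result.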
Indexing and Slicing
t = torch.arange(24).reshape(2, 3, 4).float()
print(t[0]) # first matrix [3, 4]
print(t[0, 1]) # second row [4]
print(t[0, 1, 2]) # scalar
print(t[:, 1:, :2]) # slicing
Fancy indexing
indices = torch.tensor([0, 2])
print(t[:, indices, :])
Boolean masking
mask = t > 10
print(t[mask]) # 1D tensor of elements > 10
torch.where
a = torch.tensor([1.0, 2.0, 3.0, 4.0])
b = torch.tensor([10.0, 20.0, 30.0, 40.0])
print(torch.where(a > 2, b, a)) # tensor([ 1., 2., 30., 40.])
3. Automatic Differentiation (Autograd)
Autograd automatically builds a computational graph and computes gradients via backpropagation — the engine behind all neural network training.
> Official docs: [https://pytorch.org/docs/stable/autograd.html](https://pytorch.org/docs/stable/autograd.html)
requires_grad and Computational Graph
x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)
z = x ** 2 + 2 * x * y + y ** 2 # (x + y)^2 = 49
z.backward()
dz/dx = 2x + 2y = 2*3 + 2*4 = 14
print(f"dz/dx = {x.grad}") # 14.0
print(f"dz/dy = {y.grad}") # 14.0
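One way to sanity-check an autograd result is a finite-difference approximation. The sketch below uses plain Python rather than PyTorch and a hypothetical helper `numeric_grad`; it assumes the same function z = x² + 2xy + y² evaluated at x = 3, y = 4.

```python
def z(x, y):
    return x ** 2 + 2 * x * y + y ** 2

def numeric_grad(f, x, y, eps=1e-6):
    """Central-difference estimate of (df/dx, df/dy) at the point (x, y)."""
    dfdx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    dfdy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return dfdx, dfdy

dzdx, dzdy = numeric_grad(z, 3.0, 4.0)
print(round(dzdx, 4), round(dzdy, 4))  # 14.0 14.0
```

Both estimates agree with the analytic value 2x + 2y = 14 that `backward()` reports.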
Gradients for Multi-dimensional Tensors
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2
z = y.sum()
z.backward()
print(f"x.grad: {x.grad}") # [2, 4, 6] (dz/dx = 2x)
Non-scalar backward with gradient argument
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2
grad_output = torch.ones(3)
y.backward(gradient=grad_output)
print(f"x.grad: {x.grad}") # [2, 4, 6]
Gradient Control
x = torch.tensor(2.0, requires_grad=True)
for i in range(3):
    y = x ** 2
    y.backward()
    print(f"iteration {i}: x.grad = {x.grad}")
    x.grad.zero_() # IMPORTANT: reset gradient every step
no_grad: disable gradient tracking for inference
with torch.no_grad():
    y = x ** 2
    print(f"y.requires_grad: {y.requires_grad}") # False
detach: separate tensor from the graph
x = torch.tensor([1.0, 2.0], requires_grad=True)
z = (x * 2).detach()
print(f"z.requires_grad: {z.requires_grad}") # False
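To see why the reset matters, the following sketch runs the same kind of loop without zeroing the gradient. It assumes only `torch`; the values follow from dy/dx = 2x at x = 2.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
accumulated = []
for _ in range(3):
    y = x ** 2
    y.backward()                        # adds dy/dx = 4 to x.grad on every call
    accumulated.append(x.grad.item())   # no zero_() here, so values pile up

print(accumulated)  # [4.0, 8.0, 12.0]
```

Each `backward()` call adds to `x.grad` instead of overwriting it, which is why training loops clear gradients once per step.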
Higher-order Gradients
x = torch.tensor(3.0, requires_grad=True)
y = x ** 4
First derivative: dy/dx = 4x^3
dy_dx = torch.autograd.grad(y, x, create_graph=True)[0]
print(f"First derivative: {dy_dx}") # 108
Second derivative: d2y/dx2 = 12x^2
d2y_dx2 = torch.autograd.grad(dy_dx, x)[0]
print(f"Second derivative: {d2y_dx2}") # 108
4. nn.Module — The Foundation of Neural Networks
`torch.nn.Module` is the base class for all PyTorch models. Every layer, activation function, and complete model inherits from it.
> Official docs: [https://pytorch.org/docs/stable/nn.html](https://pytorch.org/docs/stable/nn.html)
import torch
import torch.nn as nn
class SimpleModel(nn.Module):
    def __init__(self, in_features, hidden_size, out_features):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden_size) # submodules assigned as attributes are registered automatically, along with their parameters
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, out_features)
        self.dropout = nn.Dropout(p=0.5)
    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x
model = SimpleModel(784, 256, 10)
print(model)
Count parameters
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total params: {total:,}")
print(f"Trainable params: {trainable:,}")
Iterate named parameters
for name, param in model.named_parameters():
print(f"{name}: {param.shape}")
Forward pass
x = torch.randn(32, 784)
output = model(x)
print(f"Output shape: {output.shape}") # [32, 10]
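The parameter count printed above can be checked by hand: an `nn.Linear(in, out)` layer holds an `in × out` weight matrix plus `out` biases, while `ReLU` and `Dropout` have no learnable values. A short arithmetic sketch (the helper `linear_params` is hypothetical):

```python
def linear_params(in_features, out_features, bias=True):
    """Number of learnable values in a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

fc1 = linear_params(784, 256)   # 200,960
fc2 = linear_params(256, 10)    # 2,570
print(f"{fc1 + fc2:,}")         # 203,530
```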
Sequential, ModuleList, ModuleDict
Sequential: stack layers in order
seq_model = nn.Sequential(
nn.Linear(784, 256),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(256, 128),
nn.ReLU(),
nn.Linear(128, 10)
)
ModuleList: manage layers as a list
class ResidualNet(nn.Module):
def __init__(self, num_blocks, hidden_size):
super().__init__()
self.layers = nn.ModuleList([
nn.Linear(hidden_size, hidden_size)
for _ in range(num_blocks)
])
self.relu = nn.ReLU()
def forward(self, x):
for layer in self.layers:
x = self.relu(layer(x)) + x # residual connection
return x
ModuleDict: manage layers as a dictionary
class MultiTaskModel(nn.Module):
def __init__(self):
super().__init__()
self.backbone = nn.Linear(784, 256)
self.heads = nn.ModuleDict({
'classification': nn.Linear(256, 10),
'regression': nn.Linear(256, 1)
})
def forward(self, x, task='classification'):
features = torch.relu(self.backbone(x))
return self.heads[task](features)
5. Linear Regression from Scratch
Linear regression is the simplest deep learning model. Building it from scratch solidifies understanding of the training loop.
import torch.optim as optim
torch.manual_seed(42)
n_samples = 200
Generate synthetic data: y = 3x + 2 + noise
X = torch.linspace(-5, 5, n_samples).unsqueeze(1)
y_true = 3 * X + 2
y = y_true + torch.randn_like(y_true) * 0.5
class LinearRegression(nn.Module):
def __init__(self):
super().__init__()
self.linear = nn.Linear(1, 1)
def forward(self, x):
return self.linear(x)
model = LinearRegression()
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
n_epochs = 1000
losses = []
for epoch in range(n_epochs):
    y_pred = model(X) # 1. Forward pass
    loss = criterion(y_pred, y) # 2. Compute loss
    losses.append(loss.item())
    optimizer.zero_grad() # 3. Zero gradients (critical!)
    loss.backward() # 4. Backward pass
    optimizer.step() # 5. Update parameters
    if (epoch + 1) % 200 == 0:
        w = model.linear.weight.item()
        b = model.linear.bias.item()
        print(f"Epoch {epoch+1}: Loss={loss.item():.4f}, w={w:.4f}, b={b:.4f}")
print(f"\nLearned weight: {model.linear.weight.item():.4f} (true: 3.0)")
print(f"Learned bias: {model.linear.bias.item():.4f} (true: 2.0)")
6. Multi-Layer Perceptron (MLP) — MNIST Classification
Building a complete classification model on the MNIST handwritten digit dataset.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
BATCH_SIZE = 64
LEARNING_RATE = 0.001
N_EPOCHS = 10
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST('./data', train=False, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2)
class MLP(nn.Module):
def __init__(self):
super().__init__()
self.network = nn.Sequential(
nn.Flatten(),
nn.Linear(784, 512),
nn.BatchNorm1d(512),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(512, 256),
nn.BatchNorm1d(256),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(256, 10)
)
def forward(self, x):
return self.network(x)
model = MLP().to(DEVICE)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
def train_epoch(model, loader, criterion, optimizer, device):
model.train()
total_loss = 0
correct = 0
total = 0
for data, target in loader:
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
output = model(data)
loss = criterion(output, target)
loss.backward()
optimizer.step()
total_loss += loss.item()
pred = output.argmax(dim=1)
correct += pred.eq(target).sum().item()
total += target.size(0)
return total_loss / len(loader), 100.0 * correct / total
def evaluate(model, loader, criterion, device):
model.eval()
total_loss = 0
correct = 0
total = 0
with torch.no_grad():
for data, target in loader:
data, target = data.to(device), target.to(device)
output = model(data)
total_loss += criterion(output, target).item()
pred = output.argmax(dim=1)
correct += pred.eq(target).sum().item()
total += target.size(0)
return total_loss / len(loader), 100.0 * correct / total
for epoch in range(N_EPOCHS):
train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, DEVICE)
test_loss, test_acc = evaluate(model, test_loader, criterion, DEVICE)
print(f"Epoch {epoch+1}/{N_EPOCHS} | "
f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}% | "
f"Test Loss: {test_loss:.4f}, Test Acc: {test_acc:.2f}%")
7. Convolutional Neural Network (CNN) — CIFAR-10 Classification
Implementing a VGG-style CNN for image classification on CIFAR-10.
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
transform_train = transforms.Compose([
transforms.RandomCrop(32, padding=4),
transforms.RandomHorizontalFlip(),
transforms.ColorJitter(brightness=0.2, contrast=0.2),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465),
(0.2023, 0.1994, 0.2010))
])
transform_test = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465),
(0.2023, 0.1994, 0.2010))
])
train_data = datasets.CIFAR10('./data', train=True, download=True, transform=transform_train)
test_data = datasets.CIFAR10('./data', train=False, transform=transform_test)
train_loader = DataLoader(train_data, batch_size=128, shuffle=True, num_workers=4)
test_loader = DataLoader(test_data, batch_size=128, shuffle=False, num_workers=4)
CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
'dog', 'frog', 'horse', 'ship', 'truck']
class CNN(nn.Module):
def __init__(self, num_classes=10):
super().__init__()
self.features = nn.Sequential(
Block 1: 3 → 64, 32x32 → 16x16
nn.Conv2d(3, 64, kernel_size=3, padding=1),
nn.BatchNorm2d(64),
nn.ReLU(inplace=True),
nn.Conv2d(64, 64, kernel_size=3, padding=1),
nn.BatchNorm2d(64),
nn.ReLU(inplace=True),
nn.MaxPool2d(2, 2),
nn.Dropout2d(0.1),
Block 2: 64 → 128, 16x16 → 8x8
nn.Conv2d(64, 128, kernel_size=3, padding=1),
nn.BatchNorm2d(128),
nn.ReLU(inplace=True),
nn.Conv2d(128, 128, kernel_size=3, padding=1),
nn.BatchNorm2d(128),
nn.ReLU(inplace=True),
nn.MaxPool2d(2, 2),
nn.Dropout2d(0.2),
Block 3: 128 → 256, 8x8 → 4x4
nn.Conv2d(128, 256, kernel_size=3, padding=1),
nn.BatchNorm2d(256),
nn.ReLU(inplace=True),
nn.Conv2d(256, 256, kernel_size=3, padding=1),
nn.BatchNorm2d(256),
nn.ReLU(inplace=True),
nn.MaxPool2d(2, 2),
)
self.classifier = nn.Sequential(
nn.Flatten(),
nn.Linear(256 * 4 * 4, 1024),
nn.ReLU(inplace=True),
nn.Dropout(0.5),
nn.Linear(1024, 512),
nn.ReLU(inplace=True),
nn.Dropout(0.3),
nn.Linear(512, num_classes)
)
def forward(self, x):
x = self.features(x)
x = self.classifier(x)
return x
model = CNN().to(DEVICE)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
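The `256 * 4 * 4` input size of the classifier follows from the standard output-size formula, out = floor((in + 2·padding − kernel) / stride) + 1. The sketch below traces the spatial size through the three blocks in plain Python; the helper `conv_out` is hypothetical and assumes square inputs and kernels.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32
sizes = []
for _ in range(3):                               # three blocks
    size = conv_out(size, kernel=3, padding=1)   # 3x3 conv with padding 1 keeps the size
    size = conv_out(size, kernel=3, padding=1)
    size = conv_out(size, kernel=2, stride=2)    # 2x2 max-pool halves it
    sizes.append(size)

print(sizes)              # [16, 8, 4]
print(256 * size * size)  # 4096
```

If the input resolution or the number of pooling layers changes, the first `nn.Linear` in the classifier has to change with it.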
8. Recurrent Neural Networks (RNN / LSTM) — Sequential Data
RNNs and LSTMs excel at time-series data and natural language processing tasks.
class LSTMPredictor(nn.Module):
def __init__(self, input_size=1, hidden_size=64, num_layers=2,
output_size=1, dropout=0.2):
super().__init__()
self.hidden_size = hidden_size
self.num_layers = num_layers
self.lstm = nn.LSTM(
input_size = input_size,
hidden_size = hidden_size,
num_layers = num_layers,
batch_first = True, # input: [batch, seq_len, features]
dropout = dropout if num_layers > 1 else 0,
bidirectional = False
)
self.fc = nn.Sequential(
nn.Linear(hidden_size, 32),
nn.ReLU(),
nn.Linear(32, output_size)
)
def forward(self, x):
batch_size = x.size(0)
h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(x.device)
c0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(x.device)
out: [batch_size, seq_len, hidden_size]
out, (hn, cn) = self.lstm(x, (h0, c0))
Use only the last time step
out = self.fc(out[:, -1, :])
return out
Generate sine wave dataset
t = np.linspace(0, 100, 1000)
data = np.sin(0.5 * t) + 0.1 * np.random.randn(1000)
data = torch.FloatTensor(data).unsqueeze(1)
def create_sequences(data, seq_len=50):
X, y = [], []
for i in range(len(data) - seq_len):
X.append(data[i:i+seq_len])
y.append(data[i+seq_len])
return torch.stack(X), torch.stack(y)
X, y = create_sequences(data, seq_len=50)
print(f"X shape: {X.shape}") # [950, 50, 1]
print(f"y shape: {y.shape}") # [950, 1]
GRU — fewer parameters than LSTM
class GRUPredictor(nn.Module):
def __init__(self, input_size=1, hidden_size=64, num_layers=2):
super().__init__()
self.gru = nn.GRU(input_size, hidden_size, num_layers,
batch_first=True, dropout=0.2)
self.fc = nn.Linear(hidden_size, 1)
def forward(self, x):
out, _ = self.gru(x)
return self.fc(out[:, -1, :])
9. Transformer — Multi-head Attention from Scratch
Implementing the key components of the "Attention Is All You Need" paper.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
class MultiHeadAttention(nn.Module):
def __init__(self, d_model=512, num_heads=8, dropout=0.1):
super().__init__()
assert d_model % num_heads == 0
self.d_model = d_model
self.num_heads = num_heads
self.d_k = d_model // num_heads
self.W_q = nn.Linear(d_model, d_model)
self.W_k = nn.Linear(d_model, d_model)
self.W_v = nn.Linear(d_model, d_model)
self.W_o = nn.Linear(d_model, d_model)
self.dropout = nn.Dropout(dropout)
self.scale = math.sqrt(self.d_k)
def split_heads(self, x):
[batch, seq, d_model] → [batch, num_heads, seq, d_k]
batch, seq, _ = x.shape
x = x.view(batch, seq, self.num_heads, self.d_k)
return x.transpose(1, 2)
def forward(self, query, key, value, mask=None):
batch_size = query.size(0)
Q = self.split_heads(self.W_q(query))
K = self.split_heads(self.W_k(key))
V = self.split_heads(self.W_v(value))
scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
if mask is not None:
scores = scores.masked_fill(mask == 0, float('-inf'))
attn_weights = F.softmax(scores, dim=-1)
attn_weights = self.dropout(attn_weights)
context = torch.matmul(attn_weights, V)
context = context.transpose(1, 2).contiguous()
context = context.view(batch_size, -1, self.d_model)
output = self.W_o(context)
return output, attn_weights
class FeedForward(nn.Module):
def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
super().__init__()
self.net = nn.Sequential(
nn.Linear(d_model, d_ff),
nn.ReLU(),
nn.Dropout(dropout),
nn.Linear(d_ff, d_model)
)
def forward(self, x):
return self.net(x)
class TransformerEncoderLayer(nn.Module):
def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
super().__init__()
self.self_attn = MultiHeadAttention(d_model, num_heads, dropout)
self.ff = FeedForward(d_model, d_ff, dropout)
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
self.dropout = nn.Dropout(dropout)
def forward(self, x, mask=None):
attn_out, _ = self.self_attn(x, x, x, mask)
x = self.norm1(x + self.dropout(attn_out))
ff_out = self.ff(x)
x = self.norm2(x + self.dropout(ff_out))
return x
class PositionalEncoding(nn.Module):
def __init__(self, d_model=512, max_len=5000, dropout=0.1):
super().__init__()
self.dropout = nn.Dropout(dropout)
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len).unsqueeze(1).float()
div_term = torch.exp(
torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
pe = pe.unsqueeze(0)
self.register_buffer('pe', pe)
def forward(self, x):
x = x + self.pe[:, :x.size(1)]
return self.dropout(x)
Example usage
d_model = 512
encoder_layer = TransformerEncoderLayer(d_model=d_model, num_heads=8)
pos_enc = PositionalEncoding(d_model=d_model)
x = torch.randn(2, 10, d_model)
x = pos_enc(x)
output = encoder_layer(x)
print(f"Transformer Encoder output: {output.shape}") # [2, 10, 512]
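The `mask` argument is what turns this attention into decoder-style (causal) attention. Below is a self-contained sketch of scaled dot-product attention with a lower-triangular mask. It does not reuse the classes above, and the function name `causal_attention` is hypothetical.

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """Scaled dot-product attention where position i sees only positions <= i."""
    seq_len, d_k = q.size(-2), q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    mask = torch.tril(torch.ones(seq_len, seq_len))   # 1 = visible, 0 = hidden
    scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights

torch.manual_seed(0)
q = k = v = torch.randn(2, 5, 16)   # [batch, seq, d_k]
out, weights = causal_attention(q, k, v)
print(out.shape)       # torch.Size([2, 5, 16])
print(weights[0, 0])   # the first position attends only to itself
```

A `[seq, seq]` mask built the same way should also work with the `MultiHeadAttention.forward` shown earlier, since it broadcasts over the `[batch, heads, seq, seq]` score tensor.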
10. Data Loading — Dataset and DataLoader
An efficient data pipeline keeps the GPU supplied with batches; a slow loader leaves the accelerator idle no matter how fast the model is.
> Official tutorial: [https://pytorch.org/tutorials/beginner/basics/intro.html](https://pytorch.org/tutorials/beginner/basics/intro.html)
import os
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image
Custom image Dataset
class CustomImageDataset(Dataset):
def __init__(self, csv_file, img_dir, transform=None):
self.annotations = pd.read_csv(csv_file)
self.img_dir = img_dir
self.transform = transform
def __len__(self):
return len(self.annotations)
def __getitem__(self, idx):
img_path = os.path.join(self.img_dir, self.annotations.iloc[idx, 0])
image = Image.open(img_path).convert('RGB')
label = int(self.annotations.iloc[idx, 1])
if self.transform:
image = self.transform(image)
return image, label
Tabular Dataset
class TabularDataset(Dataset):
def __init__(self, X, y):
self.X = torch.FloatTensor(X)
self.y = torch.LongTensor(y)
def __len__(self):
return len(self.X)
def __getitem__(self, idx):
return self.X[idx], self.y[idx]
dataset = TabularDataset(
X=np.random.randn(1000, 20),
y=np.random.randint(0, 5, 1000)
)
Advanced DataLoader settings
advanced_loader = DataLoader(
dataset,
batch_size = 64,
shuffle = True,
num_workers = 4, # parallel data loading
pin_memory = True, # faster GPU transfer
drop_last = True, # drop incomplete last batch
prefetch_factor = 2,
persistent_workers = True
)
for batch_X, batch_y in advanced_loader:
print(f"batch X: {batch_X.shape}") # [64, 20]
print(f"batch y: {batch_y.shape}") # [64]
break
WeightedRandomSampler: handle class imbalance
from torch.utils.data import WeightedRandomSampler
class_counts = torch.bincount(dataset.y) # samples per class, computed from the labels (5 classes here)
weights = 1.0 / class_counts.float() # rarer classes get larger weights
sample_weights = weights[dataset.y] # one weight per sample
sampler = WeightedRandomSampler(
    weights = sample_weights,
    num_samples = len(dataset),
    replacement = True
)
balanced_loader = DataLoader(dataset, batch_size=32, sampler=sampler)
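The inverse-frequency weighting can be checked with plain arithmetic. This sketch uses only the standard library and made-up labels (not the dataset above) to show that weighting each sample by 1 / (count of its class) gives every class the same total sampling mass.

```python
from collections import Counter

labels = [0] * 8 + [1] * 2            # made-up labels: class 0 is 4x more common
counts = Counter(labels)              # Counter({0: 8, 1: 2})
sample_weights = [1.0 / counts[c] for c in labels]

# Total sampling mass per class: the rare class is no longer under-represented.
mass = {
    c: sum(w for w, label in zip(sample_weights, labels) if label == c)
    for c in counts
}
print(mass)  # {0: 1.0, 1: 1.0}
```

With equal mass per class, a sampler drawing with replacement is expected to return the two classes about equally often.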
11. Optimizers — SGD, Adam, AdamW
> Official docs: [https://pytorch.org/docs/stable/optim.html](https://pytorch.org/docs/stable/optim.html)
model = nn.Linear(100, 10)
SGD with momentum and weight decay
sgd = optim.SGD(
model.parameters(),
lr = 0.01,
momentum = 0.9,
weight_decay = 1e-4,
nesterov = True
)
Adam: adaptive learning rates
adam = optim.Adam(
model.parameters(),
lr = 0.001,
betas = (0.9, 0.999),
eps = 1e-8,
weight_decay = 0
)
AdamW: correct decoupled weight decay (recommended for Transformers)
adamw = optim.AdamW(
model.parameters(),
lr = 1e-3,
betas = (0.9, 0.999),
weight_decay = 0.01
)
Per-parameter learning rates (useful for Transfer Learning)
cnn = CNN() # nn.Linear has no `features`/`classifier` submodules, so this uses the CNN class from section 7
optimizer = optim.Adam([
    {'params': cnn.features.parameters(), 'lr': 1e-4},
    {'params': cnn.classifier.parameters(), 'lr': 1e-3},
], lr=1e-3)
Save and restore optimizer state
checkpoint = {
    'model': cnn.state_dict(),
    'optimizer': optimizer.state_dict(),
    'epoch': 10
}
torch.save(checkpoint, 'checkpoint.pt')
ckpt = torch.load('checkpoint.pt')
cnn.load_state_dict(ckpt['model'])
optimizer.load_state_dict(ckpt['optimizer'])
12. Learning Rate Schedulers
A learning rate schedule often improves final performance compared to a fixed learning rate, though the gain depends on the model and optimizer.
from torch.optim.lr_scheduler import (
StepLR, CosineAnnealingLR, OneCycleLR, ReduceLROnPlateau,
CosineAnnealingWarmRestarts
)
model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)
StepLR: multiply LR by gamma every step_size epochs
step_scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
CosineAnnealingLR: cosine decay
cosine_scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)
ReduceLROnPlateau: reduce when metric stops improving
plateau_scheduler = ReduceLROnPlateau(
    optimizer,
    mode = 'min',
    factor = 0.5,
    patience = 10,
    min_lr = 1e-7 # the `verbose` flag is deprecated in recent PyTorch releases; log optimizer.param_groups[0]['lr'] instead
)
OneCycleLR: super-convergence
one_cycle = OneCycleLR(
optimizer,
max_lr = 0.01,
steps_per_epoch = 100,
epochs = 30,
pct_start = 0.3,
anneal_strategy = 'cos'
)
CosineAnnealingWarmRestarts: periodic restarts
warm_restart = CosineAnnealingWarmRestarts(
optimizer,
T_0 = 10,
T_mult = 2,
eta_min = 1e-6
)
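Usage in training loop (attach only ONE scheduler to an optimizer in practice; the schedulers above share one only to show their constructors)
for epoch in range(100):
    train_loss = 0.5 # placeholder: run the training batches here, calling optimizer.step()
    cosine_scheduler.step() # epoch-based schedulers take no argument
    # with ReduceLROnPlateau, call plateau_scheduler.step(val_loss) here instead
    print(f"Epoch {epoch+1}: LR = {optimizer.param_groups[0]['lr']:.6f}")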
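The cosine schedule has a simple closed form, which helps when choosing `T_max` and `eta_min`. The sketch below evaluates that formula in plain Python, using the values passed to `CosineAnnealingLR` above as defaults; the helper `cosine_lr` is hypothetical and is not the scheduler's own code.

```python
import math

def cosine_lr(step, base_lr=0.1, eta_min=1e-6, t_max=100):
    """Closed form: eta_min + (base_lr - eta_min) * (1 + cos(pi * step / t_max)) / 2."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max)) / 2

for step in (0, 25, 50, 75, 100):
    print(step, f"{cosine_lr(step):.6f}")
```

The rate starts at `base_lr`, passes roughly the midpoint at `t_max / 2`, and reaches `eta_min` at `step == t_max`.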
13. Regularization — Dropout, BatchNorm, LayerNorm
Regularization prevents overfitting and stabilizes training.
Dropout: randomly zero out neurons during training
class DropoutDemo(nn.Module):
def __init__(self):
super().__init__()
self.fc1 = nn.Linear(100, 50)
self.dropout = nn.Dropout(p=0.5)
self.fc2 = nn.Linear(50, 10)
def forward(self, x):
x = torch.relu(self.fc1(x))
x = self.dropout(x) # active during train(), inactive during eval()
return self.fc2(x)
BatchNorm1d: normalize over the batch dimension (for FC layers)
bn_model = nn.Sequential(
nn.Linear(100, 64),
nn.BatchNorm1d(64),
nn.ReLU(),
nn.Linear(64, 32),
nn.BatchNorm1d(32),
nn.ReLU(),
nn.Linear(32, 10)
)
BatchNorm2d: for 2D feature maps (after Conv layers)
cnn_bn = nn.Sequential(
nn.Conv2d(3, 32, 3, padding=1),
nn.BatchNorm2d(32),
nn.ReLU(),
)
LayerNorm: normalize over feature dimension (preferred for Transformers)
transformer_norm = nn.Sequential(
nn.Linear(512, 512),
nn.LayerNorm(512),
nn.ReLU()
)
GroupNorm: a middle ground between BatchNorm and LayerNorm
group_norm = nn.GroupNorm(num_groups=8, num_channels=64)
InstanceNorm: used in style transfer
instance_norm = nn.InstanceNorm2d(64)
Summary:
BatchNorm → CNN, batch-level statistics, depends on batch size
LayerNorm → Transformers / RNNs, feature-level statistics
GroupNorm → small batches where BatchNorm is unstable
InstanceNorm → style transfer, image generation
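The difference between batch-level and feature-level statistics is easy to observe directly. This small sketch assumes only `torch` and random input; it applies `BatchNorm1d` and `LayerNorm` to the same tensor and checks along which axis the result has zero mean.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 4) * 3 + 5   # [batch, features], shifted and scaled on purpose

bn = nn.BatchNorm1d(4)          # modules start in training mode: uses batch statistics
ln = nn.LayerNorm(4)

bn_out = bn(x)
ln_out = ln(x)

print(bn_out.mean(dim=0))   # close to 0 for each feature (statistics over the batch)
print(ln_out.mean(dim=1))   # close to 0 for each sample (statistics over the features)
```

Because `BatchNorm1d` depends on the other samples in the batch, its output for one sample changes with batch composition, whereas `LayerNorm` treats each sample independently.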
14. Transfer Learning
Leveraging ImageNet-pretrained models to achieve high performance with limited data.
Load pretrained models
from torchvision import models
resnet50 = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
efficientnet = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
vit = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
Strategy 1: Feature Extractor (freeze backbone)
for param in resnet50.parameters():
param.requires_grad = False
num_classes = 5
resnet50.fc = nn.Linear(resnet50.fc.in_features, num_classes)
trainable = sum(p.numel() for p in resnet50.parameters() if p.requires_grad)
print(f"Trainable params: {trainable:,}") # 10,245 (2048 * 5 weights + 5 biases)
Strategy 2: Fine-tuning with layer-wise learning rates
resnet_ft = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
resnet_ft.fc = nn.Linear(resnet_ft.fc.in_features, num_classes)
optimizer = torch.optim.AdamW([
    {'params': list(resnet_ft.conv1.parameters()) + list(resnet_ft.bn1.parameters()), 'lr': 1e-5}, # stem: parameters not listed in any group are never updated
    {'params': resnet_ft.layer1.parameters(), 'lr': 1e-5},
    {'params': resnet_ft.layer2.parameters(), 'lr': 1e-5},
    {'params': resnet_ft.layer3.parameters(), 'lr': 1e-4},
    {'params': resnet_ft.layer4.parameters(), 'lr': 1e-4},
    {'params': resnet_ft.fc.parameters(), 'lr': 1e-3},
], lr=1e-4, weight_decay=0.01)
ImageNet normalization for preprocessing
from torchvision import transforms
preprocess = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std =[0.229, 0.224, 0.225])
])
15. Saving and Loading Models
model = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters())
Option 1: state_dict (recommended)
torch.save(model.state_dict(), 'model_weights.pt')
loaded_model = nn.Linear(10, 5)
loaded_model.load_state_dict(torch.load('model_weights.pt', weights_only=True))
loaded_model.eval()
Option 2: full model (not recommended — low portability)
torch.save(model, 'full_model.pt')
Option 3: checkpoint — save full training state
def save_checkpoint(model, optimizer, scheduler, epoch, loss, path):
torch.save({
'epoch': epoch,
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'scheduler_state_dict': scheduler.state_dict() if scheduler else None,
'loss': loss,
}, path)
def load_checkpoint(path, model, optimizer=None, scheduler=None):
ckpt = torch.load(path, map_location='cpu', weights_only=True)
model.load_state_dict(ckpt['model_state_dict'])
if optimizer:
optimizer.load_state_dict(ckpt['optimizer_state_dict'])
if scheduler and ckpt['scheduler_state_dict']:
scheduler.load_state_dict(ckpt['scheduler_state_dict'])
return ckpt['epoch'], ckpt['loss']
Load GPU model onto CPU
model_cpu = nn.Linear(10, 5)
model_cpu.load_state_dict(
torch.load('model_weights.pt', map_location='cpu', weights_only=True)
)
16. TorchScript and Model Deployment
Deploying trained models to production environments.
class SimpleNet(nn.Module):
def __init__(self):
super().__init__()
self.fc = nn.Linear(10, 5)
def forward(self, x):
return torch.relu(self.fc(x))
model = SimpleNet()
model.eval()
Option 1: torch.jit.script — compile entire model
scripted_model = torch.jit.script(model)
scripted_model.save('model_scripted.pt')
loaded_scripted = torch.jit.load('model_scripted.pt')
x = torch.randn(4, 10)
with torch.no_grad():
out = loaded_scripted(x)
print(f"TorchScript output: {out.shape}")
Option 2: torch.jit.trace — trace with example input
example_input = torch.randn(1, 10)
traced_model = torch.jit.trace(model, example_input)
traced_model.save('model_traced.pt')
Option 3: ONNX export (cross-framework compatibility)
dummy_input = torch.randn(1, 10)
torch.onnx.export(
model,
dummy_input,
'model.onnx',
export_params = True,
opset_version = 17,
input_names = ['input'],
output_names = ['output'],
dynamic_axes = {
'input': {0: 'batch_size'},
'output': {0: 'batch_size'}
}
)
print("ONNX export complete")
Option 4: torch.compile (PyTorch 2.0+)
compiled_model = torch.compile(model)
out = compiled_model(x)
print(f"torch.compile output: {out.shape}")
17. Distributed Training (DDP) — DistributedDataParallel
Using multiple GPUs to dramatically accelerate training.
> Official tutorial: [https://pytorch.org/tutorials/intermediate/ddp_tutorial.html](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)
train_ddp.py — run as a standalone script
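import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP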
from torch.utils.data import DataLoader, DistributedSampler
from torchvision import datasets, transforms
def setup(rank, world_size):
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'
dist.init_process_group(
backend = 'nccl',
rank = rank,
world_size = world_size
)
def cleanup():
dist.destroy_process_group()
class SimpleModel(nn.Module):
def __init__(self):
super().__init__()
self.net = nn.Sequential(
nn.Linear(784, 256),
nn.ReLU(),
nn.Linear(256, 10)
)
def forward(self, x):
return self.net(x.view(x.size(0), -1))
def train(rank, world_size, num_epochs=5):
print(f"Process {rank}/{world_size} starting")
setup(rank, world_size)
torch.cuda.set_device(rank)
device = torch.device(f'cuda:{rank}')
Wrap model with DDP
model = SimpleModel().to(device)
ddp_model = DDP(model, device_ids=[rank])
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
DistributedSampler ensures each process sees a unique data shard
sampler = DistributedSampler(
dataset,
num_replicas = world_size,
rank = rank,
shuffle = True
)
loader = DataLoader(
dataset,
batch_size = 128,
sampler = sampler,
num_workers = 4,
pin_memory = True
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(ddp_model.parameters(), lr=0.001)
for epoch in range(num_epochs):
sampler.set_epoch(epoch) # shuffle differently each epoch
ddp_model.train()
total_loss = 0.0
for data, target in loader:
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
output = ddp_model(data)
loss = criterion(output, target)
loss.backward() # gradients are automatically all-reduced
optimizer.step()
total_loss += loss.item()
if rank == 0:
avg_loss = total_loss / len(loader)
print(f"Epoch {epoch+1}: avg loss = {avg_loss:.4f}")
cleanup()
if __name__ == '__main__':
world_size = torch.cuda.device_count()
mp.spawn(train, args=(world_size, 5), nprocs=world_size, join=True)
Launching with torchrun
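Single node, 4 GPUs. torchrun starts one process per GPU and sets RANK, LOCAL_RANK, and WORLD_SIZE, so a torchrun-launched script should read those variables and call train() directly instead of using mp.spawn.
torchrun --nproc_per_node=4 train_ddp.py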
Multi-node (node 0 of 2)
torchrun --nnodes=2 --nproc_per_node=4 \
--node_rank=0 \
--master_addr="192.168.1.100" \
--master_port=12355 \
train_ddp.py
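torchrun passes each worker its identity through environment variables rather than function arguments. The following is a minimal sketch of how an entry point might read them. The names `DistEnv` and `read_torchrun_env` are hypothetical, and the commented lines only indicate where the `setup` and `train` logic from the script above would go.

```python
import os
from dataclasses import dataclass

@dataclass
class DistEnv:
    rank: int         # global rank across all nodes
    local_rank: int   # GPU index on this node
    world_size: int   # total number of processes

def read_torchrun_env(env=os.environ):
    """Read the variables torchrun sets for each worker (defaults: single process)."""
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )

if __name__ == "__main__":
    cfg = read_torchrun_env()
    # Under torchrun, MASTER_ADDR and MASTER_PORT are already set, so the script would call
    #   dist.init_process_group(backend="nccl")
    #   torch.cuda.set_device(cfg.local_rank)
    # and then run the training loop directly, with no mp.spawn.
    print(cfg)
```

Run without torchrun, the defaults describe a single process, which keeps the same script usable for local debugging.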
DataParallel vs DistributedDataParallel
DataParallel (DP): simple but inefficient
- all gradients funnel through GPU 0 → bottleneck
- multi-thread, not multi-process
model_dp = nn.DataParallel(model, device_ids=[0, 1, 2, 3])
DistributedDataParallel (DDP): recommended
- each GPU computes gradients independently
- efficient all-reduce synchronization
- typically faster than DP even on a single machine, since one process per GPU avoids Python GIL contention
model_ddp = DDP(model, device_ids=[rank])
18. Advanced Techniques
Mixed Precision Training
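model = SimpleModel().to('cuda') # the SimpleModel defined in section 17
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()
scaler = torch.amp.GradScaler('cuda') # torch.cuda.amp.GradScaler and torch.cuda.amp.autocast are deprecated in recent PyTorch releases
for data, target in train_loader:
    data, target = data.to('cuda'), target.to('cuda')
    optimizer.zero_grad()
    with torch.autocast(device_type='cuda', dtype=torch.float16): # forward pass in mixed precision: eligible ops run in FP16
        output = model(data)
        loss = criterion(output, target)
    scaler.scale(loss).backward() # scale the loss so small FP16 gradients do not underflow
    scaler.step(optimizer) # unscales gradients and skips the step if they contain inf/NaN
    scaler.update()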
Gradient Clipping
Prevent exploding gradients
max_grad_norm = 1.0
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
optimizer.step()
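Norm-based clipping rescales all gradients by one common factor so that their combined L2 norm does not exceed the threshold. The sketch below reproduces that arithmetic on a flat list of numbers in plain Python; `clip_by_global_norm` is a hypothetical helper written for illustration, not the PyTorch function.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scales gradients up
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before)  # 5.0
print(clipped)      # roughly [0.6, 0.8]
```

The direction of the gradient is preserved and only its length changes, which is why clipping by norm is usually preferred over clamping each value separately.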
Reproducibility
import random
import numpy as np
import torch
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
set_seed(42)
Conclusion
This guide covered the core concepts of PyTorch from the ground up, all the way to distributed training in production. Here is a recommended learning roadmap:
1. **Foundations**: tensor operations, autograd, simple model implementations
2. **Intermediate**: CNN, RNN, Transfer Learning, DataLoader optimization
3. **Advanced**: Transformer, DDP, Mixed Precision Training
4. **Deployment**: TorchScript, ONNX, torch.compile
The PyTorch ecosystem is continuously evolving. Check the official documentation and PyTorch blog for the latest features and updates.
References
- [PyTorch Official Docs](https://pytorch.org/docs/stable/index.html)
- [PyTorch Tutorials](https://pytorch.org/tutorials/)
- [Tensor API Reference](https://pytorch.org/docs/stable/tensors.html)
- [Autograd Mechanics](https://pytorch.org/docs/stable/autograd.html)
- [nn.Module Reference](https://pytorch.org/docs/stable/nn.html)
- [Optimizer Reference](https://pytorch.org/docs/stable/optim.html)
- [DDP Tutorial](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)
- [Intro Tutorial](https://pytorch.org/tutorials/beginner/basics/intro.html)