PyTorch Complete Guide: Zero to Hero — From Tensors to Distributed Training

Introduction

Of the two dominant deep learning frameworks — TensorFlow and PyTorch — PyTorch has become the preferred choice for researchers and engineers alike. Released by Facebook AI Research (now Meta AI) in 2016, PyTorch quickly became the standard for implementing academic papers and has since seen broad industrial adoption as well.

This guide targets readers with basic Python knowledge and walks through everything from first contact with PyTorch all the way to distributed training. Each section includes runnable code examples and links to the official documentation so you can read and practice simultaneously.

Official docs: https://pytorch.org/docs/stable/index.html
Official tutorials: https://pytorch.org/tutorials/


1. Environment Setup

Installing PyTorch

PyTorch can be installed via pip or conda. To use a GPU, select the package matching your CUDA version.

pip install (CUDA 12.1):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

conda install:

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

CPU-only install:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

Verifying GPU Availability

import torch

# Check PyTorch version
print(f"PyTorch version: {torch.__version__}")

# Check CUDA availability
print(f"CUDA available: {torch.cuda.is_available()}")

# GPU count and info
if torch.cuda.is_available():
    print(f"GPU count: {torch.cuda.device_count()}")
    print(f"Current GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Apple Silicon (M1/M2/M3) MPS check
print(f"MPS available: {torch.backends.mps.is_available()}")

# Auto-select device
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Using device: {device}")

2. Tensor Basics

Tensors are the foundational data structure of PyTorch. They are similar to NumPy's ndarray but support GPU computation and automatic differentiation.

Official docs: https://pytorch.org/docs/stable/tensors.html

Creating Tensors

import torch
import numpy as np

# From data directly
t1 = torch.tensor([1, 2, 3, 4, 5])
print(f"1D tensor: {t1}, shape: {t1.shape}, dtype: {t1.dtype}")

# 2D tensor (matrix)
t2 = torch.tensor([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
print(f"2D tensor:\n{t2}, shape: {t2.shape}")

# Special tensor creation
zeros = torch.zeros(3, 4)           # all zeros
ones = torch.ones(2, 3)             # all ones
rand = torch.rand(3, 3)             # uniform [0, 1)
randn = torch.randn(3, 3)           # standard normal
eye = torch.eye(4)                   # identity matrix
arange = torch.arange(0, 10, 2)     # [0, 2, 4, 6, 8]
linspace = torch.linspace(0, 1, 5)  # 5 evenly spaced values

# Create with same shape as existing tensor
t3 = torch.zeros_like(t2)
t4 = torch.ones_like(t2)
t5 = torch.rand_like(t2)

# From NumPy array (shared memory)
np_arr = np.array([1.0, 2.0, 3.0])
t_from_np = torch.from_numpy(np_arr)

# Tensor to NumPy (CPU only)
np_from_t = t1.numpy()
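
Because `from_numpy` shares the underlying buffer rather than copying it, in-place mutations on either side are visible on the other. A quick self-contained check:

```python
import numpy as np
import torch

np_arr = np.array([1.0, 2.0, 3.0])
t_shared = torch.from_numpy(np_arr)   # zero-copy: same underlying buffer

np_arr[0] = 100.0                     # mutate via NumPy
print(t_shared[0].item())             # 100.0 — visible through the tensor

t_shared[1] = -5.0                    # mutate via the tensor
print(np_arr[1])                      # -5.0 — visible through the array
```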

Tensor Attributes and Type Conversion

t = torch.rand(3, 4, 5)

print(f"shape: {t.shape}")      # torch.Size([3, 4, 5])
print(f"ndim: {t.ndim}")        # 3
print(f"dtype: {t.dtype}")      # torch.float32
print(f"device: {t.device}")    # cpu
print(f"numel: {t.numel()}")    # 60 (total elements)

# Type conversion
t_int    = t.to(torch.int32)
t_long   = t.long()     # torch.int64
t_float  = t.float()    # torch.float32
t_double = t.double()   # torch.float64
t_half   = t.half()     # torch.float16

# Move to GPU
if torch.cuda.is_available():
    t_gpu  = t.to("cuda")
    t_back = t_gpu.cpu()  # back to CPU

Reshaping Tensors

t = torch.arange(24)  # 1D tensor 0..23

t_2d   = t.reshape(4, 6)
t_3d   = t.reshape(2, 3, 4)
t_auto = t.reshape(6, -1)   # -1 infers the size (6x4)

# view: like reshape but requires contiguous memory
t_view = t.view(3, 8)

# squeeze / unsqueeze
t = torch.zeros(1, 3, 1, 4)
t_sq    = t.squeeze()         # remove size-1 dims → [3, 4]
t_sq1   = t.squeeze(0)        # remove dim 0 only → [3, 1, 4]
t_unsq  = t_sq.unsqueeze(0)   # add dim at 0 → [1, 3, 4]

# transpose / permute
t = torch.rand(2, 3, 4)
t_T    = t.transpose(0, 1)      # [3, 2, 4]
t_perm = t.permute(2, 0, 1)     # [4, 2, 3]
t_cont = t_perm.contiguous()    # ensure contiguous memory
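
The difference between `view`, `reshape`, and `contiguous` shows up exactly on permuted tensors like `t_perm` above; a self-contained sketch:

```python
import torch

t = torch.rand(2, 3, 4)
t_perm = t.permute(2, 0, 1)       # non-contiguous view of the same storage

print(t_perm.is_contiguous())     # False
try:
    t_perm.view(24)               # view requires contiguous memory
except RuntimeError as e:
    print("view failed:", type(e).__name__)

flat  = t_perm.contiguous().view(24)  # works after contiguous()
flat2 = t_perm.reshape(24)            # reshape copies when it has to
print(flat.shape, flat2.shape)
```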

Tensor Operations

a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([[5.0, 6.0], [7.0, 8.0]])

# Element-wise arithmetic
print(a + b)   # torch.add(a, b)
print(a - b)   # torch.sub(a, b)
print(a * b)   # Hadamard product
print(a / b)
print(a ** 2)

# Matrix multiplication
matmul = a @ b          # or torch.matmul(a, b)
mm     = torch.mm(a, b) # 2D only

# Reduction
t = torch.rand(3, 4)
print(t.sum())
print(t.mean())
print(t.max())
print(t.std())
print(t.sum(dim=0))                # reduce dim 0 → per-column sums, shape [4]
print(t.sum(dim=1, keepdim=True))  # per-row sums, shape [3, 1]

# argmax / argmin
print(t.argmax())
print(t.argmax(dim=1))

Broadcasting

a = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])   # shape: [2, 3]
b = torch.tensor([10, 20, 30])  # shape: [3]

# b is broadcast to [2, 3]
print(a + b)
# tensor([[11, 22, 33],
#         [14, 25, 36]])

# Column vector + row vector
col = torch.tensor([[1], [2], [3]])  # [3, 1]
row = torch.tensor([10, 20, 30])      # [3]
print(col + row)  # [3, 3] outer-sum

Indexing and Slicing

t = torch.arange(24).reshape(2, 3, 4).float()

print(t[0])          # first matrix [3, 4]
print(t[0, 1])       # second row [4]
print(t[0, 1, 2])    # scalar

print(t[:, 1:, :2])  # slicing

# Fancy indexing
indices = torch.tensor([0, 2])
print(t[:, indices, :])

# Boolean masking
mask   = t > 10
print(t[mask])       # 1D tensor of elements > 10

# torch.where
a = torch.tensor([1.0, 2.0, 3.0, 4.0])
b = torch.tensor([10.0, 20.0, 30.0, 40.0])
print(torch.where(a > 2, b, a))  # tensor([ 1.,  2., 30., 40.])

3. Automatic Differentiation (Autograd)

Autograd automatically builds a computational graph and computes gradients via backpropagation — the engine behind all neural network training.

Official docs: https://pytorch.org/docs/stable/autograd.html

requires_grad and Computational Graph

import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)

z = x ** 2 + 2 * x * y + y ** 2  # (x + y)^2 = 49

z.backward()

# dz/dx = 2x + 2y = 2*3 + 2*4 = 14
print(f"dz/dx = {x.grad}")  # 14.0
print(f"dz/dy = {y.grad}")  # 14.0

Gradients for Multi-dimensional Tensors

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2
z = y.sum()

z.backward()
print(f"x.grad: {x.grad}")  # [2, 4, 6]  (dz/dx = 2x)

# Non-scalar backward with gradient argument
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2
grad_output = torch.ones(3)
y.backward(gradient=grad_output)
print(f"x.grad: {x.grad}")  # [2, 4, 6]

Gradient Control

x = torch.tensor(2.0, requires_grad=True)

for i in range(3):
    y = x ** 2
    y.backward()
    print(f"iteration {i}: x.grad = {x.grad}")
    x.grad.zero_()  # IMPORTANT: reset gradient every step

# no_grad: disable gradient tracking for inference
with torch.no_grad():
    y = x ** 2
    print(f"y.requires_grad: {y.requires_grad}")  # False

# detach: separate tensor from the graph
x = torch.tensor([1.0, 2.0], requires_grad=True)
z = (x * 2).detach()
print(f"z.requires_grad: {z.requires_grad}")  # False

Higher-order Gradients

x = torch.tensor(3.0, requires_grad=True)
y = x ** 4

# First derivative: dy/dx = 4x^3
dy_dx = torch.autograd.grad(y, x, create_graph=True)[0]
print(f"First derivative: {dy_dx}")   # 108

# Second derivative: d2y/dx2 = 12x^2
d2y_dx2 = torch.autograd.grad(dy_dx, x)[0]
print(f"Second derivative: {d2y_dx2}")  # 108

4. nn.Module — The Foundation of Neural Networks

torch.nn.Module is the base class for all PyTorch models. Every layer, activation function, and complete model inherits from it.

Official docs: https://pytorch.org/docs/stable/nn.html

import torch
import torch.nn as nn

class SimpleModel(nn.Module):
    def __init__(self, in_features, hidden_size, out_features):
        super().__init__()
        # Layers are automatically registered as parameters
        self.fc1     = nn.Linear(in_features, hidden_size)
        self.relu    = nn.ReLU()
        self.fc2     = nn.Linear(hidden_size, out_features)
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

model = SimpleModel(784, 256, 10)
print(model)

# Count parameters
total     = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total params: {total:,}")
print(f"Trainable params: {trainable:,}")

# Iterate named parameters
for name, param in model.named_parameters():
    print(f"{name}: {param.shape}")

# Forward pass
x      = torch.randn(32, 784)
output = model(x)
print(f"Output shape: {output.shape}")  # [32, 10]

Sequential, ModuleList, ModuleDict

# Sequential: stack layers in order
seq_model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

# ModuleList: manage layers as a list
class ResidualNet(nn.Module):
    def __init__(self, num_blocks, hidden_size):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(hidden_size, hidden_size)
            for _ in range(num_blocks)
        ])
        self.relu = nn.ReLU()

    def forward(self, x):
        for layer in self.layers:
            x = self.relu(layer(x)) + x  # residual connection
        return x

# ModuleDict: manage layers as a dictionary
class MultiTaskModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(784, 256)
        self.heads    = nn.ModuleDict({
            'classification': nn.Linear(256, 10),
            'regression':     nn.Linear(256, 1)
        })

    def forward(self, x, task='classification'):
        features = torch.relu(self.backbone(x))
        return self.heads[task](features)

5. Linear Regression from Scratch

Linear regression is the simplest model you can express in PyTorch. Building it from scratch solidifies understanding of the training loop.

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)
n_samples = 200

# Generate synthetic data: y = 3x + 2 + noise
X      = torch.linspace(-5, 5, n_samples).unsqueeze(1)
y_true = 3 * X + 2
y      = y_true + torch.randn_like(y_true) * 0.5

class LinearRegression(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

model     = LinearRegression()
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

n_epochs = 1000
losses   = []

for epoch in range(n_epochs):
    # 1. Forward pass
    y_pred = model(X)

    # 2. Compute loss
    loss = criterion(y_pred, y)
    losses.append(loss.item())

    # 3. Zero gradients (critical!)
    optimizer.zero_grad()

    # 4. Backward pass
    loss.backward()

    # 5. Update parameters
    optimizer.step()

    if (epoch + 1) % 200 == 0:
        w = model.linear.weight.item()
        b = model.linear.bias.item()
        print(f"Epoch {epoch+1}: Loss={loss.item():.4f}, w={w:.4f}, b={b:.4f}")

print(f"\nLearned weight: {model.linear.weight.item():.4f} (true: 3.0)")
print(f"Learned bias:   {model.linear.bias.item():.4f}   (true: 2.0)")

6. Multi-Layer Perceptron (MLP) — MNIST Classification

Building a complete classification model on the MNIST handwritten digit dataset.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

BATCH_SIZE    = 64
LEARNING_RATE = 0.001
N_EPOCHS      = 10
DEVICE        = torch.device("cuda" if torch.cuda.is_available() else "cpu")

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_dataset = datasets.MNIST('./data', train=True,  download=True, transform=transform)
test_dataset  = datasets.MNIST('./data', train=False,                transform=transform)

train_loader  = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True,  num_workers=2)
test_loader   = DataLoader(test_dataset,  batch_size=BATCH_SIZE, shuffle=False, num_workers=2)

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Flatten(),
            nn.Linear(784, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.network(x)

model     = MLP().to(DEVICE)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

def train_epoch(model, loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct    = 0
    total      = 0

    for data, target in loader:
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss   = criterion(output, target)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        pred        = output.argmax(dim=1)
        correct    += pred.eq(target).sum().item()
        total      += target.size(0)

    return total_loss / len(loader), 100.0 * correct / total

def evaluate(model, loader, criterion, device):
    model.eval()
    total_loss = 0
    correct    = 0
    total      = 0

    with torch.no_grad():
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            output      = model(data)
            total_loss += criterion(output, target).item()
            pred        = output.argmax(dim=1)
            correct    += pred.eq(target).sum().item()
            total      += target.size(0)

    return total_loss / len(loader), 100.0 * correct / total

for epoch in range(N_EPOCHS):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, DEVICE)
    test_loss,  test_acc  = evaluate(model, test_loader, criterion, DEVICE)
    print(f"Epoch {epoch+1}/{N_EPOCHS} | "
          f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}% | "
          f"Test Loss: {test_loss:.4f}, Test Acc: {test_acc:.2f}%")

7. Convolutional Neural Network (CNN) — CIFAR-10 Classification

Implementing a VGG-style CNN for image classification on CIFAR-10.

import torch
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2023, 0.1994, 0.2010))
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2023, 0.1994, 0.2010))
])

train_data = datasets.CIFAR10('./data', train=True,  download=True, transform=transform_train)
test_data  = datasets.CIFAR10('./data', train=False, transform=transform_test)

train_loader = DataLoader(train_data, batch_size=128, shuffle=True,  num_workers=4)
test_loader  = DataLoader(test_data,  batch_size=128, shuffle=False, num_workers=4)

CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']

class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 3 → 64, 32x32 → 16x16
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Dropout2d(0.1),

            # Block 2: 64 → 128, 16x16 → 8x8
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Dropout2d(0.2),

            # Block 3: 128 → 256, 8x8 → 4x4
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
        )

        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 4 * 4, 1024),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(1024, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model  = CNN().to(DEVICE)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
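
Before trusting the `256 * 4 * 4` input size of the classifier, it is worth verifying the feature-map shape with a dummy input. A minimal sketch with a stand-in backbone (one conv per block here for brevity; the channel and pooling structure matches the `features` stack above, so the flattened size is the same):

```python
import torch
import torch.nn as nn

# Stand-in backbone: same channels and pooling as the CNN's `features` stack
backbone = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),    nn.ReLU(), nn.MaxPool2d(2),  # 32 → 16
    nn.Conv2d(64, 128, 3, padding=1),  nn.ReLU(), nn.MaxPool2d(2),  # 16 → 8
    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 8 → 4
)

with torch.no_grad():
    out = backbone(torch.zeros(1, 3, 32, 32))  # dummy CIFAR-10-sized input

print(out.shape)                    # torch.Size([1, 256, 4, 4])
flat_features = out.flatten(1).shape[1]
print(flat_features)                # 4096 = 256 * 4 * 4
```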

8. Recurrent Neural Networks (RNN / LSTM) — Sequential Data

RNNs and LSTMs excel at time-series data and natural language processing tasks.

import torch
import torch.nn as nn
import numpy as np

class LSTMPredictor(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2,
                 output_size=1, dropout=0.2):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers  = num_layers

        self.lstm = nn.LSTM(
            input_size  = input_size,
            hidden_size = hidden_size,
            num_layers  = num_layers,
            batch_first = True,   # input: [batch, seq_len, features]
            dropout     = dropout if num_layers > 1 else 0,
            bidirectional = False
        )

        self.fc = nn.Sequential(
            nn.Linear(hidden_size, 32),
            nn.ReLU(),
            nn.Linear(32, output_size)
        )

    def forward(self, x):
        batch_size = x.size(0)

        h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(x.device)
        c0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(x.device)

        # out: [batch_size, seq_len, hidden_size]
        out, (hn, cn) = self.lstm(x, (h0, c0))

        # Use only the last time step
        out = self.fc(out[:, -1, :])
        return out

# Generate sine wave dataset
t    = np.linspace(0, 100, 1000)
data = np.sin(0.5 * t) + 0.1 * np.random.randn(1000)
data = torch.FloatTensor(data).unsqueeze(1)

def create_sequences(data, seq_len=50):
    X, y = [], []
    for i in range(len(data) - seq_len):
        X.append(data[i:i+seq_len])
        y.append(data[i+seq_len])
    return torch.stack(X), torch.stack(y)

X, y = create_sequences(data, seq_len=50)
print(f"X shape: {X.shape}")  # [950, 50, 1]
print(f"y shape: {y.shape}")  # [950, 1]
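
Wiring these pieces into a training loop follows the same pattern as earlier sections. A self-contained sketch (the data is re-created inline, and a smaller `TinyLSTM` stand-in replaces `LSTMPredictor` so it runs quickly):

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
np.random.seed(0)

# Re-create the sine-wave sequences so this snippet runs standalone
t = np.linspace(0, 100, 1000)
series = torch.FloatTensor(np.sin(0.5 * t) + 0.1 * np.random.randn(1000)).unsqueeze(1)
seq_len = 50
X = torch.stack([series[i:i+seq_len] for i in range(len(series) - seq_len)])
y = torch.stack([series[i+seq_len] for i in range(len(series) - seq_len)])

# Smaller stand-in for LSTMPredictor: one layer, hidden_size=32
class TinyLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(1, 32, batch_first=True)
        self.fc   = nn.Linear(32, 1)

    def forward(self, x):
        out, _ = self.lstm(x)          # initial state defaults to zeros
        return self.fc(out[:, -1, :])  # last time step only

model     = TinyLSTM()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

first_loss = last_loss = None
for epoch in range(20):                # full-batch training, for brevity
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()
    if first_loss is None:
        first_loss = loss.item()
    last_loss = loss.item()

print(f"loss: {first_loss:.4f} → {last_loss:.4f}")
```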

# GRU — fewer parameters than LSTM
class GRUPredictor(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, num_layers,
                          batch_first=True, dropout=0.2)
        self.fc  = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.gru(x)
        return self.fc(out[:, -1, :])

9. Transformer — Multi-head Attention from Scratch

Implementing the key components of the "Attention Is All You Need" paper.

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0

        self.d_model   = d_model
        self.num_heads = num_heads
        self.d_k       = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout)
        self.scale   = math.sqrt(self.d_k)

    def split_heads(self, x):
        # [batch, seq, d_model] → [batch, num_heads, seq, d_k]
        batch, seq, _ = x.shape
        x = x.view(batch, seq, self.num_heads, self.d_k)
        return x.transpose(1, 2)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        Q = self.split_heads(self.W_q(query))
        K = self.split_heads(self.W_k(key))
        V = self.split_heads(self.W_v(value))

        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        context = torch.matmul(attn_weights, V)
        context = context.transpose(1, 2).contiguous()
        context = context.view(batch_size, -1, self.d_model)

        output = self.W_o(context)
        return output, attn_weights

class FeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.ff        = FeedForward(d_model, d_ff, dropout)
        self.norm1     = nn.LayerNorm(d_model)
        self.norm2     = nn.LayerNorm(d_model)
        self.dropout   = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        attn_out, _ = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_out))
        ff_out = self.ff(x)
        x = self.norm2(x + self.dropout(ff_out))
        return x

class PositionalEncoding(nn.Module):
    def __init__(self, d_model=512, max_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

        pe       = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

# Example usage
d_model      = 512
encoder_layer = TransformerEncoderLayer(d_model=d_model, num_heads=8)
pos_enc       = PositionalEncoding(d_model=d_model)

x      = torch.randn(2, 10, d_model)
x      = pos_enc(x)
output = encoder_layer(x)
print(f"Transformer Encoder output: {output.shape}")  # [2, 10, 512]

10. Data Loading — Dataset and DataLoader

An efficient data pipeline is directly tied to training speed and flexibility.

Official tutorial: https://pytorch.org/tutorials/beginner/basics/intro.html

import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
from PIL import Image
import os

# Custom image Dataset
class CustomImageDataset(Dataset):
    def __init__(self, csv_file, img_dir, transform=None):
        self.annotations = pd.read_csv(csv_file)
        self.img_dir     = img_dir
        self.transform   = transform

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.annotations.iloc[idx, 0])
        image    = Image.open(img_path).convert('RGB')
        label    = int(self.annotations.iloc[idx, 1])
        if self.transform:
            image = self.transform(image)
        return image, label

# Tabular Dataset
class TabularDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.FloatTensor(X)
        self.y = torch.LongTensor(y)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

dataset = TabularDataset(
    X=np.random.randn(1000, 20),
    y=np.random.randint(0, 5, 1000)
)

# Advanced DataLoader settings
advanced_loader = DataLoader(
    dataset,
    batch_size       = 64,
    shuffle          = True,
    num_workers      = 4,       # parallel data loading
    pin_memory       = True,    # faster GPU transfer
    drop_last        = True,    # drop incomplete last batch
    prefetch_factor  = 2,
    persistent_workers = True
)

for batch_X, batch_y in advanced_loader:
    print(f"batch X: {batch_X.shape}")  # [64, 20]
    print(f"batch y: {batch_y.shape}")  # [64]
    break

# WeightedRandomSampler: handle class imbalance
from torch.utils.data import WeightedRandomSampler

class_counts   = [400, 250, 200, 100, 50]   # illustrative counts for the 5 classes above
weights        = 1.0 / torch.tensor(class_counts, dtype=torch.float)
sample_weights = weights[dataset.y]          # per-sample weight = inverse class frequency

sampler = WeightedRandomSampler(
    weights     = sample_weights,
    num_samples = len(dataset),
    replacement = True
)

balanced_loader = DataLoader(dataset, batch_size=32, sampler=sampler)
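
To see the effect, here is a self-contained toy check (synthetic `X`/`y` with illustrative counts): batches drawn through the sampler come out roughly class-balanced even though the underlying data are split 90:9:1.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, WeightedRandomSampler

torch.manual_seed(0)

# Imbalanced toy data: 900 / 90 / 10 samples for classes 0 / 1 / 2
y = torch.cat([torch.zeros(900), torch.ones(90), torch.full((10,), 2)]).long()
X = torch.randn(1000, 4)

class_counts   = torch.bincount(y).float()   # [900, 90, 10]
sample_weights = (1.0 / class_counts)[y]     # inverse class frequency per sample

sampler = WeightedRandomSampler(sample_weights, num_samples=len(y), replacement=True)
loader  = DataLoader(TensorDataset(X, y), batch_size=1000, sampler=sampler)

_, yb = next(iter(loader))
sampled_counts = torch.bincount(yb, minlength=3)
print(sampled_counts)   # roughly uniform (~333 each) despite the 90:9:1 data
```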

11. Optimizers — SGD, Adam, AdamW

Official docs: https://pytorch.org/docs/stable/optim.html

import torch.optim as optim

model = nn.Linear(100, 10)

# SGD with momentum and weight decay
sgd = optim.SGD(
    model.parameters(),
    lr           = 0.01,
    momentum     = 0.9,
    weight_decay = 1e-4,
    nesterov     = True
)

# Adam: adaptive learning rates
adam = optim.Adam(
    model.parameters(),
    lr           = 0.001,
    betas        = (0.9, 0.999),
    eps          = 1e-8,
    weight_decay = 0
)

# AdamW: correct decoupled weight decay (recommended for Transformers)
adamw = optim.AdamW(
    model.parameters(),
    lr           = 1e-3,
    betas        = (0.9, 0.999),
    weight_decay = 0.01
)

# Per-parameter-group learning rates (useful for Transfer Learning).
# This requires a model with distinct submodules — e.g. torchvision's VGG,
# which exposes `features` and `classifier`:
from torchvision import models

vgg = models.vgg11(weights=None)
optimizer = optim.Adam([
    {'params': vgg.features.parameters(),   'lr': 1e-4},
    {'params': vgg.classifier.parameters(), 'lr': 1e-3},
], lr=1e-3)

# Save and restore optimizer state
checkpoint = {
    'model':     model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'epoch':     10
}
torch.save(checkpoint, 'checkpoint.pt')

ckpt = torch.load('checkpoint.pt', weights_only=True)
model.load_state_dict(ckpt['model'])
optimizer.load_state_dict(ckpt['optimizer'])

12. Learning Rate Schedulers

A well-chosen scheduler usually improves final performance over a fixed learning rate.

from torch.optim.lr_scheduler import (
    StepLR, CosineAnnealingLR, OneCycleLR, ReduceLROnPlateau,
    CosineAnnealingWarmRestarts
)

model     = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# StepLR: multiply LR by gamma every step_size epochs
step_scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# CosineAnnealingLR: cosine decay
cosine_scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# ReduceLROnPlateau: reduce when metric stops improving
plateau_scheduler = ReduceLROnPlateau(
    optimizer,
    mode     = 'min',
    factor   = 0.5,
    patience = 10,
    min_lr   = 1e-7
)

# OneCycleLR: super-convergence
one_cycle = OneCycleLR(
    optimizer,
    max_lr           = 0.01,
    steps_per_epoch  = 100,
    epochs           = 30,
    pct_start        = 0.3,
    anneal_strategy  = 'cos'
)

# CosineAnnealingWarmRestarts: periodic restarts
warm_restart = CosineAnnealingWarmRestarts(
    optimizer,
    T_0     = 10,
    T_mult  = 2,
    eta_min = 1e-6
)

# Usage in a training loop — use ONE scheduler per optimizer
for epoch in range(100):
    val_loss = 0.5  # from actual validation

    cosine_scheduler.step()             # epoch-based schedulers: call once per epoch
    # plateau_scheduler.step(val_loss)  # ReduceLROnPlateau instead takes the monitored metric
    # one_cycle.step()                  # OneCycleLR steps once per BATCH, not per epoch

    print(f"Epoch {epoch+1}: LR = {optimizer.param_groups[0]['lr']:.6f}")

13. Regularization — Dropout, BatchNorm, LayerNorm

Regularization prevents overfitting and stabilizes training.

import torch.nn as nn

# Dropout: randomly zero out neurons during training
class DropoutDemo(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1     = nn.Linear(100, 50)
        self.dropout = nn.Dropout(p=0.5)
        self.fc2     = nn.Linear(50, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)   # active during train(), inactive during eval()
        return self.fc2(x)

# BatchNorm1d: normalize over the batch dimension (for FC layers)
bn_model = nn.Sequential(
    nn.Linear(100, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.BatchNorm1d(32),
    nn.ReLU(),
    nn.Linear(32, 10)
)

# BatchNorm2d: for 2D feature maps (after Conv layers)
cnn_bn = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
)

# LayerNorm: normalize over feature dimension (preferred for Transformers)
transformer_norm = nn.Sequential(
    nn.Linear(512, 512),
    nn.LayerNorm(512),
    nn.ReLU()
)

# GroupNorm: a middle ground between BatchNorm and LayerNorm
group_norm = nn.GroupNorm(num_groups=8, num_channels=64)

# InstanceNorm: used in style transfer
instance_norm = nn.InstanceNorm2d(64)

# Summary:
# BatchNorm   → CNN, batch-level statistics, depends on batch size
# LayerNorm   → Transformers / RNNs, feature-level statistics
# GroupNorm   → small batches where BatchNorm is unstable
# InstanceNorm → style transfer, image generation
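
The train()/eval() distinction noted for Dropout above is easy to verify directly: in training mode roughly p of the activations are zeroed and the survivors are scaled by 1/(1-p); in eval mode the layer is the identity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()                       # training mode: zero ~p of inputs, scale rest by 1/(1-p)
train_out = drop(x)
frac_zeroed = (train_out == 0).float().mean().item()
print(f"zeroed in train(): {frac_zeroed:.2f}")   # ≈ 0.50
print(train_out[train_out != 0][0].item())       # 2.0 — survivors scaled by 1/(1-0.5)

drop.eval()                        # eval mode: identity, no scaling
eval_out = drop(x)
print(torch.equal(eval_out, x))    # True
```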

14. Transfer Learning

Leveraging ImageNet-pretrained models to achieve high performance with limited data.

import torchvision.models as models
import torch.nn as nn

# Load pretrained models
resnet50    = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
efficientnet = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
vit         = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)

# Strategy 1: Feature Extractor (freeze backbone)
for param in resnet50.parameters():
    param.requires_grad = False

num_classes  = 5
resnet50.fc  = nn.Linear(resnet50.fc.in_features, num_classes)

trainable = sum(p.numel() for p in resnet50.parameters() if p.requires_grad)
print(f"Trainable params: {trainable:,}")  # 10,245 — only the new fc layer (2048 × 5 + 5)

# Strategy 2: Fine-tuning with layer-wise learning rates
resnet_ft    = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
resnet_ft.fc = nn.Linear(resnet_ft.fc.in_features, num_classes)

optimizer = torch.optim.AdamW([
    {'params': resnet_ft.layer1.parameters(), 'lr': 1e-5},
    {'params': resnet_ft.layer2.parameters(), 'lr': 1e-5},
    {'params': resnet_ft.layer3.parameters(), 'lr': 1e-4},
    {'params': resnet_ft.layer4.parameters(), 'lr': 1e-4},
    {'params': resnet_ft.fc.parameters(),     'lr': 1e-3},
], lr=1e-4, weight_decay=0.01)

# ImageNet normalization for preprocessing
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std =[0.229, 0.224, 0.225])
])

15. Saving and Loading Models

import torch
import torch.nn as nn

model     = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters())

# Option 1: state_dict (recommended)
torch.save(model.state_dict(), 'model_weights.pt')

loaded_model = nn.Linear(10, 5)
loaded_model.load_state_dict(torch.load('model_weights.pt', weights_only=True))
loaded_model.eval()

# Option 2: full model (not recommended — low portability)
torch.save(model, 'full_model.pt')

# Option 3: checkpoint — save full training state
def save_checkpoint(model, optimizer, scheduler, epoch, loss, path):
    torch.save({
        'epoch':                epoch,
        'model_state_dict':     model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scheduler_state_dict': scheduler.state_dict() if scheduler else None,
        'loss':                 loss,
    }, path)

def load_checkpoint(path, model, optimizer=None, scheduler=None):
    ckpt = torch.load(path, map_location='cpu', weights_only=True)
    model.load_state_dict(ckpt['model_state_dict'])
    if optimizer:
        optimizer.load_state_dict(ckpt['optimizer_state_dict'])
    if scheduler and ckpt['scheduler_state_dict']:
        scheduler.load_state_dict(ckpt['scheduler_state_dict'])
    return ckpt['epoch'], ckpt['loss']

# Load GPU model onto CPU
model_cpu = nn.Linear(10, 5)
model_cpu.load_state_dict(
    torch.load('model_weights.pt', map_location='cpu', weights_only=True)
)
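Putting the checkpoint pattern above together, a resume-from-checkpoint flow might look like this (a minimal sketch with a toy model; the file name is illustrative):

```python
import torch
import torch.nn as nn

model     = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Save full training state mid-run
torch.save({
    'epoch':                3,
    'model_state_dict':     model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss':                 0.42,
}, 'ckpt.pt')

# Later: rebuild the same objects, then restore their state
model2     = nn.Linear(10, 5)
optimizer2 = torch.optim.Adam(model2.parameters(), lr=1e-3)

ckpt = torch.load('ckpt.pt', map_location='cpu', weights_only=True)
model2.load_state_dict(ckpt['model_state_dict'])
optimizer2.load_state_dict(ckpt['optimizer_state_dict'])
start_epoch = ckpt['epoch'] + 1   # resume from the next epoch

model2.train()                    # back to train mode before resuming
print(f"Resuming from epoch {start_epoch}")
```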

16. TorchScript and Model Deployment

Deploying trained models to production environments.

import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 5)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = SimpleNet()
model.eval()

# Option 1: torch.jit.script — compile entire model
scripted_model = torch.jit.script(model)
scripted_model.save('model_scripted.pt')
loaded_scripted = torch.jit.load('model_scripted.pt')

x = torch.randn(4, 10)
with torch.no_grad():
    out = loaded_scripted(x)
print(f"TorchScript output: {out.shape}")

# Option 2: torch.jit.trace — trace with example input
example_input = torch.randn(1, 10)
traced_model  = torch.jit.trace(model, example_input)
traced_model.save('model_traced.pt')

# Option 3: ONNX export (cross-framework compatibility)
dummy_input = torch.randn(1, 10)
torch.onnx.export(
    model,
    dummy_input,
    'model.onnx',
    export_params  = True,
    opset_version  = 17,
    input_names    = ['input'],
    output_names   = ['output'],
    dynamic_axes   = {
        'input':  {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)
print("ONNX export complete")

# Option 4: torch.compile (PyTorch 2.0+)
compiled_model = torch.compile(model)
out = compiled_model(x)
print(f"torch.compile output: {out.shape}")
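One practical difference between the two JIT options above: torch.jit.trace records only the execution path taken by the example input, so data-dependent branches are baked in, while torch.jit.script preserves them. A small sketch:

```python
import torch
import torch.nn as nn

class Branchy(nn.Module):
    def forward(self, x):
        # data-dependent control flow
        if x.sum() > 0:
            return x * 2
        return x * -1

model = Branchy()
pos = torch.ones(3)
neg = -torch.ones(3)

scripted = torch.jit.script(model)      # keeps the if/else
traced   = torch.jit.trace(model, pos)  # records only the "positive" branch

print(scripted(neg))  # [1., 1., 1.]   — else branch still taken
print(traced(neg))    # [-2., -2., -2.] — recorded path replayed regardless
```

Prefer scripting (or torch.compile) whenever the model contains loops or branches that depend on tensor values; tracing will silently give wrong results on the untraced paths.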

17. Distributed Training (DDP) — DistributedDataParallel

Using multiple GPUs to dramatically accelerate training.

Official tutorial: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

# train_ddp.py — run as a standalone script
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from torchvision import datasets, transforms

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group(
        backend    = 'nccl',
        rank       = rank,
        world_size = world_size
    )

def cleanup():
    dist.destroy_process_group()

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.net(x.view(x.size(0), -1))

def train(rank, world_size, num_epochs=5):
    print(f"Process {rank}/{world_size} starting")
    setup(rank, world_size)

    torch.cuda.set_device(rank)
    device = torch.device(f'cuda:{rank}')

    # Wrap model with DDP
    model     = SimpleModel().to(device)
    ddp_model = DDP(model, device_ids=[rank])

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)

    # DistributedSampler ensures each process sees a unique data shard
    sampler = DistributedSampler(
        dataset,
        num_replicas = world_size,
        rank         = rank,
        shuffle      = True
    )

    loader = DataLoader(
        dataset,
        batch_size   = 128,
        sampler      = sampler,
        num_workers  = 4,
        pin_memory   = True
    )

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(ddp_model.parameters(), lr=0.001)

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # shuffle differently each epoch
        ddp_model.train()
        total_loss = 0.0

        for data, target in loader:
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = ddp_model(data)
            loss   = criterion(output, target)
            loss.backward()   # gradients are automatically all-reduced
            optimizer.step()
            total_loss += loss.item()

        if rank == 0:
            avg_loss = total_loss / len(loader)
            print(f"Epoch {epoch+1}: avg loss = {avg_loss:.4f}")

    cleanup()

if __name__ == '__main__':
    import torch.multiprocessing as mp
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size, 5), nprocs=world_size, join=True)

Launching with torchrun

Note that torchrun spawns and manages the worker processes itself, passing rank information via environment variables (RANK, LOCAL_RANK, WORLD_SIZE), so a torchrun-launched script reads those instead of calling mp.spawn.

# Single node, 4 GPUs
torchrun --nproc_per_node=4 train_ddp.py

# Multi-node (node 0 of 2)
torchrun --nnodes=2 --nproc_per_node=4 \
         --node_rank=0 \
         --master_addr="192.168.1.100" \
         --master_port=12355 \
         train_ddp.py
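Because torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE to each worker, a torchrun-compatible entry point reads them from the environment rather than calling mp.spawn. A sketch (nccl assumed for GPU machines; the gloo backend works for CPU-only testing):

```python
import os
import torch
import torch.distributed as dist

def setup_from_env():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT;
    # init_process_group picks rank and world size up from the environment
    dist.init_process_group(backend='nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    return local_rank

# main body: local_rank = setup_from_env(), wrap the model in
# DDP(model, device_ids=[local_rank]), train as usual, then
# dist.destroy_process_group()
```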

DataParallel vs DistributedDataParallel

# DataParallel (DP): simple but inefficient
# - all gradients funnel through GPU 0 → bottleneck
# - multi-thread, not multi-process
model_dp = nn.DataParallel(model, device_ids=[0, 1, 2, 3])

# DistributedDataParallel (DDP): recommended
# - each GPU computes gradients independently
# - efficient all-reduce synchronization
# - faster than DP even on a single machine (one process per GPU avoids the Python GIL)
model_ddp = DDP(model, device_ids=[rank])

18. Advanced Techniques

Mixed Precision Training

from torch.cuda.amp import autocast, GradScaler

model     = SimpleModel().to('cuda')
optimizer = torch.optim.Adam(model.parameters())
scaler    = GradScaler()

for data, target in train_loader:
    data, target = data.to('cuda'), target.to('cuda')
    optimizer.zero_grad()

    # Forward pass in mixed precision (FP16 where numerically safe)
    with autocast():
        output = model(data)
        loss   = criterion(output, target)

    # Scaled backward pass
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
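The same autocast mechanism also runs on CPU with bfloat16, which is handy for exercising mixed-precision code paths without a GPU (a sketch; newer releases also expose the unified spelling torch.amp.autocast):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)
x = torch.randn(8, 16)

# CPU autocast: eligible ops (linear, matmul, ...) run in bfloat16
with torch.autocast(device_type='cpu', dtype=torch.bfloat16):
    out = model(x)

print(out.dtype)           # torch.bfloat16
print(model.weight.dtype)  # torch.float32 — master weights stay FP32
```

Note that GradScaler is only needed for FP16; bfloat16's wider exponent range usually makes loss scaling unnecessary.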

Gradient Clipping

# Prevent exploding gradients
max_grad_norm = 1.0
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
optimizer.step()
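clip_grad_norm_ also returns the total gradient norm measured before clipping, which is useful for logging training stability. A minimal sketch with an inflated loss to force large gradients:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
loss = (model(torch.randn(8, 4)) * 1000).pow(2).mean()  # inflated → large grads
loss.backward()

pre_clip_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"Grad norm before clipping: {pre_clip_norm:.1f}")

# After the call, the global norm over all gradients is at most max_norm
total = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
print(f"Grad norm after clipping:  {total:.4f}")  # ~1.0
```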

Reproducibility

import random
import numpy as np

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True   # reproducible conv results, may be slower
    torch.backends.cudnn.benchmark     = False  # disable the nondeterministic autotuner

set_seed(42)
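As a quick sanity check, re-seeding should make random tensor generation repeat exactly:

```python
import torch

torch.manual_seed(42)
a = torch.randn(3)

torch.manual_seed(42)  # reset to the same seed
b = torch.randn(3)

print(torch.equal(a, b))  # True — identical draws after re-seeding
```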

Conclusion

This guide covered the core concepts of PyTorch from the ground up, all the way to distributed training in production. Here is a recommended learning roadmap:

  1. Foundations: tensor operations, autograd, simple model implementations
  2. Intermediate: CNN, RNN, Transfer Learning, DataLoader optimization
  3. Advanced: Transformer, DDP, Mixed Precision Training
  4. Deployment: TorchScript, ONNX, torch.compile

The PyTorch ecosystem is continuously evolving. Check the official documentation and PyTorch blog for the latest features and updates.

References