Computer Vision Complete Guide: CNN, ViT, YOLO, and Stable Diffusion

Introduction

Computer Vision is the field of AI that enables machines to understand images and video. From smartphone face unlock to autonomous vehicle obstacle detection to medical imaging assistance — computer vision powers them all.

This guide takes you from pixel-level fundamentals through Vision Transformers, Stable Diffusion, and production deployment with PyTorch code examples throughout.


1. Image Fundamentals: Pixels, Channels, Convolution

1.1 Digital Image Structure

A digital image is a 2D grid of pixels.

  • Grayscale image: 2D array of shape H x W, pixel values 0–255
  • Color image (RGB): 3D tensor of shape H x W x 3, each channel 0–255
  • Resolution: image dimensions (e.g., 1920x1080), pixel density (DPI)
import torch
import torchvision.transforms as T
from PIL import Image

# Load image and convert to tensor
img = Image.open("sample.jpg").convert("RGB")
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),          # [0,255] -> [0.0,1.0], HWC -> CHW
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225])
])
tensor = transform(img)   # shape: [3, 224, 224]
print(f"Shape: {tensor.shape}, dtype: {tensor.dtype}")

1.2 Convolution and Common Filters

Convolution slides a small kernel (filter) across the entire image to extract features.

import torch
import torch.nn.functional as F
import torchvision.transforms as T

# 3x3 Sobel edge detection kernel (horizontal gradients)
kernel = torch.tensor([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1]
], dtype=torch.float32).unsqueeze(0).unsqueeze(0)  # [1, 1, 3, 3]

# Apply convolution to a grayscale image (`img` is the PIL image loaded in 1.1)
gray_tensor = T.ToTensor()(T.Grayscale()(img)).unsqueeze(0)  # [1, 1, H, W]
edges = F.conv2d(gray_tensor, kernel, padding=1)
| Kernel Type | Purpose | Use Case |
|---|---|---|
| Sobel | Edge detection (H/V) | Lane detection in self-driving |
| Gaussian | Blurring, noise removal | Image pre-processing |
| Laplacian | Edge sharpening | Image enhancement |
| Average | Mean blur | Downsampling |
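Every filter in the table plugs into the same `F.conv2d` call. As a minimal sketch, here is the Gaussian case with the standard 3x3 binomial kernel; a random tensor stands in for a real grayscale image:

```python
import torch
import torch.nn.functional as F

# 3x3 Gaussian kernel (binomial approximation), normalized to sum to 1
gaussian = torch.tensor([
    [1., 2., 1.],
    [2., 4., 2.],
    [1., 2., 1.]
]) / 16.0
gaussian = gaussian.view(1, 1, 3, 3)  # [out_ch, in_ch, kH, kW]

img = torch.rand(1, 1, 64, 64)        # stand-in for a grayscale image batch
blurred = F.conv2d(img, gaussian, padding=1)
print(blurred.shape)  # torch.Size([1, 1, 64, 64])
```

Swapping in the Sobel or Laplacian kernel changes only the tensor values, not the call.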

1.3 Albumentations Augmentation Pipeline

import albumentations as A
from albumentations.pytorch import ToTensorV2
import cv2

train_transform = A.Compose([
    A.RandomResizedCrop(height=224, width=224, scale=(0.7, 1.0)),  # size=(224, 224) in newer versions
    A.HorizontalFlip(p=0.5),
    A.ColorJitter(brightness=0.4, contrast=0.4,
                  saturation=0.4, hue=0.1, p=0.8),
    A.GaussianBlur(blur_limit=(3, 7), p=0.2),
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1,
                       rotate_limit=15, p=0.5),
    A.Normalize(mean=(0.485, 0.456, 0.406),
                std=(0.229, 0.224, 0.225)),
    ToTensorV2(),
])

image = cv2.cvtColor(cv2.imread("sample.jpg"), cv2.COLOR_BGR2RGB)
augmented = train_transform(image=image)["image"]

2. CNN Architecture Evolution

2.1 Major Architecture Timeline

| Year | Architecture | Key Contribution | ImageNet Top-1 |
|---|---|---|---|
| 1998 | LeNet-5 | First practical CNN, conv+pool structure | - |
| 2012 | AlexNet | ReLU, Dropout, GPU training | 63.3% |
| 2014 | VGGNet | Stacking deep 3x3 kernels | 74.4% |
| 2015 | ResNet-50 | Skip connections, residual learning | 76.0% |
| 2017 | DenseNet | Direct connections between all layers | 77.4% |
| 2019 | EfficientNet-B7 | Compound scaling | 84.4% |
| 2022 | ConvNeXt-L | Transformer design principles in CNN | 86.6% |

2.2 ResNet: The Residual Learning Revolution

ResNet's key innovation is the skip connection (residual connection): adding the input x directly to the block output resolves the vanishing gradient problem. The residual output F(x) + x differentiates to dF/dx + 1, so the identity term always passes a gradient of 1 back through the skip path, no matter how small dF/dx becomes.
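The idea is easy to see in code. A minimal residual block sketch (simplified relative to torchvision's actual blocks, but structurally the same):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # "+ x" is the skip path the gradient flows through

x = torch.randn(2, 64, 32, 32)
block = ResidualBlock(64)
print(block(x).shape)  # torch.Size([2, 64, 32, 32])
```

Stacking such blocks is what lets the classifier below reach 50+ layers without vanishing gradients.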

import torch
import torch.nn as nn
import torchvision.models as models

class ResNetClassifier(nn.Module):
    def __init__(self, num_classes: int, pretrained: bool = True):
        super().__init__()
        weights = models.ResNet50_Weights.IMAGENET1K_V2 if pretrained else None
        self.backbone = models.resnet50(weights=weights)
        in_features = self.backbone.fc.in_features
        self.backbone.fc = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(in_features, num_classes)
        )

    def forward(self, x):
        return self.backbone(x)

def train_one_epoch(model, loader, optimizer, criterion, device):
    model.train()
    total_loss, correct = 0.0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * images.size(0)
        correct += (outputs.argmax(1) == labels).sum().item()
    return total_loss / len(loader.dataset), correct / len(loader.dataset)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ResNetClassifier(num_classes=10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
criterion = nn.CrossEntropyLoss()

2.3 EfficientNet: Compound Scaling

EfficientNet introduces a compound coefficient that scales width, depth, and resolution together with a fixed ratio, achieving superior accuracy/efficiency trade-offs.
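Concretely, with the base coefficients α = 1.2 (depth), β = 1.1 (width), γ = 1.15 (resolution) reported in the EfficientNet paper, the compound coefficient φ scales each dimension exponentially. The base depth/width/resolution values below are illustrative, not the exact B0 configuration:

```python
# EfficientNet compound scaling: depth ~ alpha^phi, width ~ beta^phi,
# resolution ~ gamma^phi, chosen so alpha * beta^2 * gamma^2 ≈ 2
alpha, beta, gamma = 1.2, 1.1, 1.15

def scaled_dims(phi: int, base_depth=18, base_width=1.0, base_res=224):
    return (
        round(base_depth * alpha ** phi),    # number of layers
        round(base_width * beta ** phi, 2),  # channel-width multiplier
        round(base_res * gamma ** phi),      # input resolution
    )

for phi in range(4):
    d, w, r = scaled_dims(phi)
    print(f"phi={phi}: {d} layers, width x{w}, resolution {r}")
```

Doubling φ roughly doubles FLOPs, which is why each B0→B7 step trades compute for accuracy in a controlled way.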

import timm

# EfficientNet-B4 fine-tuning
model = timm.create_model(
    "efficientnet_b4",
    pretrained=True,
    num_classes=100,
    drop_rate=0.3
)

# Freeze backbone (feature extraction mode)
for name, param in model.named_parameters():
    if "classifier" not in name:
        param.requires_grad = False

2.4 ConvNeXt: Modernizing CNN

ConvNeXt applies Transformer design principles (large kernels, LayerNorm, GELU, fewer activations) to a pure CNN, matching Swin Transformer performance with comparable speed.

model = timm.create_model(
    "convnext_large",
    pretrained=True,
    num_classes=1000
)
# Uses 7x7 depthwise conv, LayerNorm, GELU activation

3. Object Detection: YOLO, DETR, Faster R-CNN

3.1 Detection Approach Comparison

| Approach | Model | Speed | Accuracy | Notes |
|---|---|---|---|---|
| Anchor-based 2-stage | Faster R-CNN | Slow | High | RPN + classifier separated |
| Anchor-based 1-stage | YOLOv8 | Fast | Med-High | Single-pass inference |
| Anchor-free 1-stage | YOLOv10, FCOS | Very fast | High | NMS-free, no anchors |
| Transformer | DETR | Medium | High | End-to-end, relational modeling |

3.2 YOLOv8 in Practice

from ultralytics import YOLO

# Load pretrained model
model = YOLO("yolov8n.pt")   # nano: speed first
# model = YOLO("yolov8x.pt") # extra-large: accuracy first

# Single image inference
results = model("image.jpg", conf=0.25, iou=0.45)
for result in results:
    for box in result.boxes:
        cls = int(box.cls[0])
        conf = float(box.conf[0])
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"Class: {result.names[cls]}, Conf: {conf:.2f}")

# Fine-tune on custom dataset
model.train(
    data="custom.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    lr0=0.01,
    device=0
)

# Evaluation
metrics = model.val()
print(f"mAP50: {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")

3.3 YOLOv10: NMS-Free Detection

YOLOv10 eliminates the Non-Maximum Suppression (NMS) post-processing step through dual label assignment and consistency matching, enabling true end-to-end inference with reduced latency.
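To appreciate what is being removed, here is a minimal sketch of the classic greedy NMS step that anchor-based detectors run after inference:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, format [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy NMS: keep the highest-scoring box, drop overlapping neighbors."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        overlaps = iou(boxes[i], boxes[order[1:]])
        order = order[1:][overlaps < iou_thresh]
    return keep

boxes = np.array([[0, 0, 100, 100], [10, 10, 110, 110], [200, 200, 300, 300]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2] — the two overlapping boxes collapse to one
```

YOLOv10's one-to-one assignment during training means each object already gets a single box, so this whole loop disappears from the inference path.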

from ultralytics import YOLO

model = YOLO("yolov10n.pt")
# Returns final detections directly without NMS — lower latency
results = model.predict("video.mp4", stream=True)
for frame_result in results:
    print(frame_result.boxes)

3.4 DETR: Detection Transformer

DETR uses bipartite matching loss to predict the final set of boxes directly, eliminating both anchors and NMS.
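The bipartite matching can be sketched with SciPy's Hungarian-algorithm solver. The toy cost matrix below stands in for DETR's real cost, which is a weighted sum of class probability, box L1, and GIoU terms:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: rows = predicted queries, cols = ground-truth objects
cost = np.array([
    [0.9, 0.1, 0.8],   # query 0 matches GT 1 cheaply
    [0.2, 0.7, 0.9],   # query 1 matches GT 0 cheaply
    [0.8, 0.9, 0.3],   # query 2 matches GT 2 cheaply
])

# One-to-one assignment minimizing total cost (Hungarian algorithm)
pred_idx, gt_idx = linear_sum_assignment(cost)
print(list(zip(pred_idx.tolist(), gt_idx.tolist())))  # [(0, 1), (1, 0), (2, 2)]
```

Because each ground-truth object is matched to exactly one query, duplicate predictions are penalized during training and NMS becomes unnecessary at inference.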

import torch
from transformers import DetrImageProcessor, DetrForObjectDetection
from PIL import Image

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

img = Image.open("image.jpg")
inputs = processor(images=img, return_tensors="pt")
outputs = model(**inputs)

target_sizes = torch.tensor([img.size[::-1]])
results = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.9
)[0]

for score, label, box in zip(
    results["scores"], results["labels"], results["boxes"]
):
    name = model.config.id2label[label.item()]
    print(f"{name}: {score:.3f} at {[round(i, 2) for i in box.tolist()]}")

4. Segmentation: DeepLab, Mask R-CNN, SAM

4.1 Segmentation Task Types

  • Semantic: Class label per pixel (car, road, sky...)
  • Instance: Distinguishes individual objects of the same class (car1, car2...)
  • Panoptic: Combines semantic and instance segmentation
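Semantic segmentation quality is usually reported as mean IoU (mIoU): per-class intersection-over-union over pixels, averaged across classes. A minimal sketch on toy label maps:

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean IoU over classes that appear in the prediction or the target."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

pred   = np.array([[0, 0, 1], [1, 1, 2], [2, 2, 2]])
target = np.array([[0, 0, 1], [1, 2, 2], [2, 2, 2]])
print(round(mean_iou(pred, target, num_classes=3), 3))  # 0.822
```

Instance and panoptic tasks use matching-based metrics (mask AP, PQ) instead, since per-pixel class labels alone cannot distinguish object identities.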

4.2 SAM: Segment Anything Model

Meta's SAM accepts prompts (points, boxes, masks) and segments any object zero-shot. It uses a 3-module architecture: Image Encoder (ViT-H), Prompt Encoder, and Mask Decoder, trained on the SA-1B dataset (1 billion masks).

import torch
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")
predictor = SamPredictor(sam)

# Set image
image = cv2.cvtColor(cv2.imread("image.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Point-prompted segmentation
input_point = np.array([[500, 375]])  # click location (x, y)
input_label = np.array([1])           # 1=foreground, 0=background

masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True   # return multiple candidate masks
)

best_mask = masks[scores.argmax()]
print(f"Mask shape: {best_mask.shape}, Score: {scores.max():.3f}")

# Box-prompted segmentation
input_box = np.array([100, 100, 400, 400])
masks_box, _, _ = predictor.predict(
    box=input_box[None, :],
    multimask_output=False
)

4.3 DeepLabV3+ Semantic Segmentation

DeepLabV3+ uses ASPP (Atrous Spatial Pyramid Pooling) to capture multi-scale context information.
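An atrous (dilated) convolution with rate r inserts r-1 gaps between kernel taps, enlarging the receptive field with no extra parameters. A quick sketch comparing rates 1 and 4 on the same kernel:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 32, 32)
kernel = torch.randn(1, 1, 3, 3)

# Standard 3x3 conv: 3x3 receptive field
y1 = F.conv2d(x, kernel, padding=1, dilation=1)
# Atrous 3x3 conv, rate 4: 9x9 receptive field, identical parameter count
y2 = F.conv2d(x, kernel, padding=4, dilation=4)

print(y1.shape, y2.shape)  # same spatial size thanks to matching padding
```

ASPP runs several such branches with different rates in parallel and concatenates them, which is how it captures context at multiple scales.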

import torch
import torchvision.models.segmentation as seg_models

model = seg_models.deeplabv3_resnet101(
    weights=seg_models.DeepLabV3_ResNet101_Weights.COCO_WITH_VOC_LABELS_V1
)
model.eval()

with torch.no_grad():
    # `tensor` is the normalized [3, 224, 224] image from section 1.1
    output = model(tensor.unsqueeze(0))["out"]  # [1, num_classes, H, W]
    pred = output.argmax(dim=1).squeeze(0)      # [H, W]
    print(f"Segmentation map shape: {pred.shape}")

4.4 Mask R-CNN Instance Segmentation

from torchvision.models.detection import maskrcnn_resnet50_fpn_v2
from torchvision.models.detection import MaskRCNN_ResNet50_FPN_V2_Weights
from torchvision.utils import draw_segmentation_masks
import torchvision.transforms.functional as TF
from PIL import Image

weights = MaskRCNN_ResNet50_FPN_V2_Weights.DEFAULT
model = maskrcnn_resnet50_fpn_v2(weights=weights, box_score_thresh=0.75)
model.eval()

img = Image.open("people.jpg").convert("RGB")
inp = weights.transforms()(img).unsqueeze(0)

with torch.no_grad():
    predictions = model(inp)

pred = predictions[0]
masks = (pred["masks"] > 0.5).squeeze(1)
img_uint8 = TF.pil_to_tensor(img)  # uint8 [3, H, W], as draw_segmentation_masks expects
result = draw_segmentation_masks(img_uint8, masks, alpha=0.5)

5. Vision Transformer: ViT, Swin, DINOv2

5.1 ViT Core Concept

ViT (Vision Transformer) splits an image into fixed-size patches (e.g., 16x16) and feeds each patch as a token into a standard Transformer encoder. Unlike CNNs, it learns global relationships between all patches without a built-in locality inductive bias.
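The patch-embedding step is typically implemented as a stride-16 convolution followed by flattening. A minimal sketch with ViT-Base dimensions (16x16 patches, 768-d embeddings):

```python
import torch
import torch.nn as nn

B, C, H, W, P, D = 2, 3, 224, 224, 16, 768

# A stride-P conv is the standard way to implement ViT patch embedding
patch_embed = nn.Conv2d(C, D, kernel_size=P, stride=P)

x = torch.randn(B, C, H, W)
tokens = patch_embed(x)                     # [B, D, H/P, W/P] = [2, 768, 14, 14]
tokens = tokens.flatten(2).transpose(1, 2)  # [B, num_patches, D] = [2, 196, 768]
print(tokens.shape)

# Prepend a learnable [CLS] token, as in the original ViT
cls = nn.Parameter(torch.zeros(1, 1, D))
tokens = torch.cat([cls.expand(B, -1, -1), tokens], dim=1)
print(tokens.shape)  # [2, 197, 768]
```

From here the 197 tokens pass through standard Transformer encoder blocks, and the [CLS] token's final state feeds the classification head.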

import timm
import torch

# ViT-Base/16 fine-tuning
model = timm.create_model(
    "vit_base_patch16_224",
    pretrained=True,
    num_classes=10,
    img_size=224
)

# ViT benefits from strong augmentation + AdamW + cosine schedule
data_config = timm.data.resolve_model_data_config(model)
transforms_train = timm.data.create_transform(**data_config, is_training=True)
transforms_val = timm.data.create_transform(**data_config, is_training=False)

optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-3, weight_decay=0.05
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100
)

5.2 Swin Transformer: Hierarchical ViT

Swin Transformer introduces hierarchical feature maps and Shifted Window Attention, bringing CNN locality into ViT. Each stage halves resolution and doubles channels, making it FPN-compatible.
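The window partition at the heart of Swin can be sketched in a few lines; this is a simplified version of the operation, and the shifted variant additionally applies `torch.roll` before partitioning:

```python
import torch

def window_partition(x: torch.Tensor, window: int) -> torch.Tensor:
    """Split a [B, H, W, C] feature map into non-overlapping windows,
    returning [num_windows * B, window, window, C]."""
    B, H, W, C = x.shape
    x = x.view(B, H // window, window, W // window, window, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window, window, C)

x = torch.randn(1, 56, 56, 96)          # Swin-T stage-1 feature map
windows = window_partition(x, window=7)
print(windows.shape)  # torch.Size([64, 7, 7, 96]) — attention runs per window
```

Restricting self-attention to 7x7 windows makes the cost linear in image size instead of quadratic, which is what makes Swin practical at high resolutions.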

| Model | Resolution | Parameters | ImageNet Top-1 |
|---|---|---|---|
| ViT-B/16 | 224 | 86M | 81.8% |
| Swin-T | 224 | 28M | 81.3% |
| Swin-B | 224 | 88M | 83.5% |
| Swin-L | 384 | 197M | 87.3% |
| DINOv2-L | 518 | 307M | 86.3% |

5.3 DINOv2: Self-Supervised Learning

DINOv2 is a general-purpose vision encoder trained on large-scale image data without labels, surpassing supervised ImageNet models via self-supervised learning.

import torch
import torchvision.transforms as T

# DINOv2 feature extractor (ready to use without fine-tuning)
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
dinov2.eval().cuda()

preprocess = T.Compose([
    T.Resize(518),
    T.CenterCrop(518),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

with torch.no_grad():
    # [B, 3, 518, 518] -> [B, 1024] features
    features = dinov2(preprocess(img).unsqueeze(0).cuda())
    print(f"Feature dim: {features.shape}")  # [1, 1024]

# DINOv2 achieves strong performance even with k-NN classifiers
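That k-NN claim can be illustrated with scikit-learn on precomputed feature vectors; the random clusters below are stand-ins for real DINOv2 embeddings:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Stand-ins for DINOv2 features: two well-separated clusters of 1024-d vectors
feats_a = rng.normal(0.0, 0.1, size=(50, 1024))
feats_b = rng.normal(1.0, 0.1, size=(50, 1024))
X = np.vstack([feats_a, feats_b])
y = np.array([0] * 50 + [1] * 50)

# No fine-tuning: classify purely by nearest neighbors in feature space
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
query = rng.normal(1.0, 0.1, size=(1, 1024))  # drawn from the class-1 cluster
print(knn.predict(query))  # [1]
```

With a strong frozen encoder, this frozen-features + k-NN recipe is often a competitive baseline before any fine-tuning.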

6. Generative Models: GAN, Diffusion, ControlNet

6.1 Generative Model Comparison

| Model Type | Representative | Training | Strengths | Weaknesses |
|---|---|---|---|---|
| GAN | StyleGAN3 | Adversarial | Fast generation | Unstable training, mode collapse |
| VAE | VQ-VAE-2 | Recon + KL | Stable training | Blurry images |
| Diffusion | DDPM, DDIM | Denoising | Best quality | Slow generation |
| LDM | Stable Diffusion | Latent diffusion | Quality + speed | High GPU memory |

6.2 Stable Diffusion: Latent Diffusion Model

Stable Diffusion is a Latent Diffusion Model (LDM). The U-Net progressively removes noise in latent space.

  • Forward process: Add Gaussian noise to the image over T steps
  • Reverse process: U-Net predicts noise epsilon from noisy latent z_t, timestep t, and text embedding
  • VAE decoder: Restores the final latent vector to pixel space
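The forward process above reduces to one line: z_t = sqrt(ᾱ_t) · z_0 + sqrt(1 − ᾱ_t) · ε, where ᾱ_t is the cumulative product of (1 − β_t) over the noise schedule. A sketch with a linear schedule:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retention

z0 = torch.randn(1, 4, 64, 64)   # clean latent (SD latents are 4-channel)
eps = torch.randn_like(z0)       # Gaussian noise

t = 500
zt = alpha_bar[t].sqrt() * z0 + (1 - alpha_bar[t]).sqrt() * eps
# The U-Net is trained to predict `eps` given (zt, t, text_embedding)
print(f"signal kept at t={t}: {alpha_bar[t].item():.3f}")
```

Because ᾱ_t shrinks toward 0 as t grows, late timesteps are almost pure noise; the reverse process walks this schedule backwards, subtracting the predicted ε at each step.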
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
import torch

# SDXL 1.0 text-to-image generation (SDXL requires the XL pipeline class)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config
)
pipe.to("cuda")

image = pipe(
    prompt="a photorealistic cat on a desk, 8k, studio lighting",
    negative_prompt="blurry, low quality, cartoon",
    num_inference_steps=25,
    guidance_scale=7.5,
    width=1024,
    height=1024
).images[0]
image.save("generated.png")

6.3 ControlNet: Structure-Conditioned Generation

ControlNet adds precise structural control to image generation using edge maps, depth maps, poses, and other spatial conditions.

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image
import torch
import cv2
import numpy as np
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Generate Canny edge map
ref_image = np.array(load_image("reference.jpg"))
edges = cv2.Canny(ref_image, 100, 200)
edges_rgb = np.stack([edges] * 3, axis=-1)

result = pipe(
    "a beautiful landscape painting",
    image=Image.fromarray(edges_rgb),
    num_inference_steps=20
).images[0]
result.save("controlnet_output.png")

7. Production Pipeline: DataLoader to TensorRT

7.1 Custom Dataset and DataLoader

from torch.utils.data import Dataset, DataLoader
from pathlib import Path
import albumentations as A
from albumentations.pytorch import ToTensorV2
import cv2

class CustomDataset(Dataset):
    def __init__(self, root: str, split: str = "train"):
        self.root = Path(root)
        self.image_paths = list(
            (self.root / split / "images").glob("*.jpg")
        )
        self.label_paths = [
            self.root / split / "labels" / p.with_suffix(".txt").name
            for p in self.image_paths
        ]
        self.transform = self._get_transforms(split)

    def _get_transforms(self, split: str):
        if split == "train":
            return A.Compose([
                A.RandomResizedCrop(height=224, width=224, scale=(0.7, 1.0)),  # size=(224, 224) in newer versions
                A.HorizontalFlip(p=0.5),
                A.ColorJitter(brightness=0.4, contrast=0.4,
                              saturation=0.4, hue=0.1, p=0.8),
                A.GaussianBlur(blur_limit=(3, 7), p=0.2),
                A.Normalize(mean=(0.485, 0.456, 0.406),
                            std=(0.229, 0.224, 0.225)),
                ToTensorV2(),
            ])
        return A.Compose([
            A.Resize(256, 256),
            A.CenterCrop(224, 224),
            A.Normalize(mean=(0.485, 0.456, 0.406),
                        std=(0.229, 0.224, 0.225)),
            ToTensorV2(),
        ])

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = cv2.cvtColor(
            cv2.imread(str(self.image_paths[idx])), cv2.COLOR_BGR2RGB
        )
        label = int(self.label_paths[idx].read_text().strip())
        augmented = self.transform(image=image)
        return augmented["image"], label

train_dataset = CustomDataset("data/", split="train")
train_loader = DataLoader(
    train_dataset, batch_size=32, shuffle=True,
    num_workers=4, pin_memory=True, persistent_workers=True
)

7.2 ONNX Export

import torch
import torch.onnx

model.eval()
dummy_input = torch.randn(1, 3, 224, 224, device="cuda")

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    export_params=True,
    opset_version=17,
    do_constant_folding=True,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input":  {0: "batch_size"},
        "output": {0: "batch_size"}
    }
)
print("ONNX export complete: model.onnx")

# Verify with ONNX Runtime
import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession(
    "model.onnx", providers=["CUDAExecutionProvider"]
)
input_name = sess.get_inputs()[0].name
result = sess.run(None, {input_name: dummy_input.cpu().numpy()})
print(f"ONNX inference result shape: {result[0].shape}")

7.3 TensorRT Optimization

# TensorRT conversion using trtexec
trtexec \
  --onnx=model.onnx \
  --saveEngine=model_fp16.engine \
  --fp16 \
  --workspace=4096 \
  --optShapes=input:8x3x224x224

# Python-side inference with the serialized engine
import tensorrt as trt
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit

def trt_inference(engine_path: str, input_data: np.ndarray) -> np.ndarray:
    logger = trt.Logger(trt.Logger.WARNING)
    with open(engine_path, "rb") as f:
        engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
    context = engine.create_execution_context()

    input_mem = cuda.mem_alloc(input_data.nbytes)
    output = np.empty((input_data.shape[0], 1000), dtype=np.float32)  # 1000 = ImageNet classes; adjust for your model
    output_mem = cuda.mem_alloc(output.nbytes)

    cuda.memcpy_htod(input_mem, input_data)
    context.execute_v2([int(input_mem), int(output_mem)])
    cuda.memcpy_dtoh(output, output_mem)
    return output

Quiz: Deep Understanding Check

Q1. How do ResNet skip connections solve the vanishing gradient problem?

Answer: During backpropagation, gradients flow directly through the skip connection, preventing them from vanishing in deep layers.

Explanation: In a plain deep network, backpropagated gradients shrink multiplicatively through each layer. ResNet's residual block F(x) + x differentiates to dF/dx + 1: the identity term always passes a gradient of 1 back through the skip path, no matter how small dF/dx becomes, so the signal cannot vanish. This enables stable training even beyond 100 layers.

Q2. Why is YOLO better suited for real-time inference than Faster R-CNN?

Answer: YOLO completes detection in a single forward pass, while Faster R-CNN requires two stages: a Region Proposal Network and a separate classifier.

Explanation: Faster R-CNN follows a pipeline of (1) region proposals via RPN, (2) RoI Pooling, and (3) classification and bounding-box regression. YOLO divides the image into a grid and simultaneously predicts all boxes and classes in one CNN pass. YOLOv8n achieves 80+ FPS on an A100 GPU, making it suitable for real-time applications.

Q3. Why does Vision Transformer outperform CNN on large-scale data?

Answer: ViT's Self-Attention learns global relationships between all patches without inductive bias, discovering optimal representations directly from data.

Explanation: CNNs encode locality and translation equivariance as inductive biases. These biases help with small datasets but limit representational capacity at scale. Given sufficient data (e.g., JFT-300M), ViT freely learns global patterns without these constraints, surpassing CNNs in accuracy.

Q4. What is the U-Net's role in Stable Diffusion's denoising diffusion process?

Answer: The U-Net predicts and removes the noise added to the latent vector at each timestep, and integrates the text condition (CLIP embedding) via cross-attention.

Explanation: In the forward process, Gaussian noise is added to the image latent over T steps. In the reverse process, the U-Net receives the noisy latent z_t, timestep t, and text embedding, and predicts the noise component epsilon. The VAE decoder then reconstructs the final image from the denoised latent.

Q5. How does SAM's prompt-based segmentation differ from conventional methods?

Answer: SAM segments arbitrary objects zero-shot from various prompts (points, boxes, masks) without task-specific training.

Explanation: Traditional segmentation models (DeepLab, Mask R-CNN) are trained with supervision on a fixed set of classes. SAM is a general-purpose model trained on the SA-1B dataset (1 billion masks), which segments any region specified by the user regardless of class. Its 3-module architecture — Image Encoder (ViT-H), Prompt Encoder, and Mask Decoder — separates encoding from flexible prompt conditioning.


Wrap-Up: Learning Roadmap

Computer vision evolves rapidly. Here is a recommended learning path:

  1. Foundations: OpenCV, NumPy image manipulation → torchvision hands-on
  2. Classification: ResNet/EfficientNet fine-tuning → custom dataset
  3. Detection: YOLOv8 experiments → custom training → ONNX/TensorRT deployment
  4. Segmentation: SAM exploration → Mask R-CNN / DeepLabV3+ custom training
  5. Advanced: ViT/DINOv2 feature extraction → Stable Diffusion fine-tuning

The fastest path to mastery is applying each concept in Kaggle competitions or real-world projects. Welcome to the world of computer vision!