Computer Vision Complete Guide: CNN, ViT, YOLO, and Stable Diffusion

Introduction

Computer Vision is the field of AI that enables machines to understand images and video. From smartphone face unlock to autonomous vehicle obstacle detection to medical imaging assistance — computer vision powers them all.

This guide takes you from pixel-level fundamentals through Vision Transformers, Stable Diffusion, and production deployment with PyTorch code examples throughout.


1. Image Fundamentals: Pixels, Channels, Convolution

1.1 Digital Image Structure

A digital image is a 2D grid of pixels.

  • Grayscale image: 2D array of shape H x W, pixel values 0–255
  • Color image (RGB): 3D tensor of shape H x W x 3, each channel 0–255
  • Resolution: image dimensions (e.g., 1920x1080), pixel density (DPI)
import torch
import torchvision.transforms as T
from PIL import Image

# Load image and convert to tensor
img = Image.open("sample.jpg").convert("RGB")
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),          # [0,255] -> [0.0,1.0], HWC -> CHW
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225])
])
tensor = transform(img)   # shape: [3, 224, 224]
print(f"Shape: {tensor.shape}, dtype: {tensor.dtype}")

1.2 Convolution and Common Filters

Convolution slides a small kernel (filter) across the entire image to extract features.

import torch
import torch.nn.functional as F
import torchvision.transforms as T

# 3x3 Sobel edge detection kernel (horizontal gradients)
kernel = torch.tensor([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1]
], dtype=torch.float32).unsqueeze(0).unsqueeze(0)  # [1, 1, 3, 3]

# Apply convolution to a grayscale image (`img` is the PIL image loaded in 1.1)
gray_tensor = T.ToTensor()(T.Grayscale()(img)).unsqueeze(0)  # [1, 1, H, W]
edges = F.conv2d(gray_tensor, kernel, padding=1)
| Kernel Type | Purpose | Use Case |
|---|---|---|
| Sobel | Edge detection (H/V) | Lane detection in self-driving |
| Gaussian | Blurring, noise removal | Image pre-processing |
| Laplacian | Edge sharpening | Image enhancement |
| Average | Mean blur | Downsampling |
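Every filter in the table plugs into the same `F.conv2d` call. As a minimal sketch, here is the Gaussian case with the standard 3x3 binomial kernel; a random tensor stands in for a real grayscale image:

```python
import torch
import torch.nn.functional as F

# 3x3 Gaussian kernel (binomial approximation), normalized to sum to 1
gaussian = torch.tensor([
    [1., 2., 1.],
    [2., 4., 2.],
    [1., 2., 1.]
]) / 16.0
gaussian = gaussian.view(1, 1, 3, 3)  # [out_ch, in_ch, kH, kW]

img = torch.rand(1, 1, 64, 64)        # stand-in for a grayscale image batch
blurred = F.conv2d(img, gaussian, padding=1)
print(blurred.shape)  # torch.Size([1, 1, 64, 64])
```

Swapping in the Sobel or Laplacian kernel changes only the tensor values, not the call.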

1.3 Albumentations Augmentation Pipeline

import albumentations as A
from albumentations.pytorch import ToTensorV2
import cv2

train_transform = A.Compose([
    A.RandomResizedCrop(height=224, width=224, scale=(0.7, 1.0)),  # size=(224, 224) in newer versions
    A.HorizontalFlip(p=0.5),
    A.ColorJitter(brightness=0.4, contrast=0.4,
                  saturation=0.4, hue=0.1, p=0.8),
    A.GaussianBlur(blur_limit=(3, 7), p=0.2),
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1,
                       rotate_limit=15, p=0.5),
    A.Normalize(mean=(0.485, 0.456, 0.406),
                std=(0.229, 0.224, 0.225)),
    ToTensorV2(),
])

image = cv2.cvtColor(cv2.imread("sample.jpg"), cv2.COLOR_BGR2RGB)
augmented = train_transform(image=image)["image"]

2. CNN Architecture Evolution

2.1 Major Architecture Timeline

| Year | Architecture | Key Contribution | ImageNet Top-1 |
|---|---|---|---|
| 1998 | LeNet-5 | First practical CNN, conv+pool structure | - |
| 2012 | AlexNet | ReLU, Dropout, GPU training | 63.3% |
| 2014 | VGGNet | Stacking deep 3x3 kernels | 74.4% |
| 2015 | ResNet-50 | Skip connections, residual learning | 76.0% |
| 2017 | DenseNet | Direct connections between all layers | 77.4% |
| 2019 | EfficientNet-B7 | Compound scaling | 84.4% |
| 2022 | ConvNeXt-L | Transformer design principles in CNN | 86.6% |

2.2 ResNet: The Residual Learning Revolution

ResNet's key innovation is the skip connection (residual connection): adding the input x directly to the block output resolves the vanishing gradient problem. The residual output F(x) + x differentiates to dF/dx + 1, so the identity term always passes a gradient of 1 back through the skip path, no matter how small dF/dx becomes.
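The idea is easy to see in code. A minimal residual block sketch (simplified relative to torchvision's actual blocks, but structurally the same):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # "+ x" is the skip path the gradient flows through

x = torch.randn(2, 64, 32, 32)
block = ResidualBlock(64)
print(block(x).shape)  # torch.Size([2, 64, 32, 32])
```

Stacking such blocks is what lets the classifier below reach 50+ layers without vanishing gradients.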

import torch
import torch.nn as nn
import torchvision.models as models

class ResNetClassifier(nn.Module):
    def __init__(self, num_classes: int, pretrained: bool = True):
        super().__init__()
        weights = models.ResNet50_Weights.IMAGENET1K_V2 if pretrained else None
        self.backbone = models.resnet50(weights=weights)
        in_features = self.backbone.fc.in_features
        self.backbone.fc = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(in_features, num_classes)
        )

    def forward(self, x):
        return self.backbone(x)

def train_one_epoch(model, loader, optimizer, criterion, device):
    model.train()
    total_loss, correct = 0.0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * images.size(0)
        correct += (outputs.argmax(1) == labels).sum().item()
    return total_loss / len(loader.dataset), correct / len(loader.dataset)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ResNetClassifier(num_classes=10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
criterion = nn.CrossEntropyLoss()

2.3 EfficientNet: Compound Scaling

EfficientNet introduces a compound coefficient that scales width, depth, and resolution together with a fixed ratio, achieving superior accuracy/efficiency trade-offs.
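Concretely, with the base coefficients α = 1.2 (depth), β = 1.1 (width), γ = 1.15 (resolution) reported in the EfficientNet paper, the compound coefficient φ scales each dimension exponentially. The base depth/width/resolution values below are illustrative, not the exact B0 configuration:

```python
# EfficientNet compound scaling: depth ~ alpha^phi, width ~ beta^phi,
# resolution ~ gamma^phi, chosen so alpha * beta^2 * gamma^2 ≈ 2
alpha, beta, gamma = 1.2, 1.1, 1.15

def scaled_dims(phi: int, base_depth=18, base_width=1.0, base_res=224):
    return (
        round(base_depth * alpha ** phi),    # number of layers
        round(base_width * beta ** phi, 2),  # channel-width multiplier
        round(base_res * gamma ** phi),      # input resolution
    )

for phi in range(4):
    d, w, r = scaled_dims(phi)
    print(f"phi={phi}: {d} layers, width x{w}, resolution {r}")
```

Doubling φ roughly doubles FLOPs, which is why each B0→B7 step trades compute for accuracy in a controlled way.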

import timm

# EfficientNet-B4 fine-tuning
model = timm.create_model(
    "efficientnet_b4",
    pretrained=True,
    num_classes=100,
    drop_rate=0.3
)

# Freeze backbone (feature extraction mode)
for name, param in model.named_parameters():
    if "classifier" not in name:
        param.requires_grad = False

2.4 ConvNeXt: Modernizing CNN

ConvNeXt applies Transformer design principles (large kernels, LayerNorm, GELU, fewer activations) to a pure CNN, matching Swin Transformer performance with comparable speed.

model = timm.create_model(
    "convnext_large",
    pretrained=True,
    num_classes=1000
)
# Uses 7x7 depthwise conv, LayerNorm, GELU activation

3. Object Detection: YOLO, DETR, Faster R-CNN

3.1 Detection Approach Comparison

| Approach | Model | Speed | Accuracy | Notes |
|---|---|---|---|---|
| Anchor-based 2-stage | Faster R-CNN | Slow | High | RPN + classifier separated |
| Anchor-based 1-stage | YOLOv8 | Fast | Med-High | Single-pass inference |
| Anchor-free 1-stage | YOLOv10, FCOS | Very fast | High | NMS-free, no anchors |
| Transformer | DETR | Medium | High | End-to-end, relational modeling |

3.2 YOLOv8 in Practice

from ultralytics import YOLO

# Load pretrained model
model = YOLO("yolov8n.pt")   # nano: speed first
# model = YOLO("yolov8x.pt") # extra-large: accuracy first

# Single image inference
results = model("image.jpg", conf=0.25, iou=0.45)
for result in results:
    for box in result.boxes:
        cls = int(box.cls[0])
        conf = float(box.conf[0])
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"Class: {result.names[cls]}, Conf: {conf:.2f}")

# Fine-tune on custom dataset
model.train(
    data="custom.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    lr0=0.01,
    device=0
)

# Evaluation
metrics = model.val()
print(f"mAP50: {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")

3.3 YOLOv10: NMS-Free Detection

YOLOv10 eliminates the Non-Maximum Suppression (NMS) post-processing step through dual label assignment and consistency matching, enabling true end-to-end inference with reduced latency.
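To appreciate what is being removed, here is a minimal sketch of the classic greedy NMS step that anchor-based detectors run after inference:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, format [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy NMS: keep the highest-scoring box, drop overlapping neighbors."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        overlaps = iou(boxes[i], boxes[order[1:]])
        order = order[1:][overlaps < iou_thresh]
    return keep

boxes = np.array([[0, 0, 100, 100], [10, 10, 110, 110], [200, 200, 300, 300]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2] — the two overlapping boxes collapse to one
```

YOLOv10's one-to-one assignment during training means each object already gets a single box, so this whole loop disappears from the inference path.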

from ultralytics import YOLO

model = YOLO("yolov10n.pt")
# Returns final detections directly without NMS — lower latency
results = model.predict("video.mp4", stream=True)
for frame_result in results:
    print(frame_result.boxes)

3.4 DETR: Detection Transformer

DETR uses bipartite matching loss to predict the final set of boxes directly, eliminating both anchors and NMS.
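The bipartite matching can be sketched with SciPy's Hungarian-algorithm solver. The toy cost matrix below stands in for DETR's real cost, which is a weighted sum of class probability, box L1, and GIoU terms:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: rows = predicted queries, cols = ground-truth objects
cost = np.array([
    [0.9, 0.1, 0.8],   # query 0 matches GT 1 cheaply
    [0.2, 0.7, 0.9],   # query 1 matches GT 0 cheaply
    [0.8, 0.9, 0.3],   # query 2 matches GT 2 cheaply
])

# One-to-one assignment minimizing total cost (Hungarian algorithm)
pred_idx, gt_idx = linear_sum_assignment(cost)
print(list(zip(pred_idx.tolist(), gt_idx.tolist())))  # [(0, 1), (1, 0), (2, 2)]
```

Because each ground-truth object is matched to exactly one query, duplicate predictions are penalized during training and NMS becomes unnecessary at inference.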

import torch
from transformers import DetrImageProcessor, DetrForObjectDetection
from PIL import Image

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

img = Image.open("image.jpg")
inputs = processor(images=img, return_tensors="pt")
outputs = model(**inputs)

target_sizes = torch.tensor([img.size[::-1]])
results = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.9
)[0]

for score, label, box in zip(
    results["scores"], results["labels"], results["boxes"]
):
    name = model.config.id2label[label.item()]
    print(f"{name}: {score:.3f} at {[round(i, 2) for i in box.tolist()]}")

4. Segmentation: DeepLab, Mask R-CNN, SAM

4.1 Segmentation Task Types

  • Semantic: Class label per pixel (car, road, sky...)
  • Instance: Distinguishes individual objects of the same class (car1, car2...)
  • Panoptic: Combines semantic and instance segmentation
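Semantic segmentation quality is usually reported as mean IoU (mIoU): per-class intersection-over-union over pixels, averaged across classes. A minimal sketch on toy label maps:

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean IoU over classes that appear in the prediction or the target."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

pred   = np.array([[0, 0, 1], [1, 1, 2], [2, 2, 2]])
target = np.array([[0, 0, 1], [1, 2, 2], [2, 2, 2]])
print(round(mean_iou(pred, target, num_classes=3), 3))  # 0.822
```

Instance and panoptic tasks use matching-based metrics (mask AP, PQ) instead, since per-pixel class labels alone cannot distinguish object identities.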

4.2 SAM: Segment Anything Model

Meta's SAM accepts prompts (points, boxes, masks) and segments any object zero-shot. It uses a 3-module architecture: Image Encoder (ViT-H), Prompt Encoder, and Mask Decoder, trained on the SA-1B dataset (1 billion masks).

import torch
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")
predictor = SamPredictor(sam)

# Set image
image = cv2.cvtColor(cv2.imread("image.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Point-prompted segmentation
input_point = np.array([[500, 375]])  # click location (x, y)
input_label = np.array([1])           # 1=foreground, 0=background

masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True   # return multiple candidate masks
)

best_mask = masks[scores.argmax()]
print(f"Mask shape: {best_mask.shape}, Score: {scores.max():.3f}")

# Box-prompted segmentation
input_box = np.array([100, 100, 400, 400])
masks_box, _, _ = predictor.predict(
    box=input_box[None, :],
    multimask_output=False
)

4.3 DeepLabV3+ Semantic Segmentation

DeepLabV3+ uses ASPP (Atrous Spatial Pyramid Pooling) to capture multi-scale context information.
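An atrous (dilated) convolution with rate r inserts r-1 gaps between kernel taps, enlarging the receptive field with no extra parameters. A quick sketch comparing rates 1 and 4 on the same kernel:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 32, 32)
kernel = torch.randn(1, 1, 3, 3)

# Standard 3x3 conv: 3x3 receptive field
y1 = F.conv2d(x, kernel, padding=1, dilation=1)
# Atrous 3x3 conv, rate 4: 9x9 receptive field, identical parameter count
y2 = F.conv2d(x, kernel, padding=4, dilation=4)

print(y1.shape, y2.shape)  # same spatial size thanks to matching padding
```

ASPP runs several such branches with different rates in parallel and concatenates them, which is how it captures context at multiple scales.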

import torch
import torchvision.models.segmentation as seg_models

model = seg_models.deeplabv3_resnet101(
    weights=seg_models.DeepLabV3_ResNet101_Weights.COCO_WITH_VOC_LABELS_V1
)
model.eval()

with torch.no_grad():
    # `tensor` is the normalized [3, 224, 224] image from section 1.1
    output = model(tensor.unsqueeze(0))["out"]  # [1, num_classes, H, W]
    pred = output.argmax(dim=1).squeeze(0)      # [H, W]
    print(f"Segmentation map shape: {pred.shape}")

4.4 Mask R-CNN Instance Segmentation

from torchvision.models.detection import maskrcnn_resnet50_fpn_v2
from torchvision.models.detection import MaskRCNN_ResNet50_FPN_V2_Weights
from torchvision.utils import draw_segmentation_masks
import torchvision.transforms.functional as TF
from PIL import Image

weights = MaskRCNN_ResNet50_FPN_V2_Weights.DEFAULT
model = maskrcnn_resnet50_fpn_v2(weights=weights, box_score_thresh=0.75)
model.eval()

img = Image.open("people.jpg").convert("RGB")
inp = weights.transforms()(img).unsqueeze(0)

with torch.no_grad():
    predictions = model(inp)

pred = predictions[0]
masks = (pred["masks"] > 0.5).squeeze(1)
img_uint8 = TF.pil_to_tensor(img)  # uint8 [3, H, W], as draw_segmentation_masks expects
result = draw_segmentation_masks(img_uint8, masks, alpha=0.5)

5. Vision Transformer: ViT, Swin, DINOv2

5.1 ViT Core Concept

ViT (Vision Transformer) splits an image into fixed-size patches (e.g., 16x16) and feeds each patch as a token into a standard Transformer encoder. Unlike CNNs, it learns global relationships between all patches without a built-in locality inductive bias.
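The patch-embedding step is typically implemented as a stride-16 convolution followed by flattening. A minimal sketch with ViT-Base dimensions (16x16 patches, 768-d embeddings):

```python
import torch
import torch.nn as nn

B, C, H, W, P, D = 2, 3, 224, 224, 16, 768

# A stride-P conv is the standard way to implement ViT patch embedding
patch_embed = nn.Conv2d(C, D, kernel_size=P, stride=P)

x = torch.randn(B, C, H, W)
tokens = patch_embed(x)                     # [B, D, H/P, W/P] = [2, 768, 14, 14]
tokens = tokens.flatten(2).transpose(1, 2)  # [B, num_patches, D] = [2, 196, 768]
print(tokens.shape)

# Prepend a learnable [CLS] token, as in the original ViT
cls = nn.Parameter(torch.zeros(1, 1, D))
tokens = torch.cat([cls.expand(B, -1, -1), tokens], dim=1)
print(tokens.shape)  # [2, 197, 768]
```

From here the 197 tokens pass through standard Transformer encoder blocks, and the [CLS] token's final state feeds the classification head.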

import timm
import torch

# ViT-Base/16 fine-tuning
model = timm.create_model(
    "vit_base_patch16_224",
    pretrained=True,
    num_classes=10,
    img_size=224
)

# ViT benefits from strong augmentation + AdamW + cosine schedule
data_config = timm.data.resolve_model_data_config(model)
transforms_train = timm.data.create_transform(**data_config, is_training=True)
transforms_val = timm.data.create_transform(**data_config, is_training=False)

optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-3, weight_decay=0.05
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100
)

5.2 Swin Transformer: Hierarchical ViT

Swin Transformer introduces hierarchical feature maps and Shifted Window Attention, bringing CNN locality into ViT. Each stage halves resolution and doubles channels, making it FPN-compatible.
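The window partition at the heart of Swin can be sketched in a few lines; this is a simplified version of the operation, and the shifted variant additionally applies `torch.roll` before partitioning:

```python
import torch

def window_partition(x: torch.Tensor, window: int) -> torch.Tensor:
    """Split a [B, H, W, C] feature map into non-overlapping windows,
    returning [num_windows * B, window, window, C]."""
    B, H, W, C = x.shape
    x = x.view(B, H // window, window, W // window, window, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window, window, C)

x = torch.randn(1, 56, 56, 96)          # Swin-T stage-1 feature map
windows = window_partition(x, window=7)
print(windows.shape)  # torch.Size([64, 7, 7, 96]) — attention runs per window
```

Restricting self-attention to 7x7 windows makes the cost linear in image size instead of quadratic, which is what makes Swin practical at high resolutions.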

| Model | Resolution | Parameters | ImageNet Top-1 |
|---|---|---|---|
| ViT-B/16 | 224 | 86M | 81.8% |
| Swin-T | 224 | 28M | 81.3% |
| Swin-B | 224 | 88M | 83.5% |
| Swin-L | 384 | 197M | 87.3% |
| DINOv2-L | 518 | 307M | 86.3% |

5.3 DINOv2: Self-Supervised Learning

DINOv2 is a general-purpose vision encoder trained on large-scale image data without labels, surpassing supervised ImageNet models via self-supervised learning.

import torch
import torchvision.transforms as T

# DINOv2 feature extractor (ready to use without fine-tuning)
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
dinov2.eval().cuda()

preprocess = T.Compose([
    T.Resize(518),
    T.CenterCrop(518),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

with torch.no_grad():
    # [B, 3, 518, 518] -> [B, 1024] features
    features = dinov2(preprocess(img).unsqueeze(0).cuda())
    print(f"Feature dim: {features.shape}")  # [1, 1024]

# DINOv2 achieves strong performance even with k-NN classifiers
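That k-NN claim can be illustrated with scikit-learn on precomputed feature vectors; the random clusters below are stand-ins for real DINOv2 embeddings:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Stand-ins for DINOv2 features: two well-separated clusters of 1024-d vectors
feats_a = rng.normal(0.0, 0.1, size=(50, 1024))
feats_b = rng.normal(1.0, 0.1, size=(50, 1024))
X = np.vstack([feats_a, feats_b])
y = np.array([0] * 50 + [1] * 50)

# No fine-tuning: classify purely by nearest neighbors in feature space
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
query = rng.normal(1.0, 0.1, size=(1, 1024))  # drawn from the class-1 cluster
print(knn.predict(query))  # [1]
```

With a strong frozen encoder, this frozen-features + k-NN recipe is often a competitive baseline before any fine-tuning.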

6. Generative Models: GAN, Diffusion, ControlNet

6.1 Generative Model Comparison

| Model Type | Representative | Training | Strengths | Weaknesses |
|---|---|---|---|---|
| GAN | StyleGAN3 | Adversarial | Fast generation | Unstable training, mode collapse |
| VAE | VQ-VAE-2 | Recon + KL | Stable training | Blurry images |
| Diffusion | DDPM, DDIM | Denoising | Best quality | Slow generation |
| LDM | Stable Diffusion | Latent diffusion | Quality + speed | High GPU memory |

6.2 Stable Diffusion: Latent Diffusion Model

Stable Diffusion is a Latent Diffusion Model (LDM). The U-Net progressively removes noise in latent space.

  • Forward process: Add Gaussian noise to the image over T steps
  • Reverse process: U-Net predicts noise epsilon from noisy latent z_t, timestep t, and text embedding
  • VAE decoder: Restores the final latent vector to pixel space
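The forward process above reduces to one line: z_t = sqrt(ᾱ_t) · z_0 + sqrt(1 − ᾱ_t) · ε, where ᾱ_t is the cumulative product of (1 − β_t) over the noise schedule. A sketch with a linear schedule:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retention

z0 = torch.randn(1, 4, 64, 64)   # clean latent (SD latents are 4-channel)
eps = torch.randn_like(z0)       # Gaussian noise

t = 500
zt = alpha_bar[t].sqrt() * z0 + (1 - alpha_bar[t]).sqrt() * eps
# The U-Net is trained to predict `eps` given (zt, t, text_embedding)
print(f"signal kept at t={t}: {alpha_bar[t].item():.3f}")
```

Because ᾱ_t shrinks toward 0 as t grows, late timesteps are almost pure noise; the reverse process walks this schedule backwards, subtracting the predicted ε at each step.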
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
import torch

# SDXL 1.0 text-to-image generation (SDXL requires the XL pipeline class)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config
)
pipe.to("cuda")

image = pipe(
    prompt="a photorealistic cat on a desk, 8k, studio lighting",
    negative_prompt="blurry, low quality, cartoon",
    num_inference_steps=25,
    guidance_scale=7.5,
    width=1024,
    height=1024
).images[0]
image.save("generated.png")

6.3 ControlNet: Structure-Conditioned Generation

ControlNet adds precise structural control to image generation using edge maps, depth maps, poses, and other spatial conditions.

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image
import torch
import cv2
import numpy as np
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Generate Canny edge map
ref_image = np.array(load_image("reference.jpg"))
edges = cv2.Canny(ref_image, 100, 200)
edges_rgb = np.stack([edges] * 3, axis=-1)

result = pipe(
    "a beautiful landscape painting",
    image=Image.fromarray(edges_rgb),
    num_inference_steps=20
).images[0]
result.save("controlnet_output.png")

7. Production Pipeline: DataLoader to TensorRT

7.1 Custom Dataset and DataLoader

from torch.utils.data import Dataset, DataLoader
from pathlib import Path
import albumentations as A
from albumentations.pytorch import ToTensorV2
import cv2

class CustomDataset(Dataset):
    def __init__(self, root: str, split: str = "train"):
        self.root = Path(root)
        self.image_paths = list(
            (self.root / split / "images").glob("*.jpg")
        )
        self.label_paths = [
            self.root / split / "labels" / p.with_suffix(".txt").name
            for p in self.image_paths
        ]
        self.transform = self._get_transforms(split)

    def _get_transforms(self, split: str):
        if split == "train":
            return A.Compose([
                A.RandomResizedCrop(height=224, width=224, scale=(0.7, 1.0)),  # size=(224, 224) in newer versions
                A.HorizontalFlip(p=0.5),
                A.ColorJitter(brightness=0.4, contrast=0.4,
                              saturation=0.4, hue=0.1, p=0.8),
                A.GaussianBlur(blur_limit=(3, 7), p=0.2),
                A.Normalize(mean=(0.485, 0.456, 0.406),
                            std=(0.229, 0.224, 0.225)),
                ToTensorV2(),
            ])
        return A.Compose([
            A.Resize(256, 256),
            A.CenterCrop(224, 224),
            A.Normalize(mean=(0.485, 0.456, 0.406),
                        std=(0.229, 0.224, 0.225)),
            ToTensorV2(),
        ])

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = cv2.cvtColor(
            cv2.imread(str(self.image_paths[idx])), cv2.COLOR_BGR2RGB
        )
        label = int(self.label_paths[idx].read_text().strip())
        augmented = self.transform(image=image)
        return augmented["image"], label

train_dataset = CustomDataset("data/", split="train")
train_loader = DataLoader(
    train_dataset, batch_size=32, shuffle=True,
    num_workers=4, pin_memory=True, persistent_workers=True
)

7.2 ONNX Export

import torch
import torch.onnx

model.eval()
dummy_input = torch.randn(1, 3, 224, 224, device="cuda")

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    export_params=True,
    opset_version=17,
    do_constant_folding=True,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input":  {0: "batch_size"},
        "output": {0: "batch_size"}
    }
)
print("ONNX export complete: model.onnx")

# Verify with ONNX Runtime
import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession(
    "model.onnx", providers=["CUDAExecutionProvider"]
)
input_name = sess.get_inputs()[0].name
result = sess.run(None, {input_name: dummy_input.cpu().numpy()})
print(f"ONNX inference result shape: {result[0].shape}")

7.3 TensorRT Optimization

# TensorRT conversion using trtexec
trtexec \
  --onnx=model.onnx \
  --saveEngine=model_fp16.engine \
  --fp16 \
  --workspace=4096 \
  --optShapes=input:8x3x224x224

# Python-side inference with the serialized engine
import tensorrt as trt
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit

def trt_inference(engine_path: str, input_data: np.ndarray) -> np.ndarray:
    logger = trt.Logger(trt.Logger.WARNING)
    with open(engine_path, "rb") as f:
        engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
    context = engine.create_execution_context()

    input_mem = cuda.mem_alloc(input_data.nbytes)
    output = np.empty((input_data.shape[0], 1000), dtype=np.float32)  # 1000 = ImageNet classes; adjust for your model
    output_mem = cuda.mem_alloc(output.nbytes)

    cuda.memcpy_htod(input_mem, input_data)
    context.execute_v2([int(input_mem), int(output_mem)])
    cuda.memcpy_dtoh(output, output_mem)
    return output

Quiz: Deep Understanding Check

Q1. How do ResNet skip connections solve the vanishing gradient problem?

Answer: During backpropagation, gradients flow directly through the skip connection, preventing them from vanishing in deep layers.

Explanation: In a plain deep network, backpropagated gradients shrink multiplicatively through each layer. ResNet's residual block F(x) + x differentiates to dF/dx + 1: the identity term always passes a gradient of 1 back through the skip path, no matter how small dF/dx becomes, so the signal cannot vanish. This enables stable training even beyond 100 layers.

Q2. Why is YOLO better suited for real-time inference than Faster R-CNN?

Answer: YOLO completes detection in a single forward pass, while Faster R-CNN requires two stages: a Region Proposal Network and a separate classifier.

Explanation: Faster R-CNN follows a pipeline of (1) region proposals via RPN, (2) RoI Pooling, and (3) classification and bounding-box regression. YOLO divides the image into a grid and simultaneously predicts all boxes and classes in one CNN pass. YOLOv8n achieves 80+ FPS on an A100 GPU, making it suitable for real-time applications.

Q3. Why does Vision Transformer outperform CNN on large-scale data?

Answer: ViT's Self-Attention learns global relationships between all patches without inductive bias, discovering optimal representations directly from data.

Explanation: CNNs encode locality and translation equivariance as inductive biases. These biases help with small datasets but limit representational capacity at scale. Given sufficient data (e.g., JFT-300M), ViT freely learns global patterns without these constraints, surpassing CNNs in accuracy.

Q4. What is the U-Net's role in Stable Diffusion's denoising diffusion process?

Answer: The U-Net predicts and removes the noise added to the latent vector at each timestep, and integrates the text condition (CLIP embedding) via cross-attention.

Explanation: In the forward process, Gaussian noise is added to the image latent over T steps. In the reverse process, the U-Net receives the noisy latent z_t, timestep t, and text embedding, and predicts the noise component epsilon. The VAE decoder then reconstructs the final image from the denoised latent.

Q5. How does SAM's prompt-based segmentation differ from conventional methods?

Answer: SAM segments arbitrary objects zero-shot from various prompts (points, boxes, masks) without task-specific training.

Explanation: Traditional segmentation models (DeepLab, Mask R-CNN) are trained with supervision on a fixed set of classes. SAM is a general-purpose model trained on the SA-1B dataset (1 billion masks), which segments any region specified by the user regardless of class. Its 3-module architecture — Image Encoder (ViT-H), Prompt Encoder, and Mask Decoder — separates encoding from flexible prompt conditioning.


Wrap-Up: Learning Roadmap

Computer vision evolves rapidly. Here is a recommended learning path:

  1. Foundations: OpenCV, NumPy image manipulation → torchvision hands-on
  2. Classification: ResNet/EfficientNet fine-tuning → custom dataset
  3. Detection: YOLOv8 experiments → custom training → ONNX/TensorRT deployment
  4. Segmentation: SAM exploration → Mask R-CNN / DeepLabV3+ custom training
  5. Advanced: ViT/DINOv2 feature extraction → Stable Diffusion fine-tuning

The fastest path to mastery is applying each concept in Kaggle competitions or real-world projects. Welcome to the world of computer vision!