- Author: Youngju Kim (@fjvbn20031)
- Introduction
- 1. Image Fundamentals: Pixels, Channels, Convolution
- 2. CNN Architecture Evolution
- 3. Object Detection: YOLO, DETR, Faster R-CNN
- 4. Segmentation: DeepLab, Mask R-CNN, SAM
- 5. Vision Transformer: ViT, Swin, DINOv2
- 6. Generative Models: GAN, Diffusion, ControlNet
- 7. Production Pipeline: DataLoader to TensorRT
- Quiz: Deep Understanding Check
- Wrap-Up: Learning Roadmap
Introduction
Computer Vision is the field of AI that enables machines to understand images and video. From smartphone face unlock to autonomous vehicle obstacle detection to medical imaging assistance — computer vision powers them all.
This guide takes you from pixel-level fundamentals through Vision Transformers, Stable Diffusion, and production deployment with PyTorch code examples throughout.
1. Image Fundamentals: Pixels, Channels, Convolution
1.1 Digital Image Structure
A digital image is a 2D grid of pixels.
- Grayscale image: 2D array of shape H x W, pixel values 0–255
- Color image (RGB): 3D tensor of shape H x W x 3, each channel 0–255
- Resolution: image dimensions (e.g., 1920x1080), pixel density (DPI)
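These shapes are easy to verify with a few lines of NumPy on a synthetic image (the luminance weights below are the standard ITU-R BT.601 coefficients, shown for illustration):

```python
import numpy as np

# Synthetic 4x4 RGB image: shape (H, W, 3), uint8 values in [0, 255]
rgb = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)

# Grayscale via the ITU-R BT.601 luminance weights: result is (H, W)
gray = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

print(rgb.shape)   # (4, 4, 3)
print(gray.shape)  # (4, 4)
```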
import torch
import torchvision.transforms as T
from PIL import Image
# Load image and convert to tensor
img = Image.open("sample.jpg").convert("RGB")
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),  # [0,255] -> [0.0,1.0], HWC -> CHW
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225])
])
tensor = transform(img) # shape: [3, 224, 224]
print(f"Shape: {tensor.shape}, dtype: {tensor.dtype}")
1.2 Convolution and Common Filters
Convolution slides a small kernel (filter) across the entire image to extract features.
import torch
import torch.nn.functional as F
# 3x3 Sobel edge detection kernel (horizontal)
kernel = torch.tensor([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1]
], dtype=torch.float32).unsqueeze(0).unsqueeze(0)  # [1, 1, 3, 3]
# Apply convolution to the grayscale image (reuses `img` and `T` from 1.1)
gray_tensor = T.ToTensor()(T.Grayscale()(img)).unsqueeze(0)
edges = F.conv2d(gray_tensor, kernel, padding=1)
| Kernel Type | Purpose | Use Case |
|---|---|---|
| Sobel | Edge detection (H/V) | Lane detection in self-driving |
| Gaussian | Blurring, noise removal | Image pre-processing |
| Laplacian | Edge sharpening | Image enhancement |
| Average | Mean blur | Downsampling |
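As a concrete instance of the table's average kernel, a 3x3 mean blur can be applied by hand with `F.conv2d` (toy input tensor, purely illustrative):

```python
import torch
import torch.nn.functional as F

# 3x3 average (mean-blur) kernel: every weight is 1/9
kernel = torch.full((1, 1, 3, 3), 1.0 / 9.0)

# A toy single-channel image batch in [N, C, H, W] layout
img = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)

blurred = F.conv2d(img, kernel, padding=1)
print(blurred.shape)  # torch.Size([1, 1, 5, 5])

# Each interior pixel becomes the mean of its 3x3 neighborhood
assert torch.isclose(blurred[0, 0, 2, 2], img[0, 0, 1:4, 1:4].mean())
```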
1.3 Albumentations Augmentation Pipeline
import albumentations as A
from albumentations.pytorch import ToTensorV2
import cv2
train_transform = A.Compose([
    A.RandomResizedCrop(224, 224, scale=(0.7, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.ColorJitter(brightness=0.4, contrast=0.4,
                  saturation=0.4, hue=0.1, p=0.8),
    A.GaussianBlur(blur_limit=(3, 7), p=0.2),
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1,
                       rotate_limit=15, p=0.5),
    A.Normalize(mean=(0.485, 0.456, 0.406),
                std=(0.229, 0.224, 0.225)),
    ToTensorV2(),
])
image = cv2.cvtColor(cv2.imread("sample.jpg"), cv2.COLOR_BGR2RGB)
augmented = train_transform(image=image)["image"]
2. CNN Architecture Evolution
2.1 Major Architecture Timeline
| Year | Architecture | Key Contribution | ImageNet Top-1 |
|---|---|---|---|
| 1998 | LeNet-5 | First practical CNN, conv+pool structure | - |
| 2012 | AlexNet | ReLU, Dropout, GPU training | 63.3% |
| 2014 | VGGNet | Stacking deep 3x3 kernels | 74.4% |
| 2015 | ResNet-50 | Skip connections, residual learning | 76.0% |
| 2017 | DenseNet | Direct connections between all layers | 77.4% |
| 2019 | EfficientNet-B7 | Compound scaling | 84.4% |
| 2022 | ConvNeXt-L | Transformer design principles in CNN | 86.6% |
2.2 ResNet: The Residual Learning Revolution
ResNet's key innovation is the skip connection (residual connection): the input x is added directly to the block's output, which mitigates the vanishing gradient problem. The residual block output F(x) + x has gradient dF/dx + 1 with respect to x, so the identity term always supplies a direct gradient path; gradients may shrink through F, but they cannot vanish multiplicatively across the skip connection.
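The F(x) + x structure can be sketched as a minimal residual block (simplified relative to torchvision's implementation: same channel count in and out, no downsampling path):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Minimal sketch of a ResNet basic block: out = relu(F(x) + x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x                      # identity path
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + residual)  # F(x) + x

block = BasicResidualBlock(64)
y = block(torch.randn(1, 64, 8, 8))
print(y.shape)  # torch.Size([1, 64, 8, 8])
```

The residual add requires the output shape to match the input, which is why the real ResNet uses a 1x1 projection on the shortcut whenever a stage changes resolution or channel count.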
import torch
import torch.nn as nn
import torchvision.models as models
class ResNetClassifier(nn.Module):
    def __init__(self, num_classes: int, pretrained: bool = True):
        super().__init__()
        weights = models.ResNet50_Weights.IMAGENET1K_V2 if pretrained else None
        self.backbone = models.resnet50(weights=weights)
        in_features = self.backbone.fc.in_features
        self.backbone.fc = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(in_features, num_classes)
        )

    def forward(self, x):
        return self.backbone(x)

def train_one_epoch(model, loader, optimizer, criterion, device):
    model.train()
    total_loss, correct = 0.0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * images.size(0)
        correct += (outputs.argmax(1) == labels).sum().item()
    return total_loss / len(loader.dataset), correct / len(loader.dataset)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ResNetClassifier(num_classes=10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
criterion = nn.CrossEntropyLoss()
2.3 EfficientNet: Compound Scaling
EfficientNet introduces a compound coefficient that scales width, depth, and resolution together with a fixed ratio, achieving superior accuracy/efficiency trade-offs.
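The scaling rule can be made concrete with the base coefficients reported in the EfficientNet paper: depth scales as alpha^phi, width as beta^phi, and resolution as gamma^phi, with alpha * beta^2 * gamma^2 ≈ 2 so that FLOPs roughly double per step of the compound coefficient phi:

```python
# EfficientNet compound scaling: depth, width, and resolution grow together.
# Base coefficients from the paper: alpha=1.2, beta=1.1, gamma=1.15,
# chosen so that alpha * beta^2 * gamma^2 ~= 2 (FLOPs ~double per phi step).
alpha, beta, gamma = 1.2, 1.1, 1.15

for phi in range(4):
    depth_mult = alpha ** phi
    width_mult = beta ** phi
    res_mult = gamma ** phi
    print(f"phi={phi}: depth x{depth_mult:.2f}, "
          f"width x{width_mult:.2f}, resolution x{res_mult:.2f}")
```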
import timm
# EfficientNet-B4 fine-tuning
model = timm.create_model(
    "efficientnet_b4",
    pretrained=True,
    num_classes=100,
    drop_rate=0.3
)
# Freeze backbone (feature extraction mode)
for name, param in model.named_parameters():
    if "classifier" not in name:
        param.requires_grad = False
2.4 ConvNeXt: Modernizing CNN
ConvNeXt applies Transformer design principles (large kernels, LayerNorm, GELU, fewer activations) to a pure CNN, matching Swin Transformer performance with comparable speed.
model = timm.create_model(
    "convnext_large",
    pretrained=True,
    num_classes=1000
)
# Uses 7x7 depthwise conv, LayerNorm, GELU activation
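The core design can be seen in a simplified ConvNeXt block sketch (this omits the layer scale and stochastic depth used in the full model):

```python
import torch
import torch.nn as nn

class ConvNeXtBlockSketch(nn.Module):
    """Simplified ConvNeXt block: 7x7 depthwise conv -> LayerNorm ->
    pointwise expand -> GELU -> pointwise project, plus a residual add."""
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)           # applied channels-last
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # 1x1 conv as Linear
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                       # x: [N, C, H, W]
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)               # NCHW -> NHWC
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)               # NHWC -> NCHW
        return shortcut + x

y = ConvNeXtBlockSketch(96)(torch.randn(1, 96, 14, 14))
print(y.shape)  # torch.Size([1, 96, 14, 14])
```

Note how few activations and norms there are compared to a ResNet block: one GELU and one LayerNorm per block, mirroring a Transformer MLP layer.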
3. Object Detection: YOLO, DETR, Faster R-CNN
3.1 Detection Approach Comparison
| Approach | Model | Speed | Accuracy | Notes |
|---|---|---|---|---|
| Anchor-based 2-stage | Faster R-CNN | Slow | High | RPN + classifier separated |
| Anchor-based 1-stage | YOLOv5, RetinaNet | Fast | Med-High | Single-pass inference |
| Anchor-free 1-stage | YOLOv8, FCOS | Very fast | High | No anchor boxes; NMS still applied |
| NMS-free / Transformer | YOLOv10, DETR | Fast to Medium | High | End-to-end, no NMS post-processing |
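The NMS step that the last row eliminates is worth seeing concretely. A minimal sketch of IoU plus greedy NMS over toy boxes (illustrative only; production code uses `torchvision.ops.nms`):

```python
import torch

def box_iou(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pairwise IoU between two sets of (x1, y1, x2, y2) boxes."""
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = torch.max(a[:, None, :2], b[None, :, :2])  # intersection top-left
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])  # intersection bottom-right
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy NMS: keep the highest-scoring box, drop overlapping ones."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        ious = box_iou(boxes[i][None], boxes[order[1:]])[0]
        order = order[1:][ious <= iou_thresh]
    return keep

boxes = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.], [20., 20., 30., 30.]])
scores = torch.tensor([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2] — box 1 overlaps box 0 above the threshold
```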
3.2 YOLOv8 in Practice
from ultralytics import YOLO
# Load pretrained model
model = YOLO("yolov8n.pt") # nano: speed first
# model = YOLO("yolov8x.pt") # extra-large: accuracy first
# Single image inference
results = model("image.jpg", conf=0.25, iou=0.45)
for result in results:
    for box in result.boxes:
        cls = int(box.cls[0])
        conf = float(box.conf[0])
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"Class: {result.names[cls]}, Conf: {conf:.2f}")

# Fine-tune on custom dataset
model.train(
    data="custom.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    lr0=0.01,
    device=0
)
# Evaluation
metrics = model.val()
print(f"mAP50: {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")
3.3 YOLOv10: NMS-Free Detection
YOLOv10 eliminates the Non-Maximum Suppression (NMS) post-processing step through dual label assignment and consistency matching, enabling true end-to-end inference with reduced latency.
from ultralytics import YOLO
model = YOLO("yolov10n.pt")
# Returns final detections directly without NMS — lower latency
results = model.predict("video.mp4", stream=True)
for frame_result in results:
    print(frame_result.boxes)
3.4 DETR: Detection Transformer
DETR uses bipartite matching loss to predict the final set of boxes directly, eliminating both anchors and NMS.
import torch
from transformers import DetrImageProcessor, DetrForObjectDetection
from PIL import Image
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
img = Image.open("image.jpg")
inputs = processor(images=img, return_tensors="pt")
outputs = model(**inputs)
target_sizes = torch.tensor([img.size[::-1]])
results = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.9
)[0]
for score, label, box in zip(
    results["scores"], results["labels"], results["boxes"]
):
    name = model.config.id2label[label.item()]
    print(f"{name}: {score:.3f} at {[round(i, 2) for i in box.tolist()]}")
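The bipartite matching inside DETR's loss is the Hungarian algorithm over a query-to-ground-truth cost matrix. A toy sketch with `scipy.optimize.linear_sum_assignment` (the costs below are made up; in DETR they combine class probability with L1 and GIoU box terms):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: rows = model queries, columns = ground-truth objects.
cost = np.array([
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
    [0.8, 0.8, 0.3],
    [0.5, 0.6, 0.7],   # extra query — unmatched queries predict "no object"
])

query_idx, gt_idx = linear_sum_assignment(cost)
print(query_idx.tolist(), gt_idx.tolist())  # [0, 1, 2] [0, 1, 2]
```

Each ground-truth object is matched to exactly one query, so no duplicate predictions survive training and NMS becomes unnecessary.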
4. Segmentation: DeepLab, Mask R-CNN, SAM
4.1 Segmentation Task Types
- Semantic: Class label per pixel (car, road, sky...)
- Instance: Distinguishes individual objects of the same class (car1, car2...)
- Panoptic: Combines semantic and instance segmentation
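The difference between the first two representations is easy to show with toy NumPy arrays (hand-built masks, purely illustrative):

```python
import numpy as np

# Semantic map: one (H, W) array of class ids — both cars share class id 1
semantic = np.array([
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
])  # 0 = background, 1 = car

# Instance masks: one boolean (H, W) mask per object — car1 and car2 stay apart
car1 = np.zeros_like(semantic, dtype=bool); car1[:2, :2] = True
car2 = np.zeros_like(semantic, dtype=bool); car2[:2, 3] = True
instance_masks = np.stack([car1, car2])     # shape: (num_instances, H, W)

# Panoptic output would pair each pixel with (class_id, instance_id)
print(semantic.shape, instance_masks.shape)  # (3, 4) (2, 3, 4)
```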
4.2 SAM: Segment Anything Model
Meta's SAM accepts prompts (points, boxes, masks) and segments any object zero-shot. It uses a 3-module architecture: Image Encoder (ViT-H), Prompt Encoder, and Mask Decoder, trained on the SA-1B dataset (1 billion masks).
import torch
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")
predictor = SamPredictor(sam)
# Set image
image = cv2.cvtColor(cv2.imread("image.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)
# Point-prompted segmentation
input_point = np.array([[500, 375]]) # click location (x, y)
input_label = np.array([1]) # 1=foreground, 0=background
masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True  # return multiple candidate masks
)
best_mask = masks[scores.argmax()]
print(f"Mask shape: {best_mask.shape}, Score: {scores.max():.3f}")
# Box-prompted segmentation
input_box = np.array([100, 100, 400, 400])
masks_box, _, _ = predictor.predict(
    box=input_box[None, :],
    multimask_output=False
)
4.3 DeepLabV3+ Semantic Segmentation
DeepLabV3+ uses ASPP (Atrous Spatial Pyramid Pooling) to capture multi-scale context information.
import torch
import torchvision.models.segmentation as seg_models
model = seg_models.deeplabv3_resnet101(
    weights=seg_models.DeepLabV3_ResNet101_Weights.COCO_WITH_VOC_LABELS_V1
)
model.eval()
with torch.no_grad():
    output = model(tensor.unsqueeze(0))["out"]  # [1, num_classes, H, W]
pred = output.argmax(dim=1).squeeze()  # [H, W]
print(f"Segmentation map shape: {pred.shape}")
4.4 Mask R-CNN Instance Segmentation
from torchvision.models.detection import maskrcnn_resnet50_fpn_v2
from torchvision.models.detection import MaskRCNN_ResNet50_FPN_V2_Weights
from torchvision.utils import draw_segmentation_masks
import torchvision.transforms.functional as TF
from PIL import Image
weights = MaskRCNN_ResNet50_FPN_V2_Weights.DEFAULT
model = maskrcnn_resnet50_fpn_v2(weights=weights, box_score_thresh=0.75)
model.eval()
img = Image.open("people.jpg").convert("RGB")
inp = weights.transforms()(img).unsqueeze(0)
with torch.no_grad():
    predictions = model(inp)
pred = predictions[0]
masks = (pred["masks"] > 0.5).squeeze(1)
img_uint8 = (TF.to_tensor(img) * 255).byte()
result = draw_segmentation_masks(img_uint8, masks, alpha=0.5)
5. Vision Transformer: ViT, Swin, DINOv2
5.1 ViT Core Concept
ViT (Vision Transformer) splits an image into fixed-size patches (e.g., 16x16) and feeds each patch as a token into a standard Transformer encoder. Unlike CNNs, it learns global relationships between all patches without a locality inductive bias.
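The patch-to-token step is typically implemented as a strided convolution; a 224x224 image with 16x16 patches yields (224/16)^2 = 196 tokens. A minimal sketch (embed_dim 768 matches ViT-Base):

```python
import torch
import torch.nn as nn

# ViT patch embedding as a strided conv: each 16x16 patch becomes one token
embed_dim = 768
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)

img = torch.randn(1, 3, 224, 224)
tokens = patch_embed(img)                    # [1, 768, 14, 14]
tokens = tokens.flatten(2).transpose(1, 2)   # [1, 196, 768] — one token per patch
print(tokens.shape)
```

The real model then prepends a learnable [CLS] token and adds positional embeddings before the Transformer encoder.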
import timm
import torch
# ViT-Base/16 fine-tuning
model = timm.create_model(
    "vit_base_patch16_224",
    pretrained=True,
    num_classes=10,
    img_size=224
)
# ViT benefits from strong augmentation + AdamW + cosine schedule
data_config = timm.data.resolve_model_data_config(model)
transforms_train = timm.data.create_transform(**data_config, is_training=True)
transforms_val = timm.data.create_transform(**data_config, is_training=False)
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-3, weight_decay=0.05
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100
)
5.2 Swin Transformer: Hierarchical ViT
Swin Transformer introduces hierarchical feature maps and Shifted Window Attention, bringing CNN locality into ViT. Each stage halves resolution and doubles channels, making it FPN-compatible.
| Model | Resolution | Parameters | ImageNet Top-1 |
|---|---|---|---|
| ViT-B/16 | 224 | 86M | 81.8% |
| Swin-T | 224 | 28M | 81.3% |
| Swin-B | 224 | 88M | 83.5% |
| Swin-L | 384 | 197M | 87.3% |
| DINOv2-L | 518 | 307M | 86.3% |
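The window partition at the heart of Swin's attention can be sketched with pure tensor reshapes (a simplified version of the helper used in Swin implementations; the shifted variant additionally rolls the feature map before partitioning):

```python
import torch

def window_partition(x: torch.Tensor, window: int) -> torch.Tensor:
    """Split a [B, H, W, C] feature map into non-overlapping windows of
    shape [num_windows * B, window, window, C]; attention then runs
    inside each window instead of over all H*W tokens."""
    B, H, W, C = x.shape
    x = x.view(B, H // window, window, W // window, window, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window, window, C)

feat = torch.randn(1, 56, 56, 96)   # a Swin-T stage-1 feature map
windows = window_partition(feat, 7)
print(windows.shape)                # torch.Size([64, 7, 7, 96])
```

This is what keeps attention cost linear in image size: 64 independent 49-token attentions instead of one 3136-token attention.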
5.3 DINOv2: Self-Supervised Learning
DINOv2 is a general-purpose vision encoder trained on large-scale image data without labels, surpassing supervised ImageNet models via self-supervised learning.
import torch
import torchvision.transforms as T
# DINOv2 feature extractor (ready to use without fine-tuning)
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
dinov2.eval().cuda()
preprocess = T.Compose([
    T.Resize(518),
    T.CenterCrop(518),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])
with torch.no_grad():
    # [B, 3, 518, 518] -> [B, 1024] features
    features = dinov2(preprocess(img).unsqueeze(0).cuda())
print(f"Feature dim: {features.shape}")  # [1, 1024]
# DINOv2 achieves strong performance even with k-NN classifiers
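A k-NN classifier on frozen features is simple enough to sketch directly. The features below are random stand-ins for real DINOv2 outputs, so only the mechanics are meaningful:

```python
import torch
import torch.nn.functional as F

# k-NN classification on frozen features (random stand-ins here; in practice
# these would be DINOv2 embeddings of labeled support images and a query image)
train_feats = F.normalize(torch.randn(100, 1024), dim=1)  # 100 support images
train_labels = torch.randint(0, 5, (100,))                # 5 classes
query = F.normalize(torch.randn(1, 1024), dim=1)

sims = query @ train_feats.T                 # cosine similarity, [1, 100]
topk = sims.topk(k=5, dim=1).indices[0]      # 5 nearest neighbors
pred = train_labels[topk].mode().values      # majority vote
print(int(pred))                             # predicted class id
```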
6. Generative Models: GAN, Diffusion, ControlNet
6.1 Generative Model Comparison
| Model Type | Representative | Training | Strengths | Weaknesses |
|---|---|---|---|---|
| GAN | StyleGAN3 | Adversarial | Fast generation | Unstable training, mode collapse |
| VAE | VQ-VAE-2 | Recon + KL | Stable training | Blurry images |
| Diffusion | DDPM, DDIM | Denoising | Best quality | Slow generation |
| LDM | Stable Diffusion | Latent diffusion | Quality + speed | High GPU memory |
6.2 Stable Diffusion: Latent Diffusion Model
Stable Diffusion is a Latent Diffusion Model (LDM). The U-Net progressively removes noise in latent space.
- Forward process: Add Gaussian noise to the image over T steps
- Reverse process: U-Net predicts noise epsilon from noisy latent z_t, timestep t, and text embedding
- VAE decoder: Restores the final latent vector to pixel space
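The forward process has a convenient closed form: z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps, where abar_t is the cumulative product of (1 - beta) over the noise schedule. A sketch with a toy linear beta schedule (the schedule values are illustrative, not Stable Diffusion's exact ones):

```python
import torch

# Toy linear beta schedule and its cumulative alpha-bar product
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

z0 = torch.randn(1, 4, 64, 64)   # a "clean" SD-style latent
eps = torch.randn_like(z0)       # the noise the U-Net learns to predict

t = 500
zt = alpha_bar[t].sqrt() * z0 + (1 - alpha_bar[t]).sqrt() * eps
print(zt.shape, float(alpha_bar[t]))  # larger t -> smaller abar_t -> more noise
```

Training simply samples (z0, t, eps), forms z_t, and regresses the U-Net's output against eps.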
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
import torch

# SDXL 1.0 text-to-image generation (SDXL checkpoints require the XL pipeline)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config
)
pipe.to("cuda")
image = pipe(
    prompt="a photorealistic cat on a desk, 8k, studio lighting",
    negative_prompt="blurry, low quality, cartoon",
    num_inference_steps=25,
    guidance_scale=7.5,
    width=1024,
    height=1024
).images[0]
image.save("generated.png")
6.3 ControlNet: Structure-Conditioned Generation
ControlNet adds precise structural control to image generation using edge maps, depth maps, poses, and other spatial conditions.
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image
import torch
import cv2
import numpy as np
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Generate Canny edge map from a reference image
ref_image = np.array(load_image("reference.jpg"))
edges = cv2.Canny(ref_image, 100, 200)
edges_rgb = np.stack([edges] * 3, axis=-1)

result = pipe(
    "a beautiful landscape painting",
    image=Image.fromarray(edges_rgb),
    num_inference_steps=20
).images[0]
result.save("controlnet_output.png")
7. Production Pipeline: DataLoader to TensorRT
7.1 Custom Dataset and DataLoader
from torch.utils.data import Dataset, DataLoader
from pathlib import Path
import albumentations as A
from albumentations.pytorch import ToTensorV2
import cv2
class CustomDataset(Dataset):
    def __init__(self, root: str, split: str = "train"):
        self.root = Path(root)
        self.image_paths = list(
            (self.root / split / "images").glob("*.jpg")
        )
        self.label_paths = [
            self.root / split / "labels" / p.with_suffix(".txt").name
            for p in self.image_paths
        ]
        self.transform = self._get_transforms(split)

    def _get_transforms(self, split: str):
        if split == "train":
            return A.Compose([
                A.RandomResizedCrop(224, 224, scale=(0.7, 1.0)),
                A.HorizontalFlip(p=0.5),
                A.ColorJitter(brightness=0.4, contrast=0.4,
                              saturation=0.4, hue=0.1, p=0.8),
                A.GaussianBlur(blur_limit=(3, 7), p=0.2),
                A.Normalize(mean=(0.485, 0.456, 0.406),
                            std=(0.229, 0.224, 0.225)),
                ToTensorV2(),
            ])
        return A.Compose([
            A.Resize(256, 256),
            A.CenterCrop(224, 224),
            A.Normalize(mean=(0.485, 0.456, 0.406),
                        std=(0.229, 0.224, 0.225)),
            ToTensorV2(),
        ])

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = cv2.cvtColor(
            cv2.imread(str(self.image_paths[idx])), cv2.COLOR_BGR2RGB
        )
        label = int(self.label_paths[idx].read_text().strip())
        augmented = self.transform(image=image)
        return augmented["image"], label

train_dataset = CustomDataset("data/", split="train")
train_loader = DataLoader(
    train_dataset, batch_size=32, shuffle=True,
    num_workers=4, pin_memory=True, persistent_workers=True
)
7.2 ONNX Export
import torch
import torch.onnx
model.eval()
dummy_input = torch.randn(1, 3, 224, 224, device="cuda")
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    export_params=True,
    opset_version=17,
    do_constant_folding=True,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"}
    }
)
print("ONNX export complete: model.onnx")
# Verify with ONNX Runtime
import onnxruntime as ort
import numpy as np
sess = ort.InferenceSession(
    "model.onnx", providers=["CUDAExecutionProvider"]
)
input_name = sess.get_inputs()[0].name
result = sess.run(None, {input_name: dummy_input.cpu().numpy()})
print(f"ONNX inference result shape: {result[0].shape}")
7.3 TensorRT Optimization
# TensorRT conversion using trtexec
trtexec \
    --onnx=model.onnx \
    --saveEngine=model_fp16.engine \
    --fp16 \
    --workspace=4096 \
    --optShapes=input:8x3x224x224
import tensorrt as trt
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
def trt_inference(engine_path: str, input_data: np.ndarray) -> np.ndarray:
    logger = trt.Logger(trt.Logger.WARNING)
    with open(engine_path, "rb") as f:
        engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
    context = engine.create_execution_context()
    input_mem = cuda.mem_alloc(input_data.nbytes)
    output = np.empty((input_data.shape[0], 1000), dtype=np.float32)
    output_mem = cuda.mem_alloc(output.nbytes)
    cuda.memcpy_htod(input_mem, input_data)
    context.execute_v2([int(input_mem), int(output_mem)])
    cuda.memcpy_dtoh(output, output_mem)
    return output
Quiz: Deep Understanding Check
Q1. How do ResNet skip connections solve the vanishing gradient problem?
Answer: During backpropagation, gradients flow directly through the skip connection, preventing them from vanishing in deep layers.
Explanation: In a plain deep network, backpropagated gradients shrink multiplicatively through each layer. ResNet's residual block F(x) + x differentiates to dF/dx + 1, so the identity term always contributes a direct, unattenuated gradient path even when dF/dx is tiny. This enables stable training even beyond 100 layers.
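The dF/dx + 1 claim can be verified directly with autograd on a scalar toy function (the branch f is an arbitrary stand-in for a residual branch):

```python
import torch

# Gradient through a residual connection: for y = f(x) + x,
# dy/dx = f'(x) + 1 — the identity path always contributes 1.
x = torch.tensor(2.0, requires_grad=True)

def f(x):            # a toy residual branch
    return 0.1 * x ** 2

y = f(x) + x         # residual form
y.backward()
print(x.grad)        # f'(2) + 1 = 0.4 + 1 = 1.4
```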
Q2. Why is YOLO better suited for real-time inference than Faster R-CNN?
Answer: YOLO completes detection in a single forward pass, while Faster R-CNN requires two stages: a Region Proposal Network and a separate classifier.
Explanation: Faster R-CNN follows a pipeline of (1) region proposals via RPN, (2) RoI Pooling, and (3) classification and bounding-box regression. YOLO divides the image into a grid and predicts all boxes and classes simultaneously in one CNN pass. The nano variant YOLOv8n runs at well over 100 FPS on a modern GPU, making it suitable for real-time applications.
Q3. Why does Vision Transformer outperform CNN on large-scale data?
Answer: ViT's Self-Attention learns global relationships between all patches without inductive bias, discovering optimal representations directly from data.
Explanation: CNNs encode locality and translation equivariance as inductive biases. These biases help with small datasets but limit representational capacity at scale. Given sufficient data (e.g., JFT-300M), ViT freely learns global patterns without these constraints, surpassing CNNs in accuracy.
Q4. What is the U-Net's role in Stable Diffusion's denoising diffusion process?
Answer: The U-Net predicts and removes the noise added to the latent vector at each timestep, and integrates the text condition (CLIP embedding) via cross-attention.
Explanation: In the forward process, Gaussian noise is added to the image latent over T steps. In the reverse process, the U-Net receives the noisy latent z_t, timestep t, and text embedding, and predicts the noise component epsilon. The VAE decoder then reconstructs the final image from the denoised latent.
Q5. How does SAM's prompt-based segmentation differ from conventional methods?
Answer: SAM segments arbitrary objects zero-shot from various prompts (points, boxes, masks) without task-specific training.
Explanation: Traditional segmentation models (DeepLab, Mask R-CNN) are trained with supervision on a fixed set of classes. SAM is a general-purpose model trained on the SA-1B dataset (1 billion masks), which segments any region specified by the user regardless of class. Its 3-module architecture — Image Encoder (ViT-H), Prompt Encoder, and Mask Decoder — separates encoding from flexible prompt conditioning.
Wrap-Up: Learning Roadmap
Computer vision evolves rapidly. Here is a recommended learning path:
- Foundations: OpenCV, NumPy image manipulation → torchvision hands-on
- Classification: ResNet/EfficientNet fine-tuning → custom dataset
- Detection: YOLOv8 experiments → custom training → ONNX/TensorRT deployment
- Segmentation: SAM exploration → Mask R-CNN / DeepLabV3+ custom training
- Advanced: ViT/DINOv2 feature extraction → Stable Diffusion fine-tuning
The fastest path to mastery is applying each concept in Kaggle competitions or real-world projects. Welcome to the world of computer vision!