Computer Vision Complete Guide: From CNN to ViT, YOLO, and Stable Diffusion
- Introduction
- 1. Image Fundamentals: Pixels, Channels, Convolution
- 2. CNN Architecture Evolution
- 3. Object Detection: YOLO, DETR, Faster R-CNN
- 4. Segmentation: DeepLab, Mask R-CNN, SAM
- 5. Vision Transformer: ViT, Swin, DINOv2
- 6. Generative Models: GAN, Diffusion, ControlNet
- 7. Production Pipeline: DataLoader to TensorRT
- Quiz: Deep Understanding Check
- Wrap-Up: Learning Roadmap
Introduction
Computer Vision is the field of AI that enables machines to understand images and video. From smartphone face unlock to autonomous vehicle obstacle detection to medical imaging assistance, computer vision powers them all.
This guide takes you from pixel-level fundamentals through Vision Transformers and Stable Diffusion, with PyTorch code examples throughout.
1. Image Fundamentals: Pixels, Channels, Convolution
1.1 Digital Image Structure
A digital image is a 2D grid of pixels.
- Grayscale image: 2D array of shape H x W, pixel values 0-255
- Color image (RGB): 3D tensor of shape H x W x 3, each channel 0-255
- Resolution: image dimensions (e.g., 1920x1080) and pixel density (DPI)
import torch
import torchvision.transforms as T
from PIL import Image
# Load image and convert to a tensor
img = Image.open("sample.jpg").convert("RGB")
transform = T.Compose([
T.Resize((224, 224)),
T.ToTensor(), # [0,255] -> [0.0,1.0], HWC -> CHW
T.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
])
tensor = transform(img) # shape: [3, 224, 224]
print(f"Shape: {tensor.shape}, dtype: {tensor.dtype}")
1.2 Convolution and Common Filters
Convolution slides a small kernel (filter) across the entire image to extract features.
import torch
import torch.nn.functional as F
# 3x3 Sobel edge detection kernel (horizontal gradient)
kernel = torch.tensor([
[-1, 0, 1],
[-2, 0, 2],
[-1, 0, 1]
], dtype=torch.float32).unsqueeze(0).unsqueeze(0) # [1, 1, 3, 3]
# Apply the convolution to a grayscale image
gray_tensor = T.ToTensor()(T.Grayscale()(img)).unsqueeze(0)
edges = F.conv2d(gray_tensor, kernel, padding=1)
| Kernel Type | Purpose | Use Case |
|---|---|---|
| Sobel | Edge detection (horizontal/vertical) | Lane detection in self-driving |
| Gaussian | Blurring, noise removal | Image pre-processing |
| Laplacian | Edge emphasis | Sharpening |
| Average | Mean blur | Downsampling |
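For instance, the Gaussian kernel from the table can be applied exactly like the Sobel kernel above. This is a minimal sketch using the standard 3x3 binomial approximation of a Gaussian; the dummy input is illustrative.

```python
import torch
import torch.nn.functional as F

# 3x3 Gaussian blur kernel (binomial approximation), normalized to sum to 1
gauss = torch.tensor([
    [1., 2., 1.],
    [2., 4., 2.],
    [1., 2., 1.],
]) / 16.0
gauss = gauss.reshape(1, 1, 3, 3)  # [out_ch, in_ch, kH, kW]

# Apply to a dummy single-channel image; because the weights sum to 1,
# a flat image is unchanged away from the zero-padded border
img = torch.ones(1, 1, 8, 8)
blurred = F.conv2d(img, gauss, padding=1)
print(blurred.shape)  # torch.Size([1, 1, 8, 8])
```

Normalizing the kernel to sum to 1 preserves overall brightness, which is why blurring does not darken or lighten the image.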
1.3 Albumentations Augmentation Pipeline
import albumentations as A
from albumentations.pytorch import ToTensorV2
import cv2
train_transform = A.Compose([
A.RandomResizedCrop(224, 224, scale=(0.7, 1.0)),
A.HorizontalFlip(p=0.5),
A.ColorJitter(brightness=0.4, contrast=0.4,
saturation=0.4, hue=0.1, p=0.8),
A.GaussianBlur(blur_limit=(3, 7), p=0.2),
A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1,
rotate_limit=15, p=0.5),
A.Normalize(mean=(0.485, 0.456, 0.406),
std=(0.229, 0.224, 0.225)),
ToTensorV2(),
])
image = cv2.cvtColor(cv2.imread("sample.jpg"), cv2.COLOR_BGR2RGB)
augmented = train_transform(image=image)["image"]
2. CNN Architecture Evolution
2.1 Major Architecture Timeline
| Year | Architecture | Key Contribution | ImageNet Top-1 |
|---|---|---|---|
| 1998 | LeNet-5 | First practical CNN, conv + pooling structure | - |
| 2012 | AlexNet | ReLU, Dropout, GPU training | 63.3% |
| 2014 | VGGNet | Stacking deep 3x3 kernels | 74.4% |
| 2015 | ResNet-50 | Skip connections, residual learning | 76.0% |
| 2017 | DenseNet | Direct connections between all layers | 77.4% |
| 2019 | EfficientNet-B7 | Compound scaling | 84.4% |
| 2022 | ConvNeXt-L | Transformer design principles in a CNN | 86.6% |
2.2 ResNet: The Residual Learning Revolution
ResNet's key innovation is the **skip connection (residual connection)**. Adding the input x directly to the output mitigates the vanishing gradient problem: the residual block computes F(x) + x, whose derivative is dF/dx + 1, so the identity path always contributes a gradient of 1 and the signal cannot vanish entirely.
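The idea is easiest to see in a minimal residual block. This is a sketch of the pattern, not the exact torchvision implementation (which also handles stride and channel changes with a projection shortcut):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Minimal residual block: output = relu(F(x) + x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Skip connection: gradients flow unchanged through "+ x"
        return self.relu(out + x)

block = BasicResidualBlock(64)
x = torch.randn(2, 64, 32, 32)
out = block(x)
print(out.shape)  # torch.Size([2, 64, 32, 32])
```

Note that the addition requires the output of F(x) to have the same shape as x; when shapes differ, ResNet uses a 1x1 convolution on the shortcut.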
import torch
import torch.nn as nn
import torchvision.models as models
class ResNetClassifier(nn.Module):
def __init__(self, num_classes: int, pretrained: bool = True):
super().__init__()
weights = models.ResNet50_Weights.IMAGENET1K_V2 if pretrained else None
self.backbone = models.resnet50(weights=weights)
in_features = self.backbone.fc.in_features
self.backbone.fc = nn.Sequential(
nn.Dropout(0.3),
nn.Linear(in_features, num_classes)
)
def forward(self, x):
return self.backbone(x)
def train_one_epoch(model, loader, optimizer, criterion, device):
model.train()
total_loss, correct = 0.0, 0
for images, labels in loader:
images, labels = images.to(device), labels.to(device)
optimizer.zero_grad()
outputs = model(images)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
total_loss += loss.item() * images.size(0)
correct += (outputs.argmax(1) == labels).sum().item()
return total_loss / len(loader.dataset), correct / len(loader.dataset)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ResNetClassifier(num_classes=10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
criterion = nn.CrossEntropyLoss()
2.3 EfficientNet: Compound Scaling
EfficientNet introduces a compound coefficient that scales **width, depth, and resolution** together in a fixed ratio.
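Numerically, the rule fixes base coefficients alpha, beta, gamma (found by grid search in the EfficientNet paper: roughly alpha=1.2, beta=1.1, gamma=1.15, chosen so that alpha * beta^2 * gamma^2 is close to 2) and raises each to the compound coefficient phi. A sketch of the arithmetic:

```python
# EfficientNet compound scaling:
# depth x alpha**phi, width x beta**phi, resolution x gamma**phi
alpha, beta, gamma = 1.2, 1.1, 1.15

def scale_factors(phi: int) -> tuple[float, float, float]:
    """Multipliers for depth, width, and resolution at compound coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

for phi in range(4):
    d, w, r = scale_factors(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")

# Each unit increase of phi roughly doubles FLOPs, since FLOPs scale with
# depth * width**2 * resolution**2:
print(round(alpha * beta**2 * gamma**2, 2))  # 1.92
```

Scaling all three dimensions together avoids the diminishing returns of scaling only depth (as VGG did) or only width.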
import timm
# Fine-tune EfficientNet-B4
model = timm.create_model(
"efficientnet_b4",
pretrained=True,
num_classes=100,
drop_rate=0.3
)
# Freeze the backbone (feature extraction mode)
for name, param in model.named_parameters():
if "classifier" not in name:
param.requires_grad = False
2.4 ConvNeXt: Modernizing the CNN
ConvNeXt applies Transformer design principles (large kernels, LayerNorm, GELU, fewer activations) to a pure CNN, matching Swin Transformer performance.
model = timm.create_model(
"convnext_large",
pretrained=True,
num_classes=1000
)
# Uses 7x7 depthwise conv, LayerNorm, and GELU activation
3. Object Detection: YOLO, DETR, Faster R-CNN
3.1 Detection Approach Comparison
| Approach | Model | Speed | Accuracy | Notes |
|---|---|---|---|---|
| Anchor-based 2-stage | Faster R-CNN | Slow | High | Separate RPN + classifier |
| Anchor-based 1-stage | YOLOv8 | Fast | Med-High | Single-pass inference |
| Anchor-free 1-stage | YOLOv10, FCOS | Very fast | High | NMS-free, no anchors |
| Transformer | DETR | Medium | High | End-to-end, relational modeling |
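The NMS step that the anchor-based pipelines rely on is driven by Intersection-over-Union between boxes. A minimal sketch of the IoU computation, with boxes in (x1, y1, x2, y2) format:

```python
def iou_xyxy(a, b):
    """Intersection-over-Union of two boxes in (x1, y1, x2, y2) format."""
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Two 10x10 boxes sharing half their area
print(iou_xyxy((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333...
```

NMS keeps the highest-confidence box and suppresses any other box whose IoU with it exceeds a threshold (commonly 0.45 to 0.5).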
3.2 YOLOv8 in Practice
from ultralytics import YOLO
# Load a pretrained model
model = YOLO("yolov8n.pt") # nano: speed first
# model = YOLO("yolov8x.pt") # extra-large: accuracy first
# Single-image inference
results = model("image.jpg", conf=0.25, iou=0.45)
for result in results:
for box in result.boxes:
cls = int(box.cls[0])
conf = float(box.conf[0])
x1, y1, x2, y2 = box.xyxy[0].tolist()
print(f"Class: {result.names[cls]}, Conf: {conf:.2f}")
# Fine-tune on a custom dataset
model.train(
data="custom.yaml",
epochs=100,
imgsz=640,
batch=16,
lr0=0.01,
device=0
)
# Evaluation
metrics = model.val()
print(f"mAP50: {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")
3.3 YOLOv10: NMS-Free Detection
YOLOv10 removes the Non-Maximum Suppression (NMS) post-processing step and achieves end-to-end training through **dual label assignment**: consistency matching lets the model exploit one-to-one and one-to-many predictions at the same time.
from ultralytics import YOLO
model = YOLO("yolov10n.pt")
# Returns final detections directly without NMS, reducing latency
results = model.predict("video.mp4", stream=True)
for frame_result in results:
print(frame_result.boxes)
3.4 DETR: Detection Transformer
DETR uses a bipartite matching loss to predict the final set of boxes directly, without anchors or NMS.
import torch
from transformers import DetrImageProcessor, DetrForObjectDetection
from PIL import Image
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
img = Image.open("image.jpg")
inputs = processor(images=img, return_tensors="pt")
outputs = model(**inputs)
target_sizes = torch.tensor([img.size[::-1]])
results = processor.post_process_object_detection(
outputs, target_sizes=target_sizes, threshold=0.9
)[0]
for score, label, box in zip(
results["scores"], results["labels"], results["boxes"]
):
name = model.config.id2label[label.item()]
print(f"{name}: {score:.3f} at {[round(i,2) for i in box.tolist()]}")
4. Segmentation: DeepLab, Mask R-CNN, SAM
4.1 Segmentation Task Types
- Semantic: a class label per pixel (car, road, sky...)
- Instance: distinguishes individual objects within a class (car1, car2...)
- Panoptic: semantic + instance combined
4.2 SAM: Segment Anything Model
Meta's SAM takes **prompts (points, boxes, masks)** and segments arbitrary objects. It is a general-purpose segmentation model with a 3-module architecture of Image Encoder (ViT-H), Prompt Encoder, and Mask Decoder, trained on the SA-1B dataset (1.1 billion masks).
import torch
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")
predictor = SamPredictor(sam)
# Set the image
image = cv2.cvtColor(cv2.imread("image.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)
# Point-prompted segmentation
input_point = np.array([[500, 375]]) # click location (x, y)
input_label = np.array([1]) # 1 = foreground, 0 = background
masks, scores, logits = predictor.predict(
point_coords=input_point,
point_labels=input_label,
multimask_output=True # return multiple candidate masks
)
best_mask = masks[scores.argmax()]
print(f"Mask shape: {best_mask.shape}, Score: {scores.max():.3f}")
# Box prompt
input_box = np.array([100, 100, 400, 400])
masks_box, _, _ = predictor.predict(
box=input_box[None, :],
multimask_output=False
)
4.3 DeepLabV3+ Semantic Segmentation
DeepLabV3+ uses ASPP (Atrous Spatial Pyramid Pooling) to capture context at multiple scales.
import torch
import torchvision.models.segmentation as seg_models
model = seg_models.deeplabv3_resnet101(
weights=seg_models.DeepLabV3_ResNet101_Weights.COCO_WITH_VOC_LABELS_V1
)
model.eval()
with torch.no_grad():
output = model(tensor.unsqueeze(0))["out"] # [1, num_classes, H, W]
pred = output.argmax(dim=1).squeeze() # [H, W]
print(f"Segmentation map shape: {pred.shape}")
4.4 Mask R-CNN Instance Segmentation
from torchvision.models.detection import maskrcnn_resnet50_fpn_v2
from torchvision.models.detection import MaskRCNN_ResNet50_FPN_V2_Weights
from torchvision.utils import draw_segmentation_masks
import torchvision.transforms.functional as TF
from PIL import Image
weights = MaskRCNN_ResNet50_FPN_V2_Weights.DEFAULT
model = maskrcnn_resnet50_fpn_v2(weights=weights, box_score_thresh=0.75)
model.eval()
img = Image.open("people.jpg").convert("RGB")
inp = weights.transforms()(img).unsqueeze(0)
with torch.no_grad():
predictions = model(inp)
pred = predictions[0]
masks = (pred["masks"] > 0.5).squeeze(1)
img_uint8 = (TF.to_tensor(img) * 255).byte()
result = draw_segmentation_masks(img_uint8, masks, alpha=0.5)
5. Vision Transformer: ViT, Swin, DINOv2
5.1 ViT Core Concept
ViT (Vision Transformer) splits an image into **fixed-size patches (16x16)** and feeds each patch into a Transformer as a token. Unlike a CNN, it learns global relationships without a locality inductive bias.
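Patch embedding itself is just a strided convolution followed by flattening. A minimal sketch with ViT-Base/16 dimensions (patch size 16, embedding dimension 768):

```python
import torch
import torch.nn as nn

# ViT patch embedding: a Conv2d with kernel = stride = patch size
# turns [B, 3, 224, 224] into 14x14 = 196 patch tokens of dim 768
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)

img = torch.randn(1, 3, 224, 224)
tokens = patch_embed(img)                   # [1, 768, 14, 14]
tokens = tokens.flatten(2).transpose(1, 2)  # [1, 196, 768]

# Prepend a learnable [CLS] token, as in the original ViT
cls_token = nn.Parameter(torch.zeros(1, 1, 768))
sequence = torch.cat([cls_token.expand(1, -1, -1), tokens], dim=1)
print(sequence.shape)  # torch.Size([1, 197, 768])
```

The resulting 197-token sequence (plus position embeddings) is what the Transformer encoder consumes; the final [CLS] representation is typically used for classification.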
import timm
import torch
# Fine-tune ViT-Base/16
model = timm.create_model(
"vit_base_patch16_224",
pretrained=True,
num_classes=10,
img_size=224
)
# ViT benefits from stronger augmentation + AdamW + a cosine schedule
data_config = timm.data.resolve_model_data_config(model)
transforms_train = timm.data.create_transform(**data_config, is_training=True)
transforms_val = timm.data.create_transform(**data_config, is_training=False)
optimizer = torch.optim.AdamW(
model.parameters(), lr=1e-3, weight_decay=0.05
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
optimizer, T_max=100
)
5.2 Swin Transformer: Hierarchical ViT
Swin Transformer brings CNN-style locality to ViT through hierarchical feature maps and Shifted Window Attention. Each stage halves the resolution and increases the channel count, making it compatible with FPNs.
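The window mechanism starts from plain window partitioning: attention is computed only within small non-overlapping windows, which keeps cost linear in image size. A sketch of the partition step (feature-map sizes follow the usual Swin-T stage-1 assumptions):

```python
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Split a [B, H, W, C] feature map into non-overlapping windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size,
               W // window_size, window_size, C)
    # Group the window grid dims together, then flatten to one window per row
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)
    return windows

feat = torch.randn(1, 56, 56, 96)  # assumed Swin-T stage-1 feature map
wins = window_partition(feat, window_size=7)
print(wins.shape)  # torch.Size([64, 7, 7, 96]): an 8x8 grid of 7x7 windows
```

Shifting the window grid by half a window on alternating layers is what lets information flow between neighboring windows.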
| 모델 | 해상도 | 파라미터 | ImageNet Top-1 |
|---|---|---|---|
| ViT-B/16 | 224 | 86M | 81.8% |
| Swin-T | 224 | 28M | 81.3% |
| Swin-B | 224 | 88M | 83.5% |
| Swin-L | 384 | 197M | 87.3% |
| DINOv2-L | 518 | 307M | 86.3% |
5.3 DINOv2: Self-Supervised Learning at Scale
DINOv2 is a general-purpose vision encoder trained on large-scale image data without labels; its self-supervised learning surpasses supervised ImageNet models.
import torch
# DINOv2 feature extractor (usable without any training)
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
dinov2.eval().cuda()
# Handle images of arbitrary size
import torchvision.transforms as T
preprocess = T.Compose([
T.Resize(518),
T.CenterCrop(518),
T.ToTensor(),
T.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]),
])
with torch.no_grad():
# [B, 3, 518, 518] -> [B, 1024] features
features = dinov2(preprocess(img).unsqueeze(0).cuda())
print(f"Feature dim: {features.shape}") # [1, 1024]
# DINOv2 is strong even with a simple k-NN classifier
6. Generative Models: GAN, Diffusion, ControlNet
6.1 Generative Model Comparison
| Model Family | Representative | Training | Strengths | Weaknesses |
|---|---|---|---|---|
| GAN | StyleGAN3 | Adversarial | Fast generation | Unstable training, mode collapse |
| VAE | VQ-VAE-2 | Reconstruction + KL | Stable training | Blurry images |
| Diffusion | DDPM, DDIM | Denoising | Best quality | Slow generation |
| LDM | Stable Diffusion | Latent-space diffusion | Quality/speed balance | High GPU memory |
6.2 Stable Diffusion: Latent Diffusion Model
Stable Diffusion is a **Latent Diffusion Model (LDM)**: a U-Net removes noise step by step in latent space.
- Forward process: Gaussian noise is added to the image over T steps
- Reverse process: the U-Net takes the noisy latent z_t, timestep t, and a text embedding, and predicts the noise epsilon
- VAE decoder: restores the final latent vector to pixel space
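The forward process above has a convenient closed form: x_t can be sampled directly from x_0 in one shot, without iterating through the intermediate steps. A sketch using a linear beta schedule (schedule values and variable names are illustrative, following DDPM conventions):

```python
import torch

# Closed-form forward diffusion:
# x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)
T_steps = 1000
betas = torch.linspace(1e-4, 0.02, T_steps)         # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # abar_t

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample a noised version of x0 at timestep t in a single step."""
    abar = alphas_cumprod[t]
    eps = torch.randn_like(x0)
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps

x0 = torch.randn(1, 4, 64, 64)  # a latent-sized tensor, for illustration
x_mid = q_sample(x0, t=500)
x_end = q_sample(x0, t=T_steps - 1)
# By the final step, abar_t is near zero: almost pure noise remains
print(float(alphas_cumprod[-1]))
```

Training then amounts to asking the U-Net to predict eps from (x_t, t, text embedding); this closed form is what makes sampling random timesteps per training batch cheap.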
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
import torch
# SDXL 1.0 text-to-image generation (SDXL uses the dedicated XL pipeline class)
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
use_safetensors=True,
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
pipe.scheduler.config
)
pipe.to("cuda")
image = pipe(
prompt="a photorealistic cat on a desk, 8k, studio lighting",
negative_prompt="blurry, low quality, cartoon",
num_inference_steps=25,
guidance_scale=7.5,
width=1024,
height=1024
).images[0]
image.save("generated.png")
6.3 ControlNet: Structure-Conditioned Generation
ControlNet provides fine-grained control over image generation using structural conditions such as edge maps, depth maps, and poses.
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image
import cv2
import numpy as np
from PIL import Image
controlnet = ControlNetModel.from_pretrained(
"lllyasviel/sd-controlnet-canny",
torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
controlnet=controlnet,
torch_dtype=torch.float16
).to("cuda")
# Generate a Canny edge map
ref_image = np.array(load_image("reference.jpg"))
edges = cv2.Canny(ref_image, 100, 200)
edges_rgb = np.stack([edges] * 3, axis=-1)
result = pipe(
"a beautiful landscape painting",
image=Image.fromarray(edges_rgb),
num_inference_steps=20
).images[0]
result.save("controlnet_output.png")
7. Production Pipeline: DataLoader to TensorRT
7.1 Custom Dataset and DataLoader
from torch.utils.data import Dataset, DataLoader
from pathlib import Path
import albumentations as A
from albumentations.pytorch import ToTensorV2
import cv2
class CustomDataset(Dataset):
def __init__(self, root: str, split: str = "train"):
self.root = Path(root)
self.image_paths = list(
(self.root / split / "images").glob("*.jpg")
)
self.label_paths = [
self.root / split / "labels" / p.with_suffix(".txt").name
for p in self.image_paths
]
self.transform = self._get_transforms(split)
def _get_transforms(self, split: str):
if split == "train":
return A.Compose([
A.RandomResizedCrop(224, 224, scale=(0.7, 1.0)),
A.HorizontalFlip(p=0.5),
A.ColorJitter(brightness=0.4, contrast=0.4,
saturation=0.4, hue=0.1, p=0.8),
A.GaussianBlur(blur_limit=(3, 7), p=0.2),
A.Normalize(mean=(0.485, 0.456, 0.406),
std=(0.229, 0.224, 0.225)),
ToTensorV2(),
])
return A.Compose([
A.Resize(256, 256),
A.CenterCrop(224, 224),
A.Normalize(mean=(0.485, 0.456, 0.406),
std=(0.229, 0.224, 0.225)),
ToTensorV2(),
])
def __len__(self):
return len(self.image_paths)
def __getitem__(self, idx):
image = cv2.cvtColor(
cv2.imread(str(self.image_paths[idx])), cv2.COLOR_BGR2RGB
)
label = int(self.label_paths[idx].read_text().strip())
augmented = self.transform(image=image)
return augmented["image"], label
train_dataset = CustomDataset("data/", split="train")
train_loader = DataLoader(
train_dataset, batch_size=32, shuffle=True,
num_workers=4, pin_memory=True, persistent_workers=True
)
7.2 ONNX Export
import torch
import torch.onnx
model.eval()
dummy_input = torch.randn(1, 3, 224, 224, device="cuda")
torch.onnx.export(
model,
dummy_input,
"model.onnx",
export_params=True,
opset_version=17,
do_constant_folding=True,
input_names=["input"],
output_names=["output"],
dynamic_axes={
"input": {0: "batch_size"},
"output": {0: "batch_size"}
}
)
print("ONNX export complete: model.onnx")
# Validate with ONNX Runtime
import onnxruntime as ort
import numpy as np
sess = ort.InferenceSession(
"model.onnx", providers=["CUDAExecutionProvider"]
)
input_name = sess.get_inputs()[0].name
result = sess.run(None, {input_name: dummy_input.cpu().numpy()})
print(f"ONNX inference output shape: {result[0].shape}")
7.3 TensorRT Optimization
# Convert to TensorRT (using trtexec)
trtexec \
--onnx=model.onnx \
--saveEngine=model_fp16.engine \
--fp16 \
--workspace=4096 \
--optShapes=input:8x3x224x224
import tensorrt as trt
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
def trt_inference(engine_path: str, input_data: np.ndarray) -> np.ndarray:
logger = trt.Logger(trt.Logger.WARNING)
with open(engine_path, "rb") as f:
engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
input_mem = cuda.mem_alloc(input_data.nbytes)
output = np.empty((input_data.shape[0], 1000), dtype=np.float32)
output_mem = cuda.mem_alloc(output.nbytes)
cuda.memcpy_htod(input_mem, input_data)
context.execute_v2([int(input_mem), int(output_mem)])
cuda.memcpy_dtoh(output, output_mem)
return output
Quiz: Deep Understanding Check
Q1. How do ResNet's skip connections address vanishing gradients?
Answer: During backpropagation the gradient flows directly through the skip connection, so it reaches deep layers without vanishing.
Explanation: In a plain deep network, the backpropagated gradient shrinks multiplicatively as it passes through layers. The residual block F(x) + x has derivative dF/dx + 1, so the identity path always carries a gradient of 1. This is what makes networks with 100+ layers trainable.
Q2. Why is YOLO better suited to real-time inference than Faster R-CNN?
Answer: YOLO completes detection in a single forward pass, while Faster R-CNN needs two stages: a Region Proposal Network plus a classifier.
Explanation: Faster R-CNN is a multi-stage pipeline: (1) the RPN generates candidate regions, (2) RoI Pooling, (3) classification and regression. YOLO divides the image into a grid and predicts all boxes and classes simultaneously in one CNN pass. YOLOv8n can exceed 80 FPS on an A100 GPU.
Q3. Why do Vision Transformers outperform CNNs on large-scale data?
Answer: ViT's self-attention learns global relationships between all patches, learning optimal representations directly from data without inductive bias.
Explanation: CNNs build in the inductive biases of locality and translation equivariance. With limited data these biases help, but on large-scale data (e.g., JFT-300M) they limit expressiveness. ViT learns global patterns freely without these biases, so with enough data it overtakes CNNs.
Q4. What is the U-Net's role in Stable Diffusion's denoising process?
Answer: At each timestep the U-Net predicts and removes the noise added to the latent vector, integrating the text condition (CLIP embeddings) via cross-attention.
Explanation: The forward process adds Gaussian noise to the image over T steps. In the reverse process, the U-Net takes the noisy latent z_t, timestep t, and text embedding as input and predicts the noise component epsilon. The VAE decoder then restores the final latent vector to pixel space.
Q5. How does SAM's prompt-based segmentation differ from prior approaches?
Answer: SAM takes diverse prompts (points, boxes, masks) and segments arbitrary objects zero-shot, without task-specific training.
Explanation: Conventional segmentation models (DeepLab, Mask R-CNN) are supervised on a fixed set of classes. SAM is a general-purpose model trained on the SA-1B dataset (1.1 billion masks) that isolates whatever region the user specifies, regardless of class. It consists of three modules: Image Encoder (ViT-H), Prompt Encoder, and Mask Decoder.
Wrap-Up: Learning Roadmap
Computer vision evolves quickly. A recommended learning path:
- Fundamentals: OpenCV and NumPy image processing → torchvision practice
- Classification: fine-tune ResNet/EfficientNet → apply to a custom dataset
- Detection: experiment with YOLOv8 → custom training → ONNX/TensorRT deployment
- Segmentation: experiment with SAM → custom-train Mask R-CNN/DeepLabV3+
- Advanced: ViT/DINOv2 feature extraction → fine-tune Stable Diffusion
Applying each stage to a Kaggle competition or a real project is the fastest way to learn.
Computer Vision Complete Guide: CNN, ViT, YOLO, and Stable Diffusion
- Introduction
- 1. Image Fundamentals: Pixels, Channels, Convolution
- 2. CNN Architecture Evolution
- 3. Object Detection: YOLO, DETR, Faster R-CNN
- 4. Segmentation: DeepLab, Mask R-CNN, SAM
- 5. Vision Transformer: ViT, Swin, DINOv2
- 6. Generative Models: GAN, Diffusion, ControlNet
- 7. Production Pipeline: DataLoader to TensorRT
- Quiz: Deep Understanding Check
- Wrap-Up: Learning Roadmap
Introduction
Computer Vision is the field of AI that enables machines to understand images and video. From smartphone face unlock to autonomous vehicle obstacle detection to medical imaging assistance — computer vision powers them all.
This guide takes you from pixel-level fundamentals through Vision Transformers, Stable Diffusion, and production deployment with PyTorch code examples throughout.
1. Image Fundamentals: Pixels, Channels, Convolution
1.1 Digital Image Structure
A digital image is a 2D grid of pixels.
- Grayscale image: 2D array of shape H x W, pixel values 0–255
- Color image (RGB): 3D tensor of shape H x W x 3, each channel 0–255
- Resolution: image dimensions (e.g., 1920x1080), pixel density (DPI)
import torch
import torchvision.transforms as T
from PIL import Image
# Load image and convert to tensor
img = Image.open("sample.jpg").convert("RGB")
transform = T.Compose([
T.Resize((224, 224)),
T.ToTensor(), # [0,255] -> [0.0,1.0], HWC -> CHW
T.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
])
tensor = transform(img) # shape: [3, 224, 224]
print(f"Shape: {tensor.shape}, dtype: {tensor.dtype}")
1.2 Convolution and Common Filters
Convolution slides a small kernel (filter) across the entire image to extract features.
import torch
import torch.nn.functional as F
# 3x3 Sobel edge detection kernel (horizontal)
kernel = torch.tensor([
[-1, 0, 1],
[-2, 0, 2],
[-1, 0, 1]
], dtype=torch.float32).unsqueeze(0).unsqueeze(0) # [1, 1, 3, 3]
# Apply convolution to a grayscale image
gray_tensor = T.ToTensor()(T.Grayscale()(img)).unsqueeze(0)
edges = F.conv2d(gray_tensor, kernel, padding=1)
| Kernel Type | Purpose | Use Case |
|---|---|---|
| Sobel | Edge detection (H/V) | Lane detection in self-driving |
| Gaussian | Blurring, noise removal | Image pre-processing |
| Laplacian | Edge sharpening | Image enhancement |
| Average | Mean blur | Downsampling |
1.3 Albumentations Augmentation Pipeline
import albumentations as A
from albumentations.pytorch import ToTensorV2
import cv2
train_transform = A.Compose([
A.RandomResizedCrop(224, 224, scale=(0.7, 1.0)),
A.HorizontalFlip(p=0.5),
A.ColorJitter(brightness=0.4, contrast=0.4,
saturation=0.4, hue=0.1, p=0.8),
A.GaussianBlur(blur_limit=(3, 7), p=0.2),
A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1,
rotate_limit=15, p=0.5),
A.Normalize(mean=(0.485, 0.456, 0.406),
std=(0.229, 0.224, 0.225)),
ToTensorV2(),
])
image = cv2.cvtColor(cv2.imread("sample.jpg"), cv2.COLOR_BGR2RGB)
augmented = train_transform(image=image)["image"]
2. CNN Architecture Evolution
2.1 Major Architecture Timeline
| Year | Architecture | Key Contribution | ImageNet Top-1 |
|---|---|---|---|
| 1998 | LeNet-5 | First practical CNN, conv+pool structure | - |
| 2012 | AlexNet | ReLU, Dropout, GPU training | 63.3% |
| 2014 | VGGNet | Stacking deep 3x3 kernels | 74.4% |
| 2015 | ResNet-50 | Skip connections, residual learning | 76.0% |
| 2017 | DenseNet | Direct connections between all layers | 77.4% |
| 2019 | EfficientNet-B7 | Compound scaling | 84.4% |
| 2022 | ConvNeXt-L | Transformer design principles in CNN | 86.6% |
2.2 ResNet: The Residual Learning Revolution
ResNet's key innovation is the skip connection (residual connection). Adding input x directly to the output resolves the vanishing gradient problem. The residual block output F(x) + x has gradient dF/dx + 1, guaranteeing a minimum gradient of 1 through any residual path.
import torch
import torch.nn as nn
import torchvision.models as models
class ResNetClassifier(nn.Module):
def __init__(self, num_classes: int, pretrained: bool = True):
super().__init__()
weights = models.ResNet50_Weights.IMAGENET1K_V2 if pretrained else None
self.backbone = models.resnet50(weights=weights)
in_features = self.backbone.fc.in_features
self.backbone.fc = nn.Sequential(
nn.Dropout(0.3),
nn.Linear(in_features, num_classes)
)
def forward(self, x):
return self.backbone(x)
def train_one_epoch(model, loader, optimizer, criterion, device):
model.train()
total_loss, correct = 0.0, 0
for images, labels in loader:
images, labels = images.to(device), labels.to(device)
optimizer.zero_grad()
outputs = model(images)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
total_loss += loss.item() * images.size(0)
correct += (outputs.argmax(1) == labels).sum().item()
return total_loss / len(loader.dataset), correct / len(loader.dataset)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ResNetClassifier(num_classes=10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
criterion = nn.CrossEntropyLoss()
2.3 EfficientNet: Compound Scaling
EfficientNet introduces a compound coefficient that scales width, depth, and resolution together with a fixed ratio, achieving superior accuracy/efficiency trade-offs.
import timm
# EfficientNet-B4 fine-tuning
model = timm.create_model(
"efficientnet_b4",
pretrained=True,
num_classes=100,
drop_rate=0.3
)
# Freeze backbone (feature extraction mode)
for name, param in model.named_parameters():
if "classifier" not in name:
param.requires_grad = False
2.4 ConvNeXt: Modernizing CNN
ConvNeXt applies Transformer design principles (large kernels, LayerNorm, GELU, fewer activations) to a pure CNN, matching Swin Transformer performance with comparable speed.
model = timm.create_model(
"convnext_large",
pretrained=True,
num_classes=1000
)
# Uses 7x7 depthwise conv, LayerNorm, GELU activation
3. Object Detection: YOLO, DETR, Faster R-CNN
3.1 Detection Approach Comparison
| Approach | Model | Speed | Accuracy | Notes |
|---|---|---|---|---|
| Anchor-based 2-stage | Faster R-CNN | Slow | High | RPN + classifier separated |
| Anchor-based 1-stage | YOLOv8 | Fast | Med-High | Single pass inference |
| Anchor-free 1-stage | YOLOv10, FCOS | Very fast | High | NMS-free, no anchors |
| Transformer | DETR | Medium | High | End-to-end, relational modeling |
3.2 YOLOv8 in Practice
from ultralytics import YOLO
# Load pretrained model
model = YOLO("yolov8n.pt") # nano: speed first
# model = YOLO("yolov8x.pt") # extra-large: accuracy first
# Single image inference
results = model("image.jpg", conf=0.25, iou=0.45)
for result in results:
for box in result.boxes:
cls = int(box.cls[0])
conf = float(box.conf[0])
x1, y1, x2, y2 = box.xyxy[0].tolist()
print(f"Class: {result.names[cls]}, Conf: {conf:.2f}")
# Fine-tune on custom dataset
model.train(
data="custom.yaml",
epochs=100,
imgsz=640,
batch=16,
lr0=0.01,
device=0
)
# Evaluation
metrics = model.val()
print(f"mAP50: {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")
3.3 YOLOv10: NMS-Free Detection
YOLOv10 eliminates the Non-Maximum Suppression (NMS) post-processing step through dual label assignment and consistency matching, enabling true end-to-end inference with reduced latency.
from ultralytics import YOLO
model = YOLO("yolov10n.pt")
# Returns final detections directly without NMS — lower latency
results = model.predict("video.mp4", stream=True)
for frame_result in results:
print(frame_result.boxes)
3.4 DETR: Detection Transformer
DETR uses bipartite matching loss to predict the final set of boxes directly, eliminating both anchors and NMS.
import torch
from transformers import DetrImageProcessor, DetrForObjectDetection
from PIL import Image
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
img = Image.open("image.jpg")
inputs = processor(images=img, return_tensors="pt")
outputs = model(**inputs)
target_sizes = torch.tensor([img.size[::-1]])
results = processor.post_process_object_detection(
outputs, target_sizes=target_sizes, threshold=0.9
)[0]
for score, label, box in zip(
results["scores"], results["labels"], results["boxes"]
):
name = model.config.id2label[label.item()]
print(f"{name}: {score:.3f} at {[round(i, 2) for i in box.tolist()]}")
4. Segmentation: DeepLab, Mask R-CNN, SAM
4.1 Segmentation Task Types
- Semantic: Class label per pixel (car, road, sky...)
- Instance: Distinguishes individual objects of the same class (car1, car2...)
- Panoptic: Combines semantic and instance segmentation
4.2 SAM: Segment Anything Model
Meta's SAM accepts prompts (points, boxes, masks) and segments any object zero-shot. It uses a 3-module architecture: Image Encoder (ViT-H), Prompt Encoder, and Mask Decoder, trained on the SA-1B dataset (1 billion masks).
import torch
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")
predictor = SamPredictor(sam)
# Set image
image = cv2.cvtColor(cv2.imread("image.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)
# Point-prompted segmentation
input_point = np.array([[500, 375]]) # click location (x, y)
input_label = np.array([1]) # 1=foreground, 0=background
masks, scores, logits = predictor.predict(
point_coords=input_point,
point_labels=input_label,
multimask_output=True # return multiple candidate masks
)
best_mask = masks[scores.argmax()]
print(f"Mask shape: {best_mask.shape}, Score: {scores.max():.3f}")
# Box-prompted segmentation
input_box = np.array([100, 100, 400, 400])
masks_box, _, _ = predictor.predict(
box=input_box[None, :],
multimask_output=False
)
4.3 DeepLabV3+ Semantic Segmentation
DeepLabV3+ uses ASPP (Atrous Spatial Pyramid Pooling) to capture multi-scale context information.
import torch
import torchvision.models.segmentation as seg_models
model = seg_models.deeplabv3_resnet101(
weights=seg_models.DeepLabV3_ResNet101_Weights.COCO_WITH_VOC_LABELS_V1
)
model.eval()
with torch.no_grad():
output = model(tensor.unsqueeze(0))["out"] # [1, num_classes, H, W]
pred = output.argmax(dim=1).squeeze() # [H, W]
print(f"Segmentation map shape: {pred.shape}")
4.4 Mask R-CNN Instance Segmentation
from torchvision.models.detection import maskrcnn_resnet50_fpn_v2
from torchvision.models.detection import MaskRCNN_ResNet50_FPN_V2_Weights
from torchvision.utils import draw_segmentation_masks
import torchvision.transforms.functional as TF
from PIL import Image
weights = MaskRCNN_ResNet50_FPN_V2_Weights.DEFAULT
model = maskrcnn_resnet50_fpn_v2(weights=weights, box_score_thresh=0.75)
model.eval()
img = Image.open("people.jpg").convert("RGB")
inp = weights.transforms()(img).unsqueeze(0)
with torch.no_grad():
predictions = model(inp)
pred = predictions[0]
masks = (pred["masks"] > 0.5).squeeze(1)
img_uint8 = (TF.to_tensor(img) * 255).byte()
result = draw_segmentation_masks(img_uint8, masks, alpha=0.5)
5. Vision Transformer: ViT, Swin, DINOv2
5.1 ViT Core Concept
ViT (Vision Transformer) splits images into fixed-size patches (16x16) and treats each patch as a token fed into a Transformer. Unlike CNN, it learns global relationships without locality inductive bias.
import timm
import torch
# ViT-Base/16 fine-tuning
model = timm.create_model(
"vit_base_patch16_224",
pretrained=True,
num_classes=10,
img_size=224
)
# ViT benefits from strong augmentation + AdamW + cosine schedule
data_config = timm.data.resolve_model_data_config(model)
transforms_train = timm.data.create_transform(**data_config, is_training=True)
transforms_val = timm.data.create_transform(**data_config, is_training=False)
optimizer = torch.optim.AdamW(
model.parameters(), lr=1e-3, weight_decay=0.05
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
optimizer, T_max=100
)
5.2 Swin Transformer: Hierarchical ViT
Swin Transformer introduces hierarchical feature maps and Shifted Window Attention, bringing CNN locality into ViT. Each stage halves resolution and doubles channels, making it FPN-compatible.
| Model | Resolution | Parameters | ImageNet Top-1 |
|---|---|---|---|
| ViT-B/16 | 224 | 86M | 81.8% |
| Swin-T | 224 | 28M | 81.3% |
| Swin-B | 224 | 88M | 83.5% |
| Swin-L | 384 | 197M | 87.3% |
| DINOv2-L | 518 | 307M | 86.3% |
5.3 DINOv2: Self-Supervised Learning
DINOv2 is a general-purpose vision encoder trained on large-scale image data without labels, surpassing supervised ImageNet models via self-supervised learning.
import torch
import torchvision.transforms as T
# DINOv2 feature extractor (ready to use without fine-tuning)
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
dinov2.eval().cuda()
preprocess = T.Compose([
T.Resize(518),
T.CenterCrop(518),
T.ToTensor(),
T.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]),
])
with torch.no_grad():
# [B, 3, 518, 518] -> [B, 1024] features
features = dinov2(preprocess(img).unsqueeze(0).cuda())
print(f"Feature dim: {features.shape}") # [1, 1024]
# DINOv2 achieves strong performance even with k-NN classifiers
6. Generative Models: GAN, Diffusion, ControlNet
6.1 Generative Model Comparison
| Model Type | Representative | Training | Strengths | Weaknesses |
|---|---|---|---|---|
| GAN | StyleGAN3 | Adversarial | Fast generation | Unstable training, mode collapse |
| VAE | VQ-VAE-2 | Recon + KL | Stable training | Blurry images |
| Diffusion | DDPM, DDIM | Denoising | Best quality | Slow generation |
| LDM | Stable Diffusion | Latent diffusion | Quality + speed | High GPU memory |
6.2 Stable Diffusion: Latent Diffusion Model
Stable Diffusion is a Latent Diffusion Model (LDM). The U-Net progressively removes noise in latent space.
- Forward process: Add Gaussian noise to the image over T steps
- Reverse process: U-Net predicts noise epsilon from noisy latent z_t, timestep t, and text embedding
- VAE decoder: Restores the final latent vector to pixel space
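The forward process above has a closed form: z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps, where alpha_bar_t is the cumulative product of (1 - beta_t). This is a minimal sketch with a DDPM-style linear beta schedule (the schedule values are illustrative; Stable Diffusion's actual scheduler differs in detail):

```python
import torch

# Linear beta schedule, DDPM-style; values are illustrative
T_steps = 1000
betas = torch.linspace(1e-4, 0.02, T_steps)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retention

def q_sample(z0, t, noise):
    # z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps
    a = alphas_bar[t].sqrt()
    b = (1.0 - alphas_bar[t]).sqrt()
    return a * z0 + b * noise

z0 = torch.randn(1, 4, 64, 64)   # SD latent shape: [B, 4, H/8, W/8]
eps = torch.randn_like(z0)
z_t = q_sample(z0, torch.tensor(999), eps)
# Near t = T the latent is almost pure noise (alpha_bar_999 is tiny),
# which is why the reverse process can start from random Gaussian noise.
```

The U-Net is trained to recover `eps` from `(z_t, t, text_embedding)`; at sampling time the scheduler inverts this map step by step.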
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
import torch

# SDXL 1.0 text-to-image generation (SDXL requires the dedicated XL pipeline)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config
)
pipe.to("cuda")

image = pipe(
    prompt="a photorealistic cat on a desk, 8k, studio lighting",
    negative_prompt="blurry, low quality, cartoon",
    num_inference_steps=25,
    guidance_scale=7.5,
    width=1024,
    height=1024,
).images[0]
image.save("generated.png")
6.3 ControlNet: Structure-Conditioned Generation
ControlNet adds precise structural control to image generation using edge maps, depth maps, poses, and other spatial conditions.
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image
import torch
import cv2
import numpy as np
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Generate a Canny edge map from the reference image
ref_image = np.array(load_image("reference.jpg"))
edges = cv2.Canny(ref_image, 100, 200)
edges_rgb = np.stack([edges] * 3, axis=-1)  # 1-channel -> 3-channel

result = pipe(
    "a beautiful landscape painting",
    image=Image.fromarray(edges_rgb),
    num_inference_steps=20
).images[0]
result.save("controlnet_output.png")
7. Production Pipeline: DataLoader to TensorRT
7.1 Custom Dataset and DataLoader
from torch.utils.data import Dataset, DataLoader
from pathlib import Path
import albumentations as A
from albumentations.pytorch import ToTensorV2
import cv2

class CustomDataset(Dataset):
    def __init__(self, root: str, split: str = "train"):
        self.root = Path(root)
        self.image_paths = list(
            (self.root / split / "images").glob("*.jpg")
        )
        self.label_paths = [
            self.root / split / "labels" / p.with_suffix(".txt").name
            for p in self.image_paths
        ]
        self.transform = self._get_transforms(split)

    def _get_transforms(self, split: str):
        if split == "train":
            return A.Compose([
                A.RandomResizedCrop(224, 224, scale=(0.7, 1.0)),
                A.HorizontalFlip(p=0.5),
                A.ColorJitter(brightness=0.4, contrast=0.4,
                              saturation=0.4, hue=0.1, p=0.8),
                A.GaussianBlur(blur_limit=(3, 7), p=0.2),
                A.Normalize(mean=(0.485, 0.456, 0.406),
                            std=(0.229, 0.224, 0.225)),
                ToTensorV2(),
            ])
        return A.Compose([
            A.Resize(256, 256),
            A.CenterCrop(224, 224),
            A.Normalize(mean=(0.485, 0.456, 0.406),
                        std=(0.229, 0.224, 0.225)),
            ToTensorV2(),
        ])

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # OpenCV loads BGR; convert to RGB before augmentation
        image = cv2.cvtColor(
            cv2.imread(str(self.image_paths[idx])), cv2.COLOR_BGR2RGB
        )
        label = int(self.label_paths[idx].read_text().strip())
        augmented = self.transform(image=image)
        return augmented["image"], label

train_dataset = CustomDataset("data/", split="train")
train_loader = DataLoader(
    train_dataset, batch_size=32, shuffle=True,
    num_workers=4, pin_memory=True, persistent_workers=True
)
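To sanity-check the loader configuration without real image files on disk, a synthetic dataset can stand in for CustomDataset (the image count and class count here are illustrative):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Synthetic stand-in for CustomDataset: 64 "RGB images", 10 classes
images = torch.randn(64, 3, 224, 224)
labels = torch.randint(0, 10, (64,))
loader = DataLoader(
    TensorDataset(images, labels),
    batch_size=32, shuffle=True,
    num_workers=0,                           # use >0 workers with real disk I/O
    pin_memory=torch.cuda.is_available(),    # page-locked host memory speeds H2D copies
)
batch_imgs, batch_labels = next(iter(loader))
print(batch_imgs.shape, batch_labels.shape)  # [32, 3, 224, 224], [32]
```

In-memory tensors need no worker processes; `num_workers > 0` and `persistent_workers=True` pay off when each sample requires JPEG decoding and augmentation.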
7.2 ONNX Export
import torch
import torch.onnx

# `model` is the trained nn.Module from Section 7.1 (e.g., a fine-tuned ResNet)
model.eval()
dummy_input = torch.randn(1, 3, 224, 224, device="cuda")

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    export_params=True,
    opset_version=17,
    do_constant_folding=True,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"}
    }
)
print("ONNX export complete: model.onnx")

# Verify with ONNX Runtime
import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession(
    "model.onnx", providers=["CUDAExecutionProvider"]
)
input_name = sess.get_inputs()[0].name
result = sess.run(None, {input_name: dummy_input.cpu().numpy()})
print(f"ONNX inference result shape: {result[0].shape}")
7.3 TensorRT Optimization
# TensorRT conversion using trtexec
# (an ONNX model with a dynamic batch axis needs min/opt/max shapes)
trtexec \
  --onnx=model.onnx \
  --saveEngine=model_fp16.engine \
  --fp16 \
  --workspace=4096 \
  --minShapes=input:1x3x224x224 \
  --optShapes=input:8x3x224x224 \
  --maxShapes=input:32x3x224x224
import tensorrt as trt
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit

def trt_inference(engine_path: str, input_data: np.ndarray) -> np.ndarray:
    logger = trt.Logger(trt.Logger.WARNING)
    with open(engine_path, "rb") as f:
        engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
    context = engine.create_execution_context()
    # Dynamic-batch engines need the actual input shape set before execution
    context.set_binding_shape(0, input_data.shape)
    input_mem = cuda.mem_alloc(input_data.nbytes)
    # 1000 output classes assumes an ImageNet-style classifier head
    output = np.empty((input_data.shape[0], 1000), dtype=np.float32)
    output_mem = cuda.mem_alloc(output.nbytes)
    cuda.memcpy_htod(input_mem, input_data)
    context.execute_v2([int(input_mem), int(output_mem)])
    cuda.memcpy_dtoh(output, output_mem)
    return output
Quiz: Deep Understanding Check
Q1. How do ResNet skip connections solve the vanishing gradient problem?
Answer: During backpropagation, gradients flow directly through the skip connection, preventing them from vanishing in deep layers.
Explanation: In a plain deep network, backpropagated gradients shrink multiplicatively through each layer. A residual block computes F(x) + x, whose local Jacobian is dF/dx + I: the identity term gives gradients a direct path that bypasses F entirely, so they cannot be driven to zero by a long chain of small layer derivatives. This enables stable training even beyond 100 layers.
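The effect is easy to measure numerically. In this toy comparison, each "layer" has a local derivative of at most 0.01; stacked plainly, the gradient collapses, while the residual form keeps it at 1 or above (depth and scale here are illustrative):

```python
import torch

depth = 50

# Plain chain: 50 layers whose local derivative is at most 0.01
x_plain = torch.ones(1, dtype=torch.float64, requires_grad=True)
y = x_plain
for _ in range(depth):
    y = 0.01 * torch.tanh(y)
y.backward()  # gradient ~ product of 50 tiny factors -> vanishes

# Residual chain: same F(x) = 0.01 * tanh(x), but with a skip connection
x_res = torch.ones(1, dtype=torch.float64, requires_grad=True)
y = x_res
for _ in range(depth):
    y = y + 0.01 * torch.tanh(y)
y.backward()  # each factor is 1 + small positive term -> gradient stays >= 1

print(x_plain.grad.item(), x_res.grad.item())
```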
Q2. Why is YOLO better suited for real-time inference than Faster R-CNN?
Answer: YOLO completes detection in a single forward pass, while Faster R-CNN requires two stages: a Region Proposal Network and a separate classifier.
Explanation: Faster R-CNN follows a pipeline of (1) region proposals via RPN, (2) RoI Pooling, and (3) classification and bounding-box regression. YOLO divides the image into a grid and simultaneously predicts all boxes and classes in one CNN pass. YOLOv8n achieves 80+ FPS on an A100 GPU, making it suitable for real-time applications.
Q3. Why does Vision Transformer outperform CNN on large-scale data?
Answer: ViT's Self-Attention learns global relationships between all patches without inductive bias, discovering optimal representations directly from data.
Explanation: CNNs encode locality and translation equivariance as inductive biases. These biases help with small datasets but limit representational capacity at scale. Given sufficient data (e.g., JFT-300M), ViT freely learns global patterns without these constraints, surpassing CNNs in accuracy.
Q4. What is the U-Net's role in Stable Diffusion's denoising diffusion process?
Answer: The U-Net predicts and removes the noise added to the latent vector at each timestep, and integrates the text condition (CLIP embedding) via cross-attention.
Explanation: In the forward process, Gaussian noise is added to the image latent over T steps. In the reverse process, the U-Net receives the noisy latent z_t, timestep t, and text embedding, and predicts the noise component epsilon. The VAE decoder then reconstructs the final image from the denoised latent.
Q5. How does SAM's prompt-based segmentation differ from conventional methods?
Answer: SAM segments arbitrary objects zero-shot from various prompts (points, boxes, masks) without task-specific training.
Explanation: Traditional segmentation models (DeepLab, Mask R-CNN) are trained with supervision on a fixed set of classes. SAM is a general-purpose model trained on the SA-1B dataset (1 billion masks), which segments any region specified by the user regardless of class. Its 3-module architecture — Image Encoder (ViT-H), Prompt Encoder, and Mask Decoder — separates encoding from flexible prompt conditioning.
Wrap-Up: Learning Roadmap
Computer vision evolves rapidly. Here is a recommended learning path:
- Foundations: OpenCV, NumPy image manipulation → torchvision hands-on
- Classification: ResNet/EfficientNet fine-tuning → custom dataset
- Detection: YOLOv8 experiments → custom training → ONNX/TensorRT deployment
- Segmentation: SAM exploration → Mask R-CNN / DeepLabV3+ custom training
- Advanced: ViT/DINOv2 feature extraction → Stable Diffusion fine-tuning
The fastest path to mastery is applying each concept in Kaggle competitions or real-world projects. Welcome to the world of computer vision!