Computer Vision Complete Guide: From CNN to ViT, YOLO, and Stable Diffusion
- Introduction
- 1. Image Fundamentals: Pixels, Channels, Convolution
- 2. CNN Architecture Evolution
- 3. Object Detection: YOLO, DETR, Faster R-CNN
- 4. Segmentation: DeepLab, Mask R-CNN, SAM
- 5. Vision Transformer: ViT, Swin, DINOv2
- 6. Generative Models: GAN, Diffusion, ControlNet
- 7. Production Pipeline: DataLoader to TensorRT
- Quiz: Deep Understanding Check
- Wrap-Up: Learning Roadmap
Introduction
Computer Vision is the field of AI that enables machines to understand images and video. From smartphone face unlock to autonomous vehicle obstacle detection to medical imaging assistance, computer vision powers them all.
This guide takes you from pixel-level fundamentals through Vision Transformers and Stable Diffusion, with PyTorch code examples throughout.
1. Image Fundamentals: Pixels, Channels, Convolution
1.1 Digital Image Structure
A digital image is a 2D grid of pixels.
- Grayscale image: 2D array of shape H x W, pixel values 0-255
- Color image (RGB): 3D tensor of shape H x W x 3, each channel 0-255
- Resolution: image dimensions (e.g., 1920x1080) and pixel density (DPI)
import torch
import torchvision.transforms as T
from PIL import Image
# Load image and convert to a tensor
img = Image.open("sample.jpg").convert("RGB")
transform = T.Compose([
T.Resize((224, 224)),
T.ToTensor(), # [0,255] -> [0.0,1.0], HWC -> CHW
T.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
])
tensor = transform(img) # shape: [3, 224, 224]
print(f"Shape: {tensor.shape}, dtype: {tensor.dtype}")
1.2 Convolution and Common Filters
Convolution slides a small kernel (filter) across the entire image to extract features.
import torch
import torch.nn.functional as F
# 3x3 Sobel edge detection kernel (horizontal gradient)
kernel = torch.tensor([
[-1, 0, 1],
[-2, 0, 2],
[-1, 0, 1]
], dtype=torch.float32).unsqueeze(0).unsqueeze(0) # [1, 1, 3, 3]
# Apply the convolution to a grayscale image
gray_tensor = T.ToTensor()(T.Grayscale()(img)).unsqueeze(0)
edges = F.conv2d(gray_tensor, kernel, padding=1)
| Kernel Type | Purpose | Use Case |
|---|---|---|
| Sobel | Edge detection (horizontal/vertical) | Lane detection in self-driving |
| Gaussian | Blurring, noise removal | Image pre-processing |
| Laplacian | Edge emphasis | Sharpening |
| Average | Mean blur | Downsampling |
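For instance, the Gaussian kernel from the table can be applied exactly like the Sobel kernel above. This is a minimal sketch using the standard 3x3 binomial approximation of a Gaussian; the dummy input is illustrative.

```python
import torch
import torch.nn.functional as F

# 3x3 Gaussian blur kernel (binomial approximation), normalized to sum to 1
gauss = torch.tensor([
    [1., 2., 1.],
    [2., 4., 2.],
    [1., 2., 1.],
]) / 16.0
gauss = gauss.reshape(1, 1, 3, 3)  # [out_ch, in_ch, kH, kW]

# Apply to a dummy single-channel image; because the weights sum to 1,
# a flat image is unchanged away from the zero-padded border
img = torch.ones(1, 1, 8, 8)
blurred = F.conv2d(img, gauss, padding=1)
print(blurred.shape)  # torch.Size([1, 1, 8, 8])
```

Normalizing the kernel to sum to 1 preserves overall brightness, which is why blurring does not darken or lighten the image.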
1.3 Albumentations Augmentation Pipeline
import albumentations as A
from albumentations.pytorch import ToTensorV2
import cv2
train_transform = A.Compose([
A.RandomResizedCrop(224, 224, scale=(0.7, 1.0)),
A.HorizontalFlip(p=0.5),
A.ColorJitter(brightness=0.4, contrast=0.4,
saturation=0.4, hue=0.1, p=0.8),
A.GaussianBlur(blur_limit=(3, 7), p=0.2),
A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1,
rotate_limit=15, p=0.5),
A.Normalize(mean=(0.485, 0.456, 0.406),
std=(0.229, 0.224, 0.225)),
ToTensorV2(),
])
image = cv2.cvtColor(cv2.imread("sample.jpg"), cv2.COLOR_BGR2RGB)
augmented = train_transform(image=image)["image"]
2. CNN Architecture Evolution
2.1 Major Architecture Timeline
| Year | Architecture | Key Contribution | ImageNet Top-1 |
|---|---|---|---|
| 1998 | LeNet-5 | First practical CNN, conv + pooling structure | - |
| 2012 | AlexNet | ReLU, Dropout, GPU training | 63.3% |
| 2014 | VGGNet | Stacking deep 3x3 kernels | 74.4% |
| 2015 | ResNet-50 | Skip connections, residual learning | 76.0% |
| 2017 | DenseNet | Direct connections between all layers | 77.4% |
| 2019 | EfficientNet-B7 | Compound scaling | 84.4% |
| 2022 | ConvNeXt-L | Transformer design principles in a CNN | 86.6% |
2.2 ResNet: The Residual Learning Revolution
ResNet's key innovation is the **skip connection (residual connection)**. Adding the input x directly to the output mitigates the vanishing gradient problem: the residual block computes F(x) + x, whose derivative is dF/dx + 1, so the identity path always contributes a gradient of 1 and the signal cannot vanish entirely.
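The idea is easiest to see in a minimal residual block. This is a sketch of the pattern, not the exact torchvision implementation (which also handles stride and channel changes with a projection shortcut):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Minimal residual block: output = relu(F(x) + x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Skip connection: gradients flow unchanged through "+ x"
        return self.relu(out + x)

block = BasicResidualBlock(64)
x = torch.randn(2, 64, 32, 32)
out = block(x)
print(out.shape)  # torch.Size([2, 64, 32, 32])
```

Note that the addition requires the output of F(x) to have the same shape as x; when shapes differ, ResNet uses a 1x1 convolution on the shortcut.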
import torch
import torch.nn as nn
import torchvision.models as models
class ResNetClassifier(nn.Module):
def __init__(self, num_classes: int, pretrained: bool = True):
super().__init__()
weights = models.ResNet50_Weights.IMAGENET1K_V2 if pretrained else None
self.backbone = models.resnet50(weights=weights)
in_features = self.backbone.fc.in_features
self.backbone.fc = nn.Sequential(
nn.Dropout(0.3),
nn.Linear(in_features, num_classes)
)
def forward(self, x):
return self.backbone(x)
def train_one_epoch(model, loader, optimizer, criterion, device):
model.train()
total_loss, correct = 0.0, 0
for images, labels in loader:
images, labels = images.to(device), labels.to(device)
optimizer.zero_grad()
outputs = model(images)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
total_loss += loss.item() * images.size(0)
correct += (outputs.argmax(1) == labels).sum().item()
return total_loss / len(loader.dataset), correct / len(loader.dataset)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ResNetClassifier(num_classes=10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
criterion = nn.CrossEntropyLoss()
2.3 EfficientNet: Compound Scaling
EfficientNet introduces a compound coefficient that scales **width, depth, and resolution** together in a fixed ratio.
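Numerically, the rule fixes base coefficients alpha, beta, gamma (found by grid search in the EfficientNet paper: roughly alpha=1.2, beta=1.1, gamma=1.15, chosen so that alpha * beta^2 * gamma^2 is close to 2) and raises each to the compound coefficient phi. A sketch of the arithmetic:

```python
# EfficientNet compound scaling:
# depth x alpha**phi, width x beta**phi, resolution x gamma**phi
alpha, beta, gamma = 1.2, 1.1, 1.15

def scale_factors(phi: int) -> tuple[float, float, float]:
    """Multipliers for depth, width, and resolution at compound coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

for phi in range(4):
    d, w, r = scale_factors(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")

# Each unit increase of phi roughly doubles FLOPs, since FLOPs scale with
# depth * width**2 * resolution**2:
print(round(alpha * beta**2 * gamma**2, 2))  # 1.92
```

Scaling all three dimensions together avoids the diminishing returns of scaling only depth (as VGG did) or only width.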
import timm
# Fine-tune EfficientNet-B4
model = timm.create_model(
"efficientnet_b4",
pretrained=True,
num_classes=100,
drop_rate=0.3
)
# Freeze the backbone (feature extraction mode)
for name, param in model.named_parameters():
if "classifier" not in name:
param.requires_grad = False
2.4 ConvNeXt: Modernizing the CNN
ConvNeXt applies Transformer design principles (large kernels, LayerNorm, GELU, fewer activations) to a pure CNN, matching Swin Transformer performance.
model = timm.create_model(
"convnext_large",
pretrained=True,
num_classes=1000
)
# Uses 7x7 depthwise conv, LayerNorm, and GELU activation
3. Object Detection: YOLO, DETR, Faster R-CNN
3.1 Detection Approach Comparison
| Approach | Model | Speed | Accuracy | Notes |
|---|---|---|---|---|
| Anchor-based 2-stage | Faster R-CNN | Slow | High | Separate RPN + classifier |
| Anchor-based 1-stage | YOLOv8 | Fast | Med-High | Single-pass inference |
| Anchor-free 1-stage | YOLOv10, FCOS | Very fast | High | NMS-free, no anchors |
| Transformer | DETR | Medium | High | End-to-end, relational modeling |
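The NMS step that the anchor-based pipelines rely on is driven by Intersection-over-Union between boxes. A minimal sketch of the IoU computation, with boxes in (x1, y1, x2, y2) format:

```python
def iou_xyxy(a, b):
    """Intersection-over-Union of two boxes in (x1, y1, x2, y2) format."""
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Two 10x10 boxes sharing half their area
print(iou_xyxy((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333...
```

NMS keeps the highest-confidence box and suppresses any other box whose IoU with it exceeds a threshold (commonly 0.45 to 0.5).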
3.2 YOLOv8 in Practice
from ultralytics import YOLO
# Load a pretrained model
model = YOLO("yolov8n.pt") # nano: speed first
# model = YOLO("yolov8x.pt") # extra-large: accuracy first
# Single-image inference
results = model("image.jpg", conf=0.25, iou=0.45)
for result in results:
for box in result.boxes:
cls = int(box.cls[0])
conf = float(box.conf[0])
x1, y1, x2, y2 = box.xyxy[0].tolist()
print(f"Class: {result.names[cls]}, Conf: {conf:.2f}")
# Fine-tune on a custom dataset
model.train(
data="custom.yaml",
epochs=100,
imgsz=640,
batch=16,
lr0=0.01,
device=0
)
# Evaluation
metrics = model.val()
print(f"mAP50: {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")
3.3 YOLOv10: NMS-Free Detection
YOLOv10 removes the Non-Maximum Suppression (NMS) post-processing step and achieves end-to-end training through **dual label assignment**: consistency matching lets the model exploit one-to-one and one-to-many predictions at the same time.
from ultralytics import YOLO
model = YOLO("yolov10n.pt")
# Returns final detections directly without NMS, reducing latency
results = model.predict("video.mp4", stream=True)
for frame_result in results:
print(frame_result.boxes)
3.4 DETR: Detection Transformer
DETR uses a bipartite matching loss to predict the final set of boxes directly, without anchors or NMS.
import torch
from transformers import DetrImageProcessor, DetrForObjectDetection
from PIL import Image
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
img = Image.open("image.jpg")
inputs = processor(images=img, return_tensors="pt")
outputs = model(**inputs)
target_sizes = torch.tensor([img.size[::-1]])
results = processor.post_process_object_detection(
outputs, target_sizes=target_sizes, threshold=0.9
)[0]
for score, label, box in zip(
results["scores"], results["labels"], results["boxes"]
):
name = model.config.id2label[label.item()]
print(f"{name}: {score:.3f} at {[round(i,2) for i in box.tolist()]}")
4. Segmentation: DeepLab, Mask R-CNN, SAM
4.1 Segmentation Task Types
- Semantic: a class label per pixel (car, road, sky...)
- Instance: distinguishes individual objects within a class (car1, car2...)
- Panoptic: semantic + instance combined
4.2 SAM: Segment Anything Model
Meta's SAM takes **prompts (points, boxes, masks)** and segments arbitrary objects. It is a general-purpose segmentation model with a 3-module architecture of Image Encoder (ViT-H), Prompt Encoder, and Mask Decoder, trained on the SA-1B dataset (1.1 billion masks).
import torch
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")
predictor = SamPredictor(sam)
# Set the image
image = cv2.cvtColor(cv2.imread("image.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)
# Point-prompted segmentation
input_point = np.array([[500, 375]]) # click location (x, y)
input_label = np.array([1]) # 1 = foreground, 0 = background
masks, scores, logits = predictor.predict(
point_coords=input_point,
point_labels=input_label,
multimask_output=True # return multiple candidate masks
)
best_mask = masks[scores.argmax()]
print(f"Mask shape: {best_mask.shape}, Score: {scores.max():.3f}")
# Box prompt
input_box = np.array([100, 100, 400, 400])
masks_box, _, _ = predictor.predict(
box=input_box[None, :],
multimask_output=False
)
4.3 DeepLabV3+ Semantic Segmentation
DeepLabV3+ uses ASPP (Atrous Spatial Pyramid Pooling) to capture context at multiple scales.
import torch
import torchvision.models.segmentation as seg_models
model = seg_models.deeplabv3_resnet101(
weights=seg_models.DeepLabV3_ResNet101_Weights.COCO_WITH_VOC_LABELS_V1
)
model.eval()
with torch.no_grad():
output = model(tensor.unsqueeze(0))["out"] # [1, num_classes, H, W]
pred = output.argmax(dim=1).squeeze() # [H, W]
print(f"Segmentation map shape: {pred.shape}")
4.4 Mask R-CNN Instance Segmentation
from torchvision.models.detection import maskrcnn_resnet50_fpn_v2
from torchvision.models.detection import MaskRCNN_ResNet50_FPN_V2_Weights
from torchvision.utils import draw_segmentation_masks
import torchvision.transforms.functional as TF
from PIL import Image
weights = MaskRCNN_ResNet50_FPN_V2_Weights.DEFAULT
model = maskrcnn_resnet50_fpn_v2(weights=weights, box_score_thresh=0.75)
model.eval()
img = Image.open("people.jpg").convert("RGB")
inp = weights.transforms()(img).unsqueeze(0)
with torch.no_grad():
predictions = model(inp)
pred = predictions[0]
masks = (pred["masks"] > 0.5).squeeze(1)
img_uint8 = (TF.to_tensor(img) * 255).byte()
result = draw_segmentation_masks(img_uint8, masks, alpha=0.5)
5. Vision Transformer: ViT, Swin, DINOv2
5.1 ViT Core Concept
ViT (Vision Transformer) splits an image into **fixed-size patches (16x16)** and feeds each patch into a Transformer as a token. Unlike a CNN, it learns global relationships without a locality inductive bias.
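Patch embedding itself is just a strided convolution followed by flattening. A minimal sketch with ViT-Base/16 dimensions (patch size 16, embedding dimension 768):

```python
import torch
import torch.nn as nn

# ViT patch embedding: a Conv2d with kernel = stride = patch size
# turns [B, 3, 224, 224] into 14x14 = 196 patch tokens of dim 768
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)

img = torch.randn(1, 3, 224, 224)
tokens = patch_embed(img)                   # [1, 768, 14, 14]
tokens = tokens.flatten(2).transpose(1, 2)  # [1, 196, 768]

# Prepend a learnable [CLS] token, as in the original ViT
cls_token = nn.Parameter(torch.zeros(1, 1, 768))
sequence = torch.cat([cls_token.expand(1, -1, -1), tokens], dim=1)
print(sequence.shape)  # torch.Size([1, 197, 768])
```

The resulting 197-token sequence (plus position embeddings) is what the Transformer encoder consumes; the final [CLS] representation is typically used for classification.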
import timm
import torch
# Fine-tune ViT-Base/16
model = timm.create_model(
"vit_base_patch16_224",
pretrained=True,
num_classes=10,
img_size=224
)
# ViT benefits from stronger augmentation + AdamW + a cosine schedule
data_config = timm.data.resolve_model_data_config(model)
transforms_train = timm.data.create_transform(**data_config, is_training=True)
transforms_val = timm.data.create_transform(**data_config, is_training=False)
optimizer = torch.optim.AdamW(
model.parameters(), lr=1e-3, weight_decay=0.05
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
optimizer, T_max=100
)
5.2 Swin Transformer: Hierarchical ViT
Swin Transformer brings CNN-style locality to ViT through hierarchical feature maps and Shifted Window Attention. Each stage halves the resolution and increases the channel count, making it compatible with FPNs.
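The window mechanism starts from plain window partitioning: attention is computed only within small non-overlapping windows, which keeps cost linear in image size. A sketch of the partition step (feature-map sizes follow the usual Swin-T stage-1 assumptions):

```python
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Split a [B, H, W, C] feature map into non-overlapping windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size,
               W // window_size, window_size, C)
    # Group the window grid dims together, then flatten to one window per row
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)
    return windows

feat = torch.randn(1, 56, 56, 96)  # assumed Swin-T stage-1 feature map
wins = window_partition(feat, window_size=7)
print(wins.shape)  # torch.Size([64, 7, 7, 96]): an 8x8 grid of 7x7 windows
```

Shifting the window grid by half a window on alternating layers is what lets information flow between neighboring windows.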
| 모델 | 해상도 | 파라미터 | ImageNet Top-1 |
|---|---|---|---|
| ViT-B/16 | 224 | 86M | 81.8% |
| Swin-T | 224 | 28M | 81.3% |
| Swin-B | 224 | 88M | 83.5% |
| Swin-L | 384 | 197M | 87.3% |
| DINOv2-L | 518 | 307M | 86.3% |
5.3 DINOv2: Self-Supervised Learning at Scale
DINOv2 is a general-purpose vision encoder trained on large-scale image data without labels; its self-supervised learning surpasses supervised ImageNet models.
import torch
# DINOv2 feature extractor (usable without any training)
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
dinov2.eval().cuda()
# Handle images of arbitrary size
import torchvision.transforms as T
preprocess = T.Compose([
T.Resize(518),
T.CenterCrop(518),
T.ToTensor(),
T.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]),
])
with torch.no_grad():
# [B, 3, 518, 518] -> [B, 1024] features
features = dinov2(preprocess(img).unsqueeze(0).cuda())
print(f"Feature dim: {features.shape}") # [1, 1024]
# DINOv2 is strong even with a simple k-NN classifier
6. Generative Models: GAN, Diffusion, ControlNet
6.1 Generative Model Comparison
| Model Family | Representative | Training | Strengths | Weaknesses |
|---|---|---|---|---|
| GAN | StyleGAN3 | Adversarial | Fast generation | Unstable training, mode collapse |
| VAE | VQ-VAE-2 | Reconstruction + KL | Stable training | Blurry images |
| Diffusion | DDPM, DDIM | Denoising | Best quality | Slow generation |
| LDM | Stable Diffusion | Latent-space diffusion | Quality/speed balance | High GPU memory |
6.2 Stable Diffusion: Latent Diffusion Model
Stable Diffusion is a **Latent Diffusion Model (LDM)**: a U-Net removes noise step by step in latent space.
- Forward process: Gaussian noise is added to the image over T steps
- Reverse process: the U-Net takes the noisy latent z_t, timestep t, and a text embedding, and predicts the noise epsilon
- VAE decoder: restores the final latent vector to pixel space
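The forward process above has a convenient closed form: x_t can be sampled directly from x_0 in one shot, without iterating through the intermediate steps. A sketch using a linear beta schedule (schedule values and variable names are illustrative, following DDPM conventions):

```python
import torch

# Closed-form forward diffusion:
# x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)
T_steps = 1000
betas = torch.linspace(1e-4, 0.02, T_steps)         # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # abar_t

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample a noised version of x0 at timestep t in a single step."""
    abar = alphas_cumprod[t]
    eps = torch.randn_like(x0)
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps

x0 = torch.randn(1, 4, 64, 64)  # a latent-sized tensor, for illustration
x_mid = q_sample(x0, t=500)
x_end = q_sample(x0, t=T_steps - 1)
# By the final step, abar_t is near zero: almost pure noise remains
print(float(alphas_cumprod[-1]))
```

Training then amounts to asking the U-Net to predict eps from (x_t, t, text embedding); this closed form is what makes sampling random timesteps per training batch cheap.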
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
import torch
# SDXL 1.0 text-to-image generation (SDXL uses the dedicated XL pipeline class)
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
use_safetensors=True,
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
pipe.scheduler.config
)
pipe.to("cuda")
image = pipe(
prompt="a photorealistic cat on a desk, 8k, studio lighting",
negative_prompt="blurry, low quality, cartoon",
num_inference_steps=25,
guidance_scale=7.5,
width=1024,
height=1024
).images[0]
image.save("generated.png")
6.3 ControlNet: Structure-Conditioned Generation
ControlNet provides fine-grained control over image generation using structural conditions such as edge maps, depth maps, and poses.
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image
import cv2
import numpy as np
from PIL import Image
controlnet = ControlNetModel.from_pretrained(
"lllyasviel/sd-controlnet-canny",
torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
controlnet=controlnet,
torch_dtype=torch.float16
).to("cuda")
# Generate a Canny edge map
ref_image = np.array(load_image("reference.jpg"))
edges = cv2.Canny(ref_image, 100, 200)
edges_rgb = np.stack([edges] * 3, axis=-1)
result = pipe(
"a beautiful landscape painting",
image=Image.fromarray(edges_rgb),
num_inference_steps=20
).images[0]
result.save("controlnet_output.png")
7. Production Pipeline: DataLoader to TensorRT
7.1 Custom Dataset and DataLoader
from torch.utils.data import Dataset, DataLoader
from pathlib import Path
import albumentations as A
from albumentations.pytorch import ToTensorV2
import cv2
class CustomDataset(Dataset):
def __init__(self, root: str, split: str = "train"):
self.root = Path(root)
self.image_paths = list(
(self.root / split / "images").glob("*.jpg")
)
self.label_paths = [
self.root / split / "labels" / p.with_suffix(".txt").name
for p in self.image_paths
]
self.transform = self._get_transforms(split)
def _get_transforms(self, split: str):
if split == "train":
return A.Compose([
A.RandomResizedCrop(224, 224, scale=(0.7, 1.0)),
A.HorizontalFlip(p=0.5),
A.ColorJitter(brightness=0.4, contrast=0.4,
saturation=0.4, hue=0.1, p=0.8),
A.GaussianBlur(blur_limit=(3, 7), p=0.2),
A.Normalize(mean=(0.485, 0.456, 0.406),
std=(0.229, 0.224, 0.225)),
ToTensorV2(),
])
return A.Compose([
A.Resize(256, 256),
A.CenterCrop(224, 224),
A.Normalize(mean=(0.485, 0.456, 0.406),
std=(0.229, 0.224, 0.225)),
ToTensorV2(),
])
def __len__(self):
return len(self.image_paths)
def __getitem__(self, idx):
image = cv2.cvtColor(
cv2.imread(str(self.image_paths[idx])), cv2.COLOR_BGR2RGB
)
label = int(self.label_paths[idx].read_text().strip())
augmented = self.transform(image=image)
return augmented["image"], label
train_dataset = CustomDataset("data/", split="train")
train_loader = DataLoader(
train_dataset, batch_size=32, shuffle=True,
num_workers=4, pin_memory=True, persistent_workers=True
)
7.2 ONNX Export
import torch
import torch.onnx
model.eval()
dummy_input = torch.randn(1, 3, 224, 224, device="cuda")
torch.onnx.export(
model,
dummy_input,
"model.onnx",
export_params=True,
opset_version=17,
do_constant_folding=True,
input_names=["input"],
output_names=["output"],
dynamic_axes={
"input": {0: "batch_size"},
"output": {0: "batch_size"}
}
)
print("ONNX export complete: model.onnx")
# Validate with ONNX Runtime
import onnxruntime as ort
import numpy as np
sess = ort.InferenceSession(
"model.onnx", providers=["CUDAExecutionProvider"]
)
input_name = sess.get_inputs()[0].name
result = sess.run(None, {input_name: dummy_input.cpu().numpy()})
print(f"ONNX inference output shape: {result[0].shape}")
7.3 TensorRT Optimization
# Convert to TensorRT (using trtexec)
trtexec \
--onnx=model.onnx \
--saveEngine=model_fp16.engine \
--fp16 \
--workspace=4096 \
--optShapes=input:8x3x224x224
import tensorrt as trt
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
def trt_inference(engine_path: str, input_data: np.ndarray) -> np.ndarray:
logger = trt.Logger(trt.Logger.WARNING)
with open(engine_path, "rb") as f:
engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
input_mem = cuda.mem_alloc(input_data.nbytes)
output = np.empty((input_data.shape[0], 1000), dtype=np.float32)
output_mem = cuda.mem_alloc(output.nbytes)
cuda.memcpy_htod(input_mem, input_data)
context.execute_v2([int(input_mem), int(output_mem)])
cuda.memcpy_dtoh(output, output_mem)
return output
Quiz: Deep Understanding Check
Q1. How do ResNet's skip connections address vanishing gradients?
Answer: During backpropagation the gradient flows directly through the skip connection, so it reaches deep layers without vanishing.
Explanation: In a plain deep network, the backpropagated gradient shrinks multiplicatively as it passes through layers. The residual block F(x) + x has derivative dF/dx + 1, so the identity path always carries a gradient of 1. This is what makes networks with 100+ layers trainable.
Q2. Why is YOLO better suited to real-time inference than Faster R-CNN?
Answer: YOLO completes detection in a single forward pass, while Faster R-CNN needs two stages: a Region Proposal Network plus a classifier.
Explanation: Faster R-CNN is a multi-stage pipeline: (1) the RPN generates candidate regions, (2) RoI Pooling, (3) classification and regression. YOLO divides the image into a grid and predicts all boxes and classes simultaneously in one CNN pass. YOLOv8n can exceed 80 FPS on an A100 GPU.
Q3. Why do Vision Transformers outperform CNNs on large-scale data?
Answer: ViT's self-attention learns global relationships between all patches, learning optimal representations directly from data without inductive bias.
Explanation: CNNs build in the inductive biases of locality and translation equivariance. With limited data these biases help, but on large-scale data (e.g., JFT-300M) they limit expressiveness. ViT learns global patterns freely without these biases, so with enough data it overtakes CNNs.
Q4. What is the U-Net's role in Stable Diffusion's denoising process?
Answer: At each timestep the U-Net predicts and removes the noise added to the latent vector, integrating the text condition (CLIP embeddings) via cross-attention.
Explanation: The forward process adds Gaussian noise to the image over T steps. In the reverse process, the U-Net takes the noisy latent z_t, timestep t, and text embedding as input and predicts the noise component epsilon. The VAE decoder then restores the final latent vector to pixel space.
Q5. How does SAM's prompt-based segmentation differ from prior approaches?
Answer: SAM takes diverse prompts (points, boxes, masks) and segments arbitrary objects zero-shot, without task-specific training.
Explanation: Conventional segmentation models (DeepLab, Mask R-CNN) are supervised on a fixed set of classes. SAM is a general-purpose model trained on the SA-1B dataset (1.1 billion masks) that isolates whatever region the user specifies, regardless of class. It consists of three modules: Image Encoder (ViT-H), Prompt Encoder, and Mask Decoder.
Wrap-Up: Learning Roadmap
Computer vision evolves quickly. A recommended learning path:
- Fundamentals: OpenCV and NumPy image processing → torchvision practice
- Classification: fine-tune ResNet/EfficientNet → apply to a custom dataset
- Detection: experiment with YOLOv8 → custom training → ONNX/TensorRT deployment
- Segmentation: experiment with SAM → custom-train Mask R-CNN/DeepLabV3+
- Advanced: ViT/DINOv2 feature extraction → fine-tune Stable Diffusion
Applying each stage to a Kaggle competition or a real project is the fastest way to learn.
Computer Vision Complete Guide: CNN, ViT, YOLO, and Stable Diffusion
- Introduction
- 1. Image Fundamentals: Pixels, Channels, Convolution
- 2. CNN Architecture Evolution
- 3. Object Detection: YOLO, DETR, Faster R-CNN
- 4. Segmentation: DeepLab, Mask R-CNN, SAM
- 5. Vision Transformer: ViT, Swin, DINOv2
- 6. Generative Models: GAN, Diffusion, ControlNet
- 7. Production Pipeline: DataLoader to TensorRT
- Quiz: Deep Understanding Check
- Wrap-Up: Learning Roadmap
Introduction
Computer Vision is the field of AI that enables machines to understand images and video. From smartphone face unlock to autonomous vehicle obstacle detection to medical imaging assistance — computer vision powers them all.
This guide takes you from pixel-level fundamentals through Vision Transformers, Stable Diffusion, and production deployment with PyTorch code examples throughout.
1. Image Fundamentals: Pixels, Channels, Convolution
1.1 Digital Image Structure
A digital image is a 2D grid of pixels.
- Grayscale image: 2D array of shape H x W, pixel values 0–255
- Color image (RGB): 3D tensor of shape H x W x 3, each channel 0–255
- Resolution: image dimensions (e.g., 1920x1080), pixel density (DPI)
import torch
import torchvision.transforms as T
from PIL import Image
# Load image and convert to tensor
img = Image.open("sample.jpg").convert("RGB")
transform = T.Compose([
T.Resize((224, 224)),
T.ToTensor(), # [0,255] -> [0.0,1.0], HWC -> CHW
T.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
])
tensor = transform(img) # shape: [3, 224, 224]
print(f"Shape: {tensor.shape}, dtype: {tensor.dtype}")
1.2 Convolution and Common Filters
Convolution slides a small kernel (filter) across the entire image to extract features.
import torch
import torch.nn.functional as F
# 3x3 Sobel edge detection kernel (horizontal)
kernel = torch.tensor([
[-1, 0, 1],
[-2, 0, 2],
[-1, 0, 1]
], dtype=torch.float32).unsqueeze(0).unsqueeze(0) # [1, 1, 3, 3]
# Apply convolution to a grayscale image
gray_tensor = T.ToTensor()(T.Grayscale()(img)).unsqueeze(0)
edges = F.conv2d(gray_tensor, kernel, padding=1)
| Kernel Type | Purpose | Use Case |
|---|---|---|
| Sobel | Edge detection (H/V) | Lane detection in self-driving |
| Gaussian | Blurring, noise removal | Image pre-processing |
| Laplacian | Edge sharpening | Image enhancement |
| Average | Mean blur | Downsampling |
1.3 Albumentations Augmentation Pipeline
import albumentations as A
from albumentations.pytorch import ToTensorV2
import cv2
train_transform = A.Compose([
A.RandomResizedCrop(224, 224, scale=(0.7, 1.0)),
A.HorizontalFlip(p=0.5),
A.ColorJitter(brightness=0.4, contrast=0.4,
saturation=0.4, hue=0.1, p=0.8),
A.GaussianBlur(blur_limit=(3, 7), p=0.2),
A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1,
rotate_limit=15, p=0.5),
A.Normalize(mean=(0.485, 0.456, 0.406),
std=(0.229, 0.224, 0.225)),
ToTensorV2(),
])
image = cv2.cvtColor(cv2.imread("sample.jpg"), cv2.COLOR_BGR2RGB)
augmented = train_transform(image=image)["image"]
2. CNN Architecture Evolution
2.1 Major Architecture Timeline
| Year | Architecture | Key Contribution | ImageNet Top-1 |
|---|---|---|---|
| 1998 | LeNet-5 | First practical CNN, conv+pool structure | - |
| 2012 | AlexNet | ReLU, Dropout, GPU training | 63.3% |
| 2014 | VGGNet | Stacking deep 3x3 kernels | 74.4% |
| 2015 | ResNet-50 | Skip connections, residual learning | 76.0% |
| 2017 | DenseNet | Direct connections between all layers | 77.4% |
| 2019 | EfficientNet-B7 | Compound scaling | 84.4% |
| 2022 | ConvNeXt-L | Transformer design principles in CNN | 86.6% |
2.2 ResNet: The Residual Learning Revolution
ResNet's key innovation is the skip connection (residual connection). Adding input x directly to the output resolves the vanishing gradient problem. The residual block output F(x) + x has gradient dF/dx + 1, guaranteeing a minimum gradient of 1 through any residual path.
import torch
import torch.nn as nn
import torchvision.models as models
class ResNetClassifier(nn.Module):
def __init__(self, num_classes: int, pretrained: bool = True):
super().__init__()
weights = models.ResNet50_Weights.IMAGENET1K_V2 if pretrained else None
self.backbone = models.resnet50(weights=weights)
in_features = self.backbone.fc.in_features
self.backbone.fc = nn.Sequential(
nn.Dropout(0.3),
nn.Linear(in_features, num_classes)
)
def forward(self, x):
return self.backbone(x)
def train_one_epoch(model, loader, optimizer, criterion, device):
model.train()
total_loss, correct = 0.0, 0
for images, labels in loader:
images, labels = images.to(device), labels.to(device)
optimizer.zero_grad()
outputs = model(images)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
total_loss += loss.item() * images.size(0)
correct += (outputs.argmax(1) == labels).sum().item()
return total_loss / len(loader.dataset), correct / len(loader.dataset)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ResNetClassifier(num_classes=10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
criterion = nn.CrossEntropyLoss()
2.3 EfficientNet: Compound Scaling
EfficientNet introduces a compound coefficient that scales width, depth, and resolution together with a fixed ratio, achieving superior accuracy/efficiency trade-offs.
import timm
# EfficientNet-B4 fine-tuning
model = timm.create_model(
"efficientnet_b4",
pretrained=True,
num_classes=100,
drop_rate=0.3
)
# Freeze backbone (feature extraction mode)
for name, param in model.named_parameters():
if "classifier" not in name:
param.requires_grad = False
2.4 ConvNeXt: Modernizing CNN
ConvNeXt applies Transformer design principles (large kernels, LayerNorm, GELU, fewer activations) to a pure CNN, matching Swin Transformer performance with comparable speed.
model = timm.create_model(
"convnext_large",
pretrained=True,
num_classes=1000
)
# Uses 7x7 depthwise conv, LayerNorm, GELU activation
3. Object Detection: YOLO, DETR, Faster R-CNN
3.1 Detection Approach Comparison
| Approach | Model | Speed | Accuracy | Notes |
|---|---|---|---|---|
| Anchor-based 2-stage | Faster R-CNN | Slow | High | RPN + classifier separated |
| Anchor-based 1-stage | YOLOv8 | Fast | Med-High | Single pass inference |
| Anchor-free 1-stage | YOLOv10, FCOS | Very fast | High | NMS-free, no anchors |
| Transformer | DETR | Medium | High | End-to-end, relational modeling |
3.2 YOLOv8 in Practice
from ultralytics import YOLO
# Load pretrained model
model = YOLO("yolov8n.pt") # nano: speed first
# model = YOLO("yolov8x.pt") # extra-large: accuracy first
# Single image inference
results = model("image.jpg", conf=0.25, iou=0.45)
for result in results:
for box in result.boxes:
cls = int(box.cls[0])
conf = float(box.conf[0])
x1, y1, x2, y2 = box.xyxy[0].tolist()
print(f"Class: {result.names[cls]}, Conf: {conf:.2f}")
# Fine-tune on custom dataset
model.train(
data="custom.yaml",
epochs=100,
imgsz=640,
batch=16,
lr0=0.01,
device=0
)
# Evaluation
metrics = model.val()
print(f"mAP50: {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")
3.3 YOLOv10: NMS-Free Detection
YOLOv10 eliminates the Non-Maximum Suppression (NMS) post-processing step through dual label assignment and consistency matching, enabling true end-to-end inference with reduced latency.
from ultralytics import YOLO
model = YOLO("yolov10n.pt")
# Returns final detections directly without NMS — lower latency
results = model.predict("video.mp4", stream=True)
for frame_result in results:
print(frame_result.boxes)
3.4 DETR: Detection Transformer
DETR uses bipartite matching loss to predict the final set of boxes directly, eliminating both anchors and NMS.
import torch
from transformers import DetrImageProcessor, DetrForObjectDetection
from PIL import Image
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
img = Image.open("image.jpg")
inputs = processor(images=img, return_tensors="pt")
outputs = model(**inputs)
target_sizes = torch.tensor([img.size[::-1]])
results = processor.post_process_object_detection(
outputs, target_sizes=target_sizes, threshold=0.9
)[0]
for score, label, box in zip(
results["scores"], results["labels"], results["boxes"]
):
name = model.config.id2label[label.item()]
print(f"{name}: {score:.3f} at {[round(i, 2) for i in box.tolist()]}")
4. Segmentation: DeepLab, Mask R-CNN, SAM
4.1 Segmentation Task Types
- Semantic: Class label per pixel (car, road, sky...)
- Instance: Distinguishes individual objects of the same class (car1, car2...)
- Panoptic: Combines semantic and instance segmentation
4.2 SAM: Segment Anything Model
Meta's SAM accepts prompts (points, boxes, masks) and segments any object zero-shot. It uses a 3-module architecture: Image Encoder (ViT-H), Prompt Encoder, and Mask Decoder, trained on the SA-1B dataset (1 billion masks).
import torch
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")
predictor = SamPredictor(sam)
# Set image
image = cv2.cvtColor(cv2.imread("image.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)
# Point-prompted segmentation
input_point = np.array([[500, 375]]) # click location (x, y)
input_label = np.array([1]) # 1=foreground, 0=background
masks, scores, logits = predictor.predict(
point_coords=input_point,
point_labels=input_label,
multimask_output=True # return multiple candidate masks
)
best_mask = masks[scores.argmax()]
print(f"Mask shape: {best_mask.shape}, Score: {scores.max():.3f}")
# Box-prompted segmentation
input_box = np.array([100, 100, 400, 400])
masks_box, _, _ = predictor.predict(
box=input_box[None, :],
multimask_output=False
)
4.3 DeepLabV3+ Semantic Segmentation
DeepLabV3+ uses ASPP (Atrous Spatial Pyramid Pooling) to capture multi-scale context information.
import torch
import torchvision.models.segmentation as seg_models
model = seg_models.deeplabv3_resnet101(
weights=seg_models.DeepLabV3_ResNet101_Weights.COCO_WITH_VOC_LABELS_V1
)
model.eval()
with torch.no_grad():
output = model(tensor.unsqueeze(0))["out"] # [1, num_classes, H, W]
pred = output.argmax(dim=1).squeeze() # [H, W]
print(f"Segmentation map shape: {pred.shape}")
4.4 Mask R-CNN Instance Segmentation
from torchvision.models.detection import maskrcnn_resnet50_fpn_v2
from torchvision.models.detection import MaskRCNN_ResNet50_FPN_V2_Weights
from torchvision.utils import draw_segmentation_masks
import torchvision.transforms.functional as TF
from PIL import Image
weights = MaskRCNN_ResNet50_FPN_V2_Weights.DEFAULT
model = maskrcnn_resnet50_fpn_v2(weights=weights, box_score_thresh=0.75)
model.eval()
img = Image.open("people.jpg").convert("RGB")
inp = weights.transforms()(img).unsqueeze(0)
with torch.no_grad():
predictions = model(inp)
pred = predictions[0]
masks = (pred["masks"] > 0.5).squeeze(1)
img_uint8 = (TF.to_tensor(img) * 255).byte()
result = draw_segmentation_masks(img_uint8, masks, alpha=0.5)
5. Vision Transformer: ViT, Swin, DINOv2
5.1 ViT Core Concept
ViT (Vision Transformer) splits images into fixed-size patches (16x16) and treats each patch as a token fed into a Transformer. Unlike CNN, it learns global relationships without locality inductive bias.
import timm
import torch
# ViT-Base/16 fine-tuning
model = timm.create_model(
"vit_base_patch16_224",
pretrained=True,
num_classes=10,
img_size=224
)
# ViT benefits from strong augmentation + AdamW + cosine schedule
data_config = timm.data.resolve_model_data_config(model)
transforms_train = timm.data.create_transform(**data_config, is_training=True)
transforms_val = timm.data.create_transform(**data_config, is_training=False)
optimizer = torch.optim.AdamW(
model.parameters(), lr=1e-3, weight_decay=0.05
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
optimizer, T_max=100
)
5.2 Swin Transformer: Hierarchical ViT
Swin Transformer introduces hierarchical feature maps and Shifted Window Attention, bringing CNN locality into ViT. Each stage halves resolution and doubles channels, making it FPN-compatible.
| Model | Resolution | Parameters | ImageNet Top-1 |
|---|---|---|---|
| ViT-B/16 | 224 | 86M | 81.8% |
| Swin-T | 224 | 28M | 81.3% |
| Swin-B | 224 | 88M | 83.5% |
| Swin-L | 384 | 197M | 87.3% |
| DINOv2-L | 518 | 307M | 86.3% |
5.3 DINOv2: Self-Supervised Learning
DINOv2 is a general-purpose vision encoder trained on large-scale image data without labels, surpassing supervised ImageNet models via self-supervised learning.
import torch
import torchvision.transforms as T
# DINOv2 feature extractor (ready to use without fine-tuning)
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
dinov2.eval().cuda()
preprocess = T.Compose([
T.Resize(518),
T.CenterCrop(518),
T.ToTensor(),
T.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]),
])
with torch.no_grad():
# [B, 3, 518, 518] -> [B, 1024] features
features = dinov2(preprocess(img).unsqueeze(0).cuda())
print(f"Feature dim: {features.shape}") # [1, 1024]
# DINOv2 achieves strong performance even with k-NN classifiers
6. Generative Models: GAN, Diffusion, ControlNet
6.1 Generative Model Comparison
| Model Type | Representative | Training | Strengths | Weaknesses |
|---|---|---|---|---|
| GAN | StyleGAN3 | Adversarial | Fast generation | Unstable training, mode collapse |
| VAE | VQ-VAE-2 | Recon + KL | Stable training | Blurry images |
| Diffusion | DDPM, DDIM | Denoising | Best quality | Slow generation |
| LDM | Stable Diffusion | Latent diffusion | Quality + speed | High GPU memory |
6.2 Stable Diffusion: Latent Diffusion Model
Stable Diffusion is a Latent Diffusion Model (LDM). The U-Net progressively removes noise in latent space.
- Forward process: Add Gaussian noise to the image over T steps
- Reverse process: U-Net predicts noise epsilon from noisy latent z_t, timestep t, and text embedding
- VAE decoder: Restores the final latent vector to pixel space
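The forward process above has a closed form: z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps, where alpha_bar_t is the cumulative product of (1 - beta_t). This is a minimal sketch with a DDPM-style linear beta schedule (the schedule values are illustrative; Stable Diffusion's actual scheduler differs in detail):

```python
import torch

# Linear beta schedule, DDPM-style; values are illustrative
T_steps = 1000
betas = torch.linspace(1e-4, 0.02, T_steps)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retention

def q_sample(z0, t, noise):
    # z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps
    a = alphas_bar[t].sqrt()
    b = (1.0 - alphas_bar[t]).sqrt()
    return a * z0 + b * noise

z0 = torch.randn(1, 4, 64, 64)   # SD latent shape: [B, 4, H/8, W/8]
eps = torch.randn_like(z0)
z_t = q_sample(z0, torch.tensor(999), eps)
# Near t = T the latent is almost pure noise (alpha_bar_999 is tiny),
# which is why the reverse process can start from random Gaussian noise.
```

The U-Net is trained to recover `eps` from `(z_t, t, text_embedding)`; at sampling time the scheduler inverts this map step by step.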
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
import torch

# SDXL 1.0 text-to-image generation (SDXL requires the dedicated XL pipeline)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config
)
pipe.to("cuda")

image = pipe(
    prompt="a photorealistic cat on a desk, 8k, studio lighting",
    negative_prompt="blurry, low quality, cartoon",
    num_inference_steps=25,
    guidance_scale=7.5,
    width=1024,
    height=1024,
).images[0]
image.save("generated.png")
6.3 ControlNet: Structure-Conditioned Generation
ControlNet adds precise structural control to image generation using edge maps, depth maps, poses, and other spatial conditions.
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image
import torch
import cv2
import numpy as np
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Generate a Canny edge map from the reference image
ref_image = np.array(load_image("reference.jpg"))
edges = cv2.Canny(ref_image, 100, 200)
edges_rgb = np.stack([edges] * 3, axis=-1)  # 1-channel -> 3-channel

result = pipe(
    "a beautiful landscape painting",
    image=Image.fromarray(edges_rgb),
    num_inference_steps=20
).images[0]
result.save("controlnet_output.png")
7. Production Pipeline: DataLoader to TensorRT
7.1 Custom Dataset and DataLoader
from torch.utils.data import Dataset, DataLoader
from pathlib import Path
import albumentations as A
from albumentations.pytorch import ToTensorV2
import cv2

class CustomDataset(Dataset):
    def __init__(self, root: str, split: str = "train"):
        self.root = Path(root)
        self.image_paths = list(
            (self.root / split / "images").glob("*.jpg")
        )
        self.label_paths = [
            self.root / split / "labels" / p.with_suffix(".txt").name
            for p in self.image_paths
        ]
        self.transform = self._get_transforms(split)

    def _get_transforms(self, split: str):
        if split == "train":
            return A.Compose([
                A.RandomResizedCrop(224, 224, scale=(0.7, 1.0)),
                A.HorizontalFlip(p=0.5),
                A.ColorJitter(brightness=0.4, contrast=0.4,
                              saturation=0.4, hue=0.1, p=0.8),
                A.GaussianBlur(blur_limit=(3, 7), p=0.2),
                A.Normalize(mean=(0.485, 0.456, 0.406),
                            std=(0.229, 0.224, 0.225)),
                ToTensorV2(),
            ])
        return A.Compose([
            A.Resize(256, 256),
            A.CenterCrop(224, 224),
            A.Normalize(mean=(0.485, 0.456, 0.406),
                        std=(0.229, 0.224, 0.225)),
            ToTensorV2(),
        ])

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # OpenCV loads BGR; convert to RGB before augmentation
        image = cv2.cvtColor(
            cv2.imread(str(self.image_paths[idx])), cv2.COLOR_BGR2RGB
        )
        label = int(self.label_paths[idx].read_text().strip())
        augmented = self.transform(image=image)
        return augmented["image"], label

train_dataset = CustomDataset("data/", split="train")
train_loader = DataLoader(
    train_dataset, batch_size=32, shuffle=True,
    num_workers=4, pin_memory=True, persistent_workers=True
)
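To sanity-check the loader configuration without real image files on disk, a synthetic dataset can stand in for CustomDataset (the image count and class count here are illustrative):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Synthetic stand-in for CustomDataset: 64 "RGB images", 10 classes
images = torch.randn(64, 3, 224, 224)
labels = torch.randint(0, 10, (64,))
loader = DataLoader(
    TensorDataset(images, labels),
    batch_size=32, shuffle=True,
    num_workers=0,                           # use >0 workers with real disk I/O
    pin_memory=torch.cuda.is_available(),    # page-locked host memory speeds H2D copies
)
batch_imgs, batch_labels = next(iter(loader))
print(batch_imgs.shape, batch_labels.shape)  # [32, 3, 224, 224], [32]
```

In-memory tensors need no worker processes; `num_workers > 0` and `persistent_workers=True` pay off when each sample requires JPEG decoding and augmentation.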
7.2 ONNX Export
import torch
import torch.onnx

# `model` is the trained nn.Module from Section 7.1 (e.g., a fine-tuned ResNet)
model.eval()
dummy_input = torch.randn(1, 3, 224, 224, device="cuda")

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    export_params=True,
    opset_version=17,
    do_constant_folding=True,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"}
    }
)
print("ONNX export complete: model.onnx")

# Verify with ONNX Runtime
import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession(
    "model.onnx", providers=["CUDAExecutionProvider"]
)
input_name = sess.get_inputs()[0].name
result = sess.run(None, {input_name: dummy_input.cpu().numpy()})
print(f"ONNX inference result shape: {result[0].shape}")
7.3 TensorRT Optimization
# TensorRT conversion using trtexec
# (an ONNX model with a dynamic batch axis needs min/opt/max shapes)
trtexec \
  --onnx=model.onnx \
  --saveEngine=model_fp16.engine \
  --fp16 \
  --workspace=4096 \
  --minShapes=input:1x3x224x224 \
  --optShapes=input:8x3x224x224 \
  --maxShapes=input:32x3x224x224
import tensorrt as trt
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit

def trt_inference(engine_path: str, input_data: np.ndarray) -> np.ndarray:
    logger = trt.Logger(trt.Logger.WARNING)
    with open(engine_path, "rb") as f:
        engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
    context = engine.create_execution_context()
    # Dynamic-batch engines need the actual input shape set before execution
    context.set_binding_shape(0, input_data.shape)
    input_mem = cuda.mem_alloc(input_data.nbytes)
    # 1000 output classes assumes an ImageNet-style classifier head
    output = np.empty((input_data.shape[0], 1000), dtype=np.float32)
    output_mem = cuda.mem_alloc(output.nbytes)
    cuda.memcpy_htod(input_mem, input_data)
    context.execute_v2([int(input_mem), int(output_mem)])
    cuda.memcpy_dtoh(output, output_mem)
    return output
Quiz: Deep Understanding Check
Q1. How do ResNet skip connections solve the vanishing gradient problem?
Answer: During backpropagation, gradients flow directly through the skip connection, preventing them from vanishing in deep layers.
Explanation: In a plain deep network, backpropagated gradients shrink multiplicatively through each layer. A residual block computes F(x) + x, whose local Jacobian is dF/dx + I: the identity term gives gradients a direct path that bypasses F entirely, so they cannot be driven to zero by a long chain of small layer derivatives. This enables stable training even beyond 100 layers.
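The effect is easy to measure numerically. In this toy comparison, each "layer" has a local derivative of at most 0.01; stacked plainly, the gradient collapses, while the residual form keeps it at 1 or above (depth and scale here are illustrative):

```python
import torch

depth = 50

# Plain chain: 50 layers whose local derivative is at most 0.01
x_plain = torch.ones(1, dtype=torch.float64, requires_grad=True)
y = x_plain
for _ in range(depth):
    y = 0.01 * torch.tanh(y)
y.backward()  # gradient ~ product of 50 tiny factors -> vanishes

# Residual chain: same F(x) = 0.01 * tanh(x), but with a skip connection
x_res = torch.ones(1, dtype=torch.float64, requires_grad=True)
y = x_res
for _ in range(depth):
    y = y + 0.01 * torch.tanh(y)
y.backward()  # each factor is 1 + small positive term -> gradient stays >= 1

print(x_plain.grad.item(), x_res.grad.item())
```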
Q2. Why is YOLO better suited for real-time inference than Faster R-CNN?
Answer: YOLO completes detection in a single forward pass, while Faster R-CNN requires two stages: a Region Proposal Network and a separate classifier.
Explanation: Faster R-CNN follows a pipeline of (1) region proposals via RPN, (2) RoI Pooling, and (3) classification and bounding-box regression. YOLO divides the image into a grid and simultaneously predicts all boxes and classes in one CNN pass. YOLOv8n achieves 80+ FPS on an A100 GPU, making it suitable for real-time applications.
Q3. Why does Vision Transformer outperform CNN on large-scale data?
Answer: ViT's Self-Attention learns global relationships between all patches without inductive bias, discovering optimal representations directly from data.
Explanation: CNNs encode locality and translation equivariance as inductive biases. These biases help with small datasets but limit representational capacity at scale. Given sufficient data (e.g., JFT-300M), ViT freely learns global patterns without these constraints, surpassing CNNs in accuracy.
Q4. What is the U-Net's role in Stable Diffusion's denoising diffusion process?
Answer: The U-Net predicts and removes the noise added to the latent vector at each timestep, and integrates the text condition (CLIP embedding) via cross-attention.
Explanation: In the forward process, Gaussian noise is added to the image latent over T steps. In the reverse process, the U-Net receives the noisy latent z_t, timestep t, and text embedding, and predicts the noise component epsilon. The VAE decoder then reconstructs the final image from the denoised latent.
Q5. How does SAM's prompt-based segmentation differ from conventional methods?
Answer: SAM segments arbitrary objects zero-shot from various prompts (points, boxes, masks) without task-specific training.
Explanation: Traditional segmentation models (DeepLab, Mask R-CNN) are trained with supervision on a fixed set of classes. SAM is a general-purpose model trained on the SA-1B dataset (1 billion masks), which segments any region specified by the user regardless of class. Its 3-module architecture — Image Encoder (ViT-H), Prompt Encoder, and Mask Decoder — separates encoding from flexible prompt conditioning.
Wrap-Up: Learning Roadmap
Computer vision evolves rapidly. Here is a recommended learning path:
- Foundations: OpenCV, NumPy image manipulation → torchvision hands-on
- Classification: ResNet/EfficientNet fine-tuning → custom dataset
- Detection: YOLOv8 experiments → custom training → ONNX/TensorRT deployment
- Segmentation: SAM exploration → Mask R-CNN / DeepLabV3+ custom training
- Advanced: ViT/DINOv2 feature extraction → Stable Diffusion fine-tuning
The fastest path to mastery is applying each concept in Kaggle competitions or real-world projects. Welcome to the world of computer vision!