비전 모델 개발·파인튜닝 완전 가이드 2026 — CNN, ViT, DETR, SAM 2, VLM 까지 실전 의사결정 트리

프롤로그 — 2026년의 비전 엔지니어가 마주하는 진짜 질문

2026년 5월 어느 월요일, 비전팀 리드가 받은 티켓 한 줄:

"공장 라인 카메라 영상에서 불량 부품을 표시해 주세요. 클래스는 12개, 하루 30만 장 처리, 정확도 95% 이상, 엣지 디바이스(Jetson Orin Nano)에서 30fps."

같은 요구사항을 2018년에 받았다면 답은 명확했다 — ResNet50 백본에 RetinaNet 헤드, COCO 사전학습 후 자체 데이터로 파인튜닝. 끝.

2026년에는 답이 8개쯤 된다.

YOLOv11/12 로 그냥 파인튜닝
RT-DETR 로 트랜스포머 기반 탐지
SAM 2 로 마스크를 뽑고 분류기를 얹는다
Florence-2 같은 비전 파운데이션 모델에 프롬프트만 던진다
Gemini 2.5 Vision 이나 Claude Vision 에 사진을 보내서 자연어로 결과를 받는다
CLIP 으로 임베딩만 뽑고 kNN 으로 분류한다
OWLv2/Grounding DINO 로 텍스트 프롬프트 기반 zero-shot 탐지
위 중 두세 개를 파이프라인으로 결합한다

이 글은 그 8개의 선택지를 언제, 왜, 어떻게 골라야 하는지에 대한 가이드다. 단순히 "최신 모델이 좋다"가 아니라, 데이터 규모·정확도 요구·지연 예산·운영 비용 네 축으로 의사결정 트리를 만든다.

2026년 비전 엔지니어의 핵심 스킬은 더 이상 "모델 학습"이 아니다. "어떤 모델을 학습할지, 학습할 필요가 있긴 한지를 결정하는 것"이다.

1장 · 아키텍처 가족 — CNN 부터 VLM 까지 한눈에

비전 모델은 결국 "이미지 → 무언가"인데, 그 "무언가"가 무엇인지에 따라 아키텍처 가족이 갈린다.

가족	대표 모델	출현 시점	입력 처리	강점	약점
CNN	ResNet, EfficientNet, ConvNeXt v2	2012~	합성곱 + 풀링	적은 데이터, 빠름	글로벌 컨텍스트 약함
ViT	ViT, DeiT, Swin v2, EVA-02	2020~	패치 + Self-Attention	데이터 많을 때 최강	데이터 적으면 약함
DETR 계열	DETR, Deformable DETR, RT-DETR	2020~	인코더-디코더 + 쿼리	NMS 없는 탐지	학습 수렴 느림
SAM 계열	SAM, SAM 2, HQ-SAM	2023~	ViT 백본 + 마스크 디코더	프롬프트 가능한 세그멘테이션	의미적 분류 못 함
VLM	LLaVA-1.6, Qwen2.5-VL, Gemini Vision, Claude Vision, GPT-4V	2023~	이미지 인코더 + LLM	자연어로 추론, OCR, VQA	비싸고 느림, 정확도 비결정적
멀티모달 파운데이션	Florence-2, InternVL3, DINOv2	2023~	통합 ViT, multi-task head	zero-shot · few-shot	파인튜닝 노하우 필요

이 표는 외워야 한다. 다음 8장에 걸쳐 이 표의 각 행을 풀어쓴다.

핵심 원리: 모든 비전 모델은 결국 "이미지를 토큰의 시퀀스로 본다" — CNN은 그것을 공간 격자로, ViT는 패치 시퀀스로, SAM은 프롬프트와 함께, VLM은 LLM의 입력 토큰으로. 표현(representation)이 모델을 정의한다.

2장 · CNN 은 죽지 않았다 — 2026년의 위치

ViT가 모든 걸 다 잡아먹었다는 마케팅과 달리, 2026년에도 CNN은 살아있다. 특히 다음 상황에서.

CNN을 골라야 할 때

데이터가 적다 — 라벨 1만 장 이하. ViT는 사전학습 없이는 거의 학습이 안 된다.
엣지에서 돌려야 한다 — Jetson, Coral, 휴대폰. ConvNeXt-Tiny가 ViT-Tiny보다 같은 FLOPs에서 빠르다.
지연이 진짜 빡빡하다 — 1ms 이하. 작은 CNN은 GPU에서 0.5ms도 가능하다.
해상도가 매우 높다 — 4K 의료 영상. ViT는 패치 수가 폭발한다.

timm 으로 모델 한 줄 로드

PyTorch Image Models(timm)는 2026년에도 비전 백본의 사실상 표준이다. 1000개 이상의 사전학습 백본을 한 줄로 부른다.

import timm
import torch

# ConvNeXt v2 large, ImageNet-22k 사전학습, 22k-1k 파인튜닝
model = timm.create_model(
    'convnextv2_large.fcmae_ft_in22k_in1k_384',
    pretrained=True,
    num_classes=12,  # 우리 태스크의 클래스 수
)

# 입력 변환은 모델이 알려준다
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=True)

# 배치 차원을 포함한 텐서 모양은 `B x C x H x W`
x = torch.randn(2, 3, 384, 384)
logits = model(x)  # shape: 2 x 12

timm.list_models('convnext*', pretrained=True) 로 후보를 본다. pretrained_cfg 안에 mean, std, input_size, crop_pct 가 들어있어서 변환을 통일하기 쉽다.

CNN 학습 레시피 — 작은 데이터에서 잘 되는 법

사전학습 → 헤드만 학습 → 전체 파인튜닝 의 3단계
Mixup, CutMix, RandAugment — 라벨 1만 장 이하면 거의 필수
EMA(Exponential Moving Average) — 검증 정확도 1~2%p 공짜로 얻는다
Cosine schedule + 짧은 warmup — OneCycleLR 도 좋다
AdamW + weight decay 0.05 — 옛날의 SGD는 잊자

3장 · ViT — 데이터가 충분할 때의 챔피언

ViT(Vision Transformer)는 이미지를 16x16 같은 패치로 잘라 시퀀스로 만들고, 그 위에 Transformer를 얹는다. 핵심 통찰은 "이미지에 inductive bias가 적은 모델이 데이터만 충분하면 CNN을 이긴다" 였다.

2026년의 ViT 변종

Swin Transformer v2 — 윈도우 어텐션, 고해상도 입력 효율적
DeiT III — 데이터 효율적 학습 레시피
EVA-02 — Masked Image Modeling 사전학습, 최대 약 3억(304M) 파라미터의 비교적 작은 규모에서도 강한 전이 성능
DINOv2 — 자기지도학습, 라벨 없이 학습된 강력한 백본
SigLIP / SigLIP 2 — 대조학습 기반, 강력한 이미지-텍스트 임베딩

ViT를 골라야 할 때

조건	추천
라벨 데이터 10만 장 이상	ViT 또는 Swin
라벨이 적지만 이미지가 100만 장 이상 (자기지도학습 가능)	DINOv2 사전학습 → 헤드만 파인튜닝
OCR · 텍스트가 많은 이미지	SigLIP 2 또는 ViT-L 패치 14
다국어 OCR · 표 인식	InternVL3 의 ViT 백본

Hugging Face transformers 로 ViT 분류기 한 줄

from transformers import AutoImageProcessor, AutoModelForImageClassification
import torch
from PIL import Image

processor = AutoImageProcessor.from_pretrained('facebook/dinov2-large')
model = AutoModelForImageClassification.from_pretrained(
    'facebook/dinov2-large',
    num_labels=12,
    ignore_mismatched_sizes=True,  # 헤드만 새로
)

img = Image.open('part.jpg').convert('RGB')
inputs = processor(images=img, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
pred = outputs.logits.argmax(-1).item()

facebook/dinov2-large 대신 microsoft/swinv2-large-patch4-window12-192-22k 같은 백본도 동일 API다.

4장 · 객체 탐지 — YOLO vs DETR vs RT-DETR

탐지는 "어디에 무엇이"를 동시에 푸는 태스크다. 2026년 시점에서 실무는 크게 둘로 갈린다.

YOLO 가족 — 압도적 실전 점유율

Ultralytics 의 YOLOv8 부터 YOLOv12 까지, 그리고 YOLO-NAS, YOLOv9, YOLOv10 같은 변종들. 속도와 배포 편의성에서는 여전히 YOLO가 1등이다.

from ultralytics import YOLO

# 사전학습 모델 로드
model = YOLO('yolo11x.pt')

# 자체 데이터셋으로 파인튜닝
results = model.train(
    data='factory_parts.yaml',  # train/val 경로 + 클래스 이름
    epochs=100,
    imgsz=640,
    batch=32,
    device=0,
    optimizer='AdamW',
    lr0=0.001,
    cos_lr=True,
    patience=20,  # early stopping
    project='runs/factory',
)

# ONNX 로 내보내기 (엣지 배포)
model.export(format='onnx', dynamic=True, simplify=True)
# TensorRT 로 내보내기 (Jetson)
model.export(format='engine', half=True)

factory_parts.yaml 은 다음 형태다.

path: /data/factory
train: images/train
val: images/val
names:
  0: scratch
  1: dent
  2: discoloration
  3: missing_screw

YOLO의 단점: 대부분의 변종이 NMS 후처리에 의존하므로(YOLOv10 같은 NMS-free 변종은 예외) 빽빽한 객체 · 작은 객체에 약하고, DETR 류에 비해 글로벌 컨텍스트가 약하다.

DETR 가족 — NMS 없는 트랜스포머 탐지

DETR(DEtection TRansformer)는 객체 쿼리와 헝가리안 매칭으로 NMS를 제거했다. 2026년에는 RT-DETR 이 실용적 선택이다.

from transformers import RTDetrV2ForObjectDetection, RTDetrImageProcessor

processor = RTDetrImageProcessor.from_pretrained('PekingU/rtdetr_v2_r50vd')
model = RTDetrV2ForObjectDetection.from_pretrained(
    'PekingU/rtdetr_v2_r50vd',
    num_labels=12,
    ignore_mismatched_sizes=True,
)

YOLO vs DETR — 의사결정

상황	추천
30fps 이상 엣지 추론	YOLO
COCO 평균 ish 객체 밀도	YOLO
빽빽한 작은 객체 (위성, 의료)	RT-DETR 또는 Co-DETR
텍스트 프롬프트로 unseen 클래스 탐지	Grounding DINO / OWLv2
비디오 추적까지 필요	YOLO + ByteTrack 또는 SAM 2

OpenMMLab 은 언제 쓰나

mmdetection, mmsegmentation 등 OpenMMLab 생태계는 연구·실험에 강점이다. 같은 코드베이스에서 50+ 탐지 모델을 비교하고, 설정 파일 한 줄로 백본을 교체한다. 다만 학습 곡선이 가파르고 배포는 별도 도구가 필요하다. 프로덕션 첫 모델로는 Ultralytics 가 훨씬 빠르다.

5장 · 세그멘테이션과 SAM 2

세그멘테이션은 "픽셀 단위로 어디"다. 2026년에는 SAM 2 가 거의 모든 경우의 출발점이다.

SAM 2 — 이미지와 비디오를 모두 다루는 promptable segmentation

Meta 가 2024년 7월 공개한 SAM 2 는 "클릭·박스·마스크 프롬프트만 주면 어떤 객체든 세그먼트하고, 비디오 프레임 간에 자동 추적" 하는 모델이다. 핵심은:

이미지 + 비디오 통합 — 같은 모델로 둘 다
메모리 어텐션 — 비디오에서 객체를 추적하기 위해 과거 프레임의 표현을 기억
promptable — 고정된 클래스 목록이 아니라 클릭/박스/마스크 같은 프롬프트로 분할 대상을 지정

from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor
import numpy as np
from PIL import Image

sam2 = build_sam2('configs/sam2.1/sam2.1_hiera_l.yaml', 'sam2.1_hiera_large.pt')
predictor = SAM2ImagePredictor(sam2)

img = np.array(Image.open('part.jpg').convert('RGB'))  # SAM 2 는 RGB 배열을 기대한다
predictor.set_image(img)

# 클릭 한 번으로 마스크
point_coords = np.array([[450, 320]])
point_labels = np.array([1])  # 1 = foreground
masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,
)

SAM 2 의 진짜 활용법 — "분할은 SAM, 분류는 별도"

SAM 2 는 무엇인지 알려주지 않는다. "이 마스크는 객체 1, 저 마스크는 객체 2" 라고만 한다. 의미적 라벨이 필요하면 두 단계 파이프라인이다.

SAM 2 로 마스크 생성
각 마스크 영역을 잘라서 CLIP 또는 SigLIP 으로 분류

이게 2026년의 "zero-shot 세그멘테이션" 표준 레시피다. 라벨 데이터 없이도 동작한다.

HQ-SAM, MobileSAM, EfficientSAM 의 자리

HQ-SAM — 더 정밀한 경계, 의료 · 위성 영상
MobileSAM / EfficientSAM — 엣지 디바이스용 경량
SAM 2.1 — 비디오 추적 정밀도 강화 버전

6장 · VLM — 자연어로 이미지를 다루는 시대

Vision-Language Model 은 "이미지를 LLM 의 입력 토큰으로 변환" 한다. 사용자는 자연어로 질문하고, 모델은 자연어로 답한다.

2026년의 주요 VLM

모델	강점	약점
Claude Sonnet/Opus Vision	차트 · 다이어그램 · 문서 OCR · 추론	API only, 가격
Gemini 2.5 Pro Vision	긴 비디오, 멀티 이미지, 다국어 OCR	API only
GPT-5 Vision	일반 추론, 코드와 결합	API only
Qwen2.5-VL 72B / 7B	오픈웨이트, GUI 이해, 비디오	자체 호스팅 필요
LLaVA-OneVision / LLaVA-1.6	학습 레시피 공개, 연구 친화	최신 모델 대비 약함
InternVL3 78B / 8B	멀티이미지, 문서, 오픈웨이트 강자	VRAM 요구 큼
Molmo	포인팅 능력, 데이터 투명	정확도 평균
Pixtral 12B	Mistral 의 오픈 VLM	OCR 약점

VLM 을 "프롬프트만 잘 짜서" 쓸 때

라벨 데이터가 거의 없거나, 클래스가 자주 바뀌거나, "왜 그렇게 판단했는가" 설명이 필요한 경우 — VLM 에 사진과 시스템 프롬프트만 던지는 게 ROI가 가장 높다.

import anthropic
import base64

client = anthropic.Anthropic()

with open('part.jpg', 'rb') as f:
    img_b64 = base64.standard_b64encode(f.read()).decode()

resp = client.messages.create(
    model='claude-opus-4-7',
    max_tokens=512,
    system=(
        '당신은 자동차 부품 검사관이다. 사진을 보고 다음 JSON 으로만 답하라. '
        'schema: {"defect": one of [scratch, dent, discoloration, missing_screw, none], '
        '"severity": one of [low, medium, high], "reasoning": short string}'
    ),
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'image', 'source': {'type': 'base64', 'media_type': 'image/jpeg', 'data': img_b64}},
            {'type': 'text', 'text': '이 부품을 검사하라.'},
        ],
    }],
)
print(resp.content[0].text)

VLM 의 한계 — 무조건 좋지 않다

비용 — 사진 한 장에 0.01~0.05달러. 하루 30만 장이면 월 9만~45만 달러. 자체 모델 학습 비용과 비교해 보자.
지연 — 200ms~2s. 30fps 엣지에서 못 쓴다.
비결정성 — 같은 사진에 다른 답이 가끔 나온다. 임계값 기반 의사결정에는 보정 필요.
JSON 안 지키기 — response_format 또는 도구 호출(tool use) 로 강제해야 한다.
데이터 거버넌스 — 사진을 외부 API로 보낼 수 있는가? 법무팀과 먼저 이야기하자.

7장 · 어떤 태스크 → 어떤 아키텍처 매트릭스

위 6장을 한 표로 압축한다. 첫 모델을 고를 때 이 표만 봐도 80%는 맞다.

태스크	데이터 1k 이하	데이터 10k~100k	데이터 100k 이상	라벨 거의 없음
이미지 분류	CLIP zero-shot 또는 ConvNeXt-T 파인튜닝	ConvNeXt-Base, ViT-B	EVA-02, DINOv2-L 사전학습 후 헤드	CLIP/SigLIP zero-shot
객체 탐지	YOLO11n + 강한 증강	YOLO11x 또는 RT-DETR	DETR 변종 + 자체 백본 사전학습	Grounding DINO, OWLv2
세그멘테이션	SAM 2 + CLIP 라벨	SAM 2 파인튜닝 또는 Mask2Former	Mask2Former + Swin v2	SAM 2 zero-shot
OCR	TrOCR 파인튜닝	TrOCR 또는 PaddleOCR	자체 학습 + 데이터 합성	Claude/Gemini Vision
캡셔닝	BLIP-2 프롬프트	BLIP-2 또는 LLaVA 파인튜닝	InternVL3 파인튜닝	VLM 직접 호출
시각 QA	VLM API 직접 호출	LLaVA-OneVision LoRA	Qwen2.5-VL 72B 파인튜닝	VLM API 직접 호출
변칙 탐지	PatchCore, PaDiM	EfficientAD	자체 학습 + 합성 결함	DINOv2 임베딩 + kNN

규칙 한 줄: 데이터가 적으면 사전학습된 거대 백본 + 작은 헤드. 데이터가 많으면 자체 학습. 라벨이 없으면 VLM 또는 임베딩.

8장 · 데이터 — 얼마나, 어떻게, 어디서

"모델보다 데이터가 더 중요하다"는 클리셰는 2026년에도 사실이다.

필요한 데이터 양 (대략)

태스크	클래스당 최소	권장	충분
분류 (강한 사전학습 사용)	50	500	5,000
객체 탐지	200 박스	2,000 박스	20,000 박스
세그멘테이션 (SAM 2 사용 시 줄어듦)	50 마스크	500 마스크	5,000 마스크
OCR (라인 단위)	1,000 라인	10,000 라인	100,000 라인
VLM 파인튜닝 (LoRA)	200 예시	2,000 예시	20,000 예시

공개 데이터셋 — 2026년에도 살아있는 것들

ImageNet-22k / -1k — 분류 사전학습의 변하지 않는 기준
COCO 2017 — 탐지 · 키포인트 · 캡션, 여전히 벤치마크 표준
Open Images V7 — 9M+ 이미지, 약한 라벨 포함
LAION-5B / DataComp — CLIP 류 학습용 대규모 이미지-텍스트 쌍 (저작권 검토 필수)
LVIS — 1200+ 클래스, long-tail 탐지
ADE20K, Cityscapes, Mapillary — 세그멘테이션
DocVQA, ChartQA, InfographicVQA — 문서/차트 VQA
OpenX-Embodiment, Ego4D — 로봇·1인칭 비디오
SA-1B — SAM 학습 데이터, 1.1B 마스크

라벨링 도구 — 2026년 실용 비교

도구	강점	약점	가격
Label Studio	오픈소스, 모든 태스크	UI 무거움	무료 / Enterprise 유료
CVAT	비디오 탐지·세그 최강, 오픈소스	호스팅 부담	무료 / Cloud 유료
Roboflow	빠른 시작, 자동 라벨링, SAM 통합	클라우드 의존	프리/팀/엔터프라이즈
V7 Darwin	의료·복잡한 워크플로	가격	유료
Encord	비디오·LLM/VLM 평가	가격	유료
Scale AI / Surge AI	휴먼 어노테이션 외주	단가	시간당 또는 라벨당

2026년의 라벨링 비밀: 1차로 SAM 2 + Grounding DINO 로 자동 라벨링 → Roboflow/CVAT 에서 사람이 교정. 라벨링 시간이 5~10배 줄어든다.

9장 · 학습 vs 파인튜닝 vs 프롬프팅 — 비용 모델

전략	데이터 요구	학습 비용	추론 비용	변경 비용	정확도 상한
처음부터 학습	100M+ 이미지	$100k+	가장 낮음	매우 큼	매우 높음
파인튜닝(전체)	10k~1M	$100~$10k	낮음	작음	높음
파인튜닝(LoRA/QLoRA)	200~50k	$10~$1k	낮음	매우 작음	중간~높음
프롬프팅(VLM)	0~수십 예시	$0	가장 높음	0	모델 한계만큼
임베딩 + kNN	50~5k	$0~$100	낮음	작음	중간

규칙: 데이터가 늘수록 학습 비용이 합리화된다. 추론 트래픽이 늘수록 자체 모델이 합리화된다. 두 축을 같이 봐야 한다.

LoRA / QLoRA — VLM 파인튜닝의 실전 표준

VLM 전체를 풀파인튜닝하는 건 80GB VRAM x 4장 시대도 빡빡하다. LoRA 가 답이다.

from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
import torch

model_id = 'Qwen/Qwen2.5-VL-7B-Instruct'
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map='auto',
)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # 보통 0.1~1% 만 학습

cfg = SFTConfig(
    output_dir='runs/qwen-vl-defect',
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    lr_scheduler_type='cosine',
    warmup_ratio=0.03,
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=20,
    save_strategy='epoch',
)

# train_ds 는 별도로 준비했다고 가정: {"image": PIL.Image, "messages": [...]} 의 시퀀스
trainer = SFTTrainer(model=model, args=cfg, train_dataset=train_ds, processing_class=processor)
trainer.train()

QLoRA(4bit 양자화 + LoRA)면 24GB VRAM 한 장으로 7B VLM 파인튜닝이 가능하다. 70B 급은 80GB 한 장.

10장 · 배포 — ONNX, TensorRT, Core ML, TFLite

학습은 끝의 시작이다. 배포 단계에서 잘못된 선택을 하면 추론 비용이 10배 든다.

배포 타깃별 추천

타깃	추천 포맷	비고
NVIDIA GPU 서버	TensorRT, 또는 ONNX + TensorRT EP	FP16/INT8 양자화, dynamic batch
CPU 서버	ONNX Runtime, OpenVINO	INT8 양자화 필수, AVX-512 활용
Jetson (엣지 GPU)	TensorRT	모델별 엔진 빌드, JetPack 버전 매칭
iOS	Core ML	`coremltools` 로 변환, ANE 활용
Android	TFLite, LiteRT, ONNX Runtime Mobile	NNAPI 또는 GPU delegate
웹 브라우저	ONNX Runtime Web, WebGPU	모델 크기 1~50MB
로컬 데스크톱 LLM/VLM	llama.cpp (GGUF), Ollama, MLX	Apple Silicon 강력

PyTorch → ONNX → TensorRT 흐름

from ultralytics import YOLO

# 예시 경로. Ultralytics 는 기본적으로 <project>/<name>/weights/best.pt 에
# 가중치를 저장하므로 실제 학습 결과 경로에 맞춘다.
model = YOLO('runs/factory/best.pt')

# 1) ONNX
model.export(format='onnx', dynamic=True, simplify=True, opset=17)

# 2) TensorRT (NVIDIA GPU) — Ultralytics가 직접 지원
model.export(format='engine', half=True, workspace=4)

수동으로 한다면 trtexec --onnx=model.onnx --saveEngine=model.engine --fp16 가 가장 간단하다. INT8 양자화는 calibration dataset 이 필요하다.

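Core ML / TFLite — mobile

# Core ML: `traced_model` is assumed to be a torch.jit.trace'd module prepared elsewhere
import coremltools as ct
mlmodel = ct.convert(traced_model, inputs=[ct.ImageType(shape=(1, 3, 224, 224))])
mlmodel.save('Model.mlpackage')

# TFLite (ai-edge-torch is the recommended PyTorch-to-TFLite path)
# `model` is a torch.nn.Module and `sample_inputs` a tuple of example tensors, both assumed
import ai_edge_torch
edge_model = ai_edge_torch.convert(model.eval(), sample_inputs)
edge_model.export('model.tflite')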

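11. Failure Modes — The Real Problems You Meet in the Field

Symptom	Cause	Fix
99% validation accuracy, 70% in production	Distribution shift (domain shift)	Train further on production samples, or apply domain adaptation
Only one class is predicted well	Class imbalance	Focal loss, class-balanced sampler, oversampling
Inference memory blows up	Dynamic shapes, trained only at batch size 1	Export with dynamic axes, fix the maximum batch size
Different results for the same photo	Nondeterminism, half precision	Fix seeds, verify in FP32, set `torch.backends.cudnn.deterministic`
Small objects are missed	Input resolution too low, anchor mismatch	Raise input resolution, slice-and-merge (SAHI)
VLM ignores the JSON format	Weak prompt	Force tool use, or `response_format=json_schema`
Label noise	Low annotation quality	Measure inter-annotator agreement (IAA), confident learning
Poor generalization after training	Data leakage, validation set overlaps the training set	Hash-based split, time-based split
GPU utilization at 30%	Data loader bottleneck	Increase `num_workers`, enable `persistent_workers`, NVIDIA DALI
Training diverges	Learning rate too high, exploding gradients	Longer warmup, gradient clipping, turn off mixed precision to debug

Field saying: "If accuracy is not there, don't swap the model; look at the data again. Four times out of five, the data is the problem."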
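The "hash-based split" fix in the table can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the function name `assign_split`, the 10% validation share, and the sample IDs are all hypothetical. The idea is that the split depends only on a stable key, so the same sample can never land in both sets across reruns.

```python
import hashlib


def assign_split(sample_id, val_percent=10):
    """Deterministically map a stable sample key to 'train' or 'val'."""
    digest = hashlib.md5(sample_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 99]
    return "val" if bucket < val_percent else "train"


# Hypothetical keys; in practice hash a group key (camera session, part serial)
# so near-duplicate frames of the same part stay on one side of the split.
for key in ["line3/part_000187", "line3/part_000188", "line7/part_000012"]:
    print(key, assign_split(key))
```

Hashing a group key rather than the file name is the design choice that matters here: consecutive frames of one part are near-duplicates, and splitting them across train and validation is exactly the leakage the table warns about.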

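12. The Decision Tree — Deciding in 30 Seconds

When a new vision problem arrives, ask these questions in order.

"Can a single VLM call solve this?" — Hand-feed 10 photos to Claude/Gemini Vision. If you get 90% accuracy, that is your baseline.
"Do you have labeled data?" — If not, use a VLM or zero-shot models (CLIP, Grounding DINO, SAM 2).
"Do you have 10k or more labels?" — Yes: train your own model. No: pretrained backbone plus fine-tuning.
"Does it have to run at the edge?" — Yes: a CNN or a small YOLO. No: ViTs are fair game.
"Do you need 30 fps or more?" — Yes: YOLO with TensorRT/Core ML. No: DETR-class models are fine too.
"Do you need an explanation of why it decided that?" — Yes: a VLM or attention visualization. No: an ordinary model.
"Is an error expensive?" (medical, autonomous driving) — Yes: ensembles, calibration, human-in-the-loop.

These seven questions settle 80% of the first model choice.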
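The seven questions can be written down as a small function, which makes the order and the thresholds explicit. This is only a sketch: the function name, argument names, and output strings are hypothetical, and the 5,000-label figure in the usage example is an assumed value for the factory ticket from the prologue, which does not state a label count.

```python
def first_model_plan(vlm_hits_90, has_labels, n_labels=0, edge=False,
                     needs_30fps=False, needs_explanation=False,
                     high_error_cost=False):
    """Walk the seven questions in order and collect the resulting advice."""
    plan = []
    if vlm_hits_90:
        plan.append("baseline: VLM prompt")
    if not has_labels:
        plan.append("model: VLM or zero-shot (CLIP, Grounding DINO, SAM 2)")
    elif n_labels >= 10_000:
        plan.append("training: train your own model")
    else:
        plan.append("training: pretrained backbone + fine-tuning")
    plan.append("architecture: CNN or small YOLO" if edge else "architecture: ViT allowed")
    plan.append("runtime: YOLO + TensorRT/Core ML" if needs_30fps else "runtime: DETR-class is OK")
    if needs_explanation:
        plan.append("explainability: VLM or attention visualization")
    if high_error_cost:
        plan.append("safety: ensemble, calibration, human-in-the-loop")
    return plan


# Factory ticket from the prologue, with an assumed 5,000 labeled images.
for step in first_model_plan(vlm_hits_90=False, has_labels=True, n_labels=5_000,
                             edge=True, needs_30fps=True):
    print(step)
```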

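Epilogue — Checklist, Anti-Patterns, and What's Next

Checklist before building your first vision model

Have you looked at the label distribution? (You need to know about class imbalance before training starts.)
Is the validation set free of overlap with the training set in time and source?
Do you have a baseline? (The simplest model, human accuracy, or always predicting the majority class.)
Have you measured the accuracy and cost of a one-line VLM call?
Have you compared training cost and inference cost side by side?
Do you know the memory and latency limits of the deployment target?
Have you measured inter-annotator agreement (IAA)?
Have you built a separate validation set of 100 defect cases (corner cases)?
Monitoring — how will you detect distribution shift in production?
Rollback — can you return to the previous version when the model breaks?

Common anti-patterns

Starting with "what's the latest SOTA?" — look at the data first.
Starting from a ViT without pretraining — below 10k images it almost always fails.
Tuning while repeatedly checking the validation set during training — it effectively becomes a training set. You need a separate test set.
Shipping on mAP alone — also check per-class PR curves, small-object metrics, and inference latency.
Trusting VLM output without post-processing — add JSON schema validation, a fallback, and caching.
Trying to catch every class with a single model — split frequent classes from the long tail.
Thinking about data augmentation later than the model — augmentation often moves accuracy more than the model does.
Mismatched transforms between training and evaluation code — the number one sink of debugging time.

What's next

"Operating vision models — distribution shift detection, A/B tests, canaries, and the active learning loop"
"Automating labeling with VLMs — a pipeline that automates 90% of annotation with Claude/Gemini/Qwen-VL"
"Edge vision inference — running the same model on Jetson, Coral, iPhone, and Android"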

The 2026 Vision Model Development & Fine-Tuning Guide — CNN, ViT, DETR, SAM 2, VLMs and a Real Decision Tree

Prologue — The Real Question a 2026 Vision Engineer Faces

A one-line ticket on a Monday morning in May 2026:

"Mark defective parts in our factory line camera feed. 12 classes, 300k images/day, accuracy 95% or better, runs on a Jetson Orin Nano at 30 fps."

In 2018 the answer would have been obvious — ResNet50 backbone, RetinaNet head, COCO pretraining, fine-tune on your data. Done.

In 2026 there are roughly eight answers.

Just fine-tune YOLOv11/12.
Use RT-DETR for transformer-based detection.
Run SAM 2 for masks and stack a classifier on top.
Prompt a vision foundation model like Florence-2.
Send the photo to Gemini 2.5 Vision or Claude Vision and parse the natural-language result.
Extract CLIP embeddings and do kNN classification.
Use OWLv2 / Grounding DINO for text-prompted zero-shot detection.
Pipeline two or three of the above.

This guide is about when, why, and how to choose among those eight. Not "the latest SOTA wins" — a decision tree across data size, accuracy target, latency budget, and operating cost.

The core skill of a 2026 vision engineer is no longer "train a model." It is "decide which model to train, and whether to train at all."

1. Architecture Families — From CNN to VLM at a Glance

Every vision model is "image to something." The "something" decides the family.

Family	Representative models	First appeared	Input handling	Strength	Weakness
CNN	ResNet, EfficientNet, ConvNeXt v2	2012~	Convolution + pooling	Small data, fast	Weak global context
ViT	ViT, DeiT, Swin v2, EVA-02	2020~	Patches + self-attention	Strongest when data is plentiful	Weak with small data
DETR family	DETR, Deformable DETR, RT-DETR	2020~	Encoder-decoder + queries	NMS-free detection	Slow convergence
SAM family	SAM, SAM 2, HQ-SAM	2023~	ViT backbone + mask decoder	Promptable segmentation	No semantic labels
VLM	LLaVA-1.6, Qwen2.5-VL, Gemini Vision, Claude Vision, GPT-4V	2023~	Image encoder + LLM	Natural-language reasoning, OCR, VQA	Expensive, slow, non-deterministic
Multimodal foundation	Florence-2, InternVL3, DINOv2	2023~	Unified ViT, multi-task heads	Zero-shot, few-shot	Fine-tuning is non-trivial

Memorize this table. The next eight chapters expand each row.

Core principle: every vision model ultimately "sees the image as a sequence of tokens" — a CNN as a spatial grid, a ViT as a patch sequence, SAM together with prompts, a VLM as input tokens to an LLM. Representation defines the model.

2. CNNs Are Not Dead — Where They Win in 2026

Despite marketing claims that ViT ate everything, CNNs are very much alive in 2026. Especially in these situations.

Pick a CNN when

Data is small — under 10k labeled images. ViT without pretraining barely learns.
You deploy at the edge — Jetson, Coral, mobile. A ConvNeXt-Tiny typically runs faster than a ViT-Tiny at the same FLOPs.
Latency is brutal — sub-millisecond. A small CNN can hit 0.5 ms on a GPU.
Resolution is huge — 4K medical imaging. ViT patch counts explode.

One-line model loading with timm

PyTorch Image Models (timm) remains the de-facto standard for vision backbones in 2026. Over 1,000 pretrained backbones, one line.

import timm
import torch

# ConvNeXt v2 large, ImageNet-22k pretrain, 22k-1k fine-tune
model = timm.create_model(
    'convnextv2_large.fcmae_ft_in22k_in1k_384',
    pretrained=True,
    num_classes=12,  # our task's class count
)

# The model tells you the input transform it expects
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=True)

# Tensor shape including the batch dim is `B x C x H x W`
x = torch.randn(2, 3, 384, 384)
logits = model(x)  # shape: 2 x 12

Use timm.list_models('convnext*', pretrained=True) to see candidates. The pretrained_cfg contains mean, std, input_size, and crop_pct, so transforms stay consistent.

CNN training recipe — making small data work

Three stages: start from pretrained weights, train only the head, then fine-tune the whole network.
Mixup, CutMix, RandAugment — almost mandatory below 10k labels.
EMA (Exponential Moving Average) — 1–2pp of validation accuracy for free.
Cosine schedule with a short warmup — OneCycleLR works too.
AdamW with weight decay 0.05 — forget the old SGD.
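Two items in the recipe above, the cosine schedule with warmup and EMA, are short enough to write out. The sketch below is framework-free and only illustrative: the function names, the decay of 0.999, and the step counts are assumptions, and in a real run you would normally use your training framework's scheduler and an EMA utility.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, base_lr):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def ema_update(ema_weights, weights, decay=0.999):
    """One EMA step per parameter: ema = decay * ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


schedule = [lr_at_step(s, total_steps=100, warmup_steps=10, base_lr=1e-3)
            for s in range(100)]
print(f"peak lr {max(schedule):.4g}, final lr {schedule[-1]:.4g}")

ema = [0.0, 0.0]
for current in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
    ema = ema_update(ema, current, decay=0.9)
print(ema)  # drifts toward the current weights a little each step
```

The EMA copy is the one you evaluate and ship; the raw weights keep training as usual.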

3. ViT — The Champion When Data Is Plentiful

A ViT slices an image into patches like 16x16, treats them as a sequence, and stacks a Transformer on top. The key insight was "a model with less inductive bias beats a CNN if you give it enough data."
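The patch arithmetic is worth doing once by hand, because it is also the reason high-resolution inputs hurt. A small sketch, assuming non-overlapping square patches and full (global) self-attention; the helper names are hypothetical:

```python
def num_patches(height, width, patch_size=16):
    """Number of non-overlapping patches (tokens) for one image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


def attention_pairs(n_tokens):
    """Token pairs scored by one global self-attention layer (quadratic)."""
    return n_tokens * n_tokens


small = num_patches(224, 224)    # 14 x 14 grid of patches
large = num_patches(2160, 3840)  # 4K frame: 135 x 240 grid of patches
print(small, large)
print("attention cost ratio:", attention_pairs(large) / attention_pairs(small))
```

Windowed designs such as Swin avoid the quadratic blow-up by restricting attention to local windows, which is why they appear in the high-resolution rows of this guide.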

ViT variants worth knowing in 2026

Swin Transformer v2 — window attention, efficient at high resolution.
DeiT III — data-efficient training recipe.
EVA-02 — masked image modeling pretraining; strong transfer at modest size (the largest model is about 300M parameters).
DINOv2 — self-supervised, a powerful backbone trained without labels.
SigLIP / SigLIP 2 — contrastive learning, strong image-text embeddings.

When to pick a ViT

Condition	Recommendation
100k+ labeled images	ViT or Swin
Few labels but 1M+ images for self-supervised pretraining	DINOv2 pretrain plus head-only fine-tuning
OCR, text-heavy imagery	SigLIP 2 or ViT-L patch 14
Multilingual OCR, table understanding	InternVL3's ViT backbone

One-line ViT classifier with Hugging Face transformers

from transformers import AutoImageProcessor, AutoModelForImageClassification
import torch
from PIL import Image

processor = AutoImageProcessor.from_pretrained('facebook/dinov2-large')
model = AutoModelForImageClassification.from_pretrained(
    'facebook/dinov2-large',
    num_labels=12,
    ignore_mismatched_sizes=True,  # reinitialize the head
)

img = Image.open('part.jpg').convert('RGB')
inputs = processor(images=img, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
pred = outputs.logits.argmax(-1).item()

Swap facebook/dinov2-large for microsoft/swinv2-large-patch4-window12-192-22k and the API is identical.

4. Object Detection — YOLO vs DETR vs RT-DETR

Detection solves "what is where" simultaneously. As of 2026 the practical space splits in two.

The YOLO family — dominant in practice

The family runs from Ultralytics YOLOv8 through YOLOv12, plus variants such as YOLO-NAS, YOLOv9, and YOLOv10. For speed and deployment ergonomics, YOLO is still first.

from ultralytics import YOLO

# Load a pretrained model
model = YOLO('yolo11x.pt')

# Fine-tune on your dataset
results = model.train(
    data='factory_parts.yaml',  # train/val paths plus class names
    epochs=100,
    imgsz=640,
    batch=32,
    device=0,
    optimizer='AdamW',
    lr0=0.001,
    cos_lr=True,
    patience=20,  # early stopping
    project='runs/factory',
)

# Export to ONNX (edge deployment)
model.export(format='onnx', dynamic=True, simplify=True)
# Export to TensorRT (Jetson)
model.export(format='engine', half=True)

factory_parts.yaml:

path: /data/factory
train: images/train
val: images/val
names:
  0: scratch
  1: dent
  2: discoloration
  3: missing_screw

YOLO's weaknesses: most variants rely on NMS post-processing (YOLOv10 is an NMS-free exception), which struggles on dense or tiny objects, and global context is weaker than in DETR-class models.
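The weakness on dense scenes follows from how greedy NMS works. Below is a minimal pure-Python sketch, assuming boxes in (x1, y1, x2, y2) format and an illustrative IoU threshold of 0.5; production code would use a vectorized routine from the detection library rather than this loop.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop any box overlapping a kept one too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the second box overlaps the first and is suppressed
```

If boxes 0 and 1 were two genuinely different parts sitting next to each other, NMS would still drop one of them. That is the failure mode on dense scenes.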

The DETR family — NMS-free transformer detection

DETR removed NMS by using object queries and Hungarian matching. In 2026 the practical choice is RT-DETR.

from transformers import RTDetrV2ForObjectDetection, RTDetrImageProcessor

processor = RTDetrImageProcessor.from_pretrained('PekingU/rtdetr_v2_r50vd')
model = RTDetrV2ForObjectDetection.from_pretrained(
    'PekingU/rtdetr_v2_r50vd',
    num_labels=12,
    ignore_mismatched_sizes=True,
)
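The Hungarian matching mentioned above assigns each ground-truth object to exactly one query so that the total cost is minimal. The sketch below finds the same optimal assignment by brute force over permutations, which is only practical for a toy example; it is not the Hungarian algorithm itself, and the cost values are made up for illustration.

```python
from itertools import permutations


def match_queries(cost):
    """Min-cost one-to-one assignment of ground-truth rows to query columns.

    cost[g][q] is the cost of assigning ground-truth object g to query q
    (in DETR, a mix of class and box terms). Brute force, toy sizes only.
    """
    n_gt, n_queries = len(cost), len(cost[0])
    best_total, best_assignment = float("inf"), None
    for assignment in permutations(range(n_queries), n_gt):
        total = sum(cost[g][q] for g, q in enumerate(assignment))
        if total < best_total:
            best_total, best_assignment = total, assignment
    return best_assignment, best_total


# Two ground-truth objects, three queries (hypothetical costs).
cost = [
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.7],
]
assignment, total = match_queries(cost)
print(assignment, round(total, 3))  # object 0 -> query 1, object 1 -> query 0
```

Queries left unmatched (query 2 here) are trained to predict "no object", which is what lets the model skip NMS at inference time.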

YOLO vs DETR — the decision

Situation	Recommendation
Edge inference at 30 fps or more	YOLO
COCO-ish object density	YOLO
Dense, tiny objects (satellite, medical)	RT-DETR or Co-DETR
Text-prompted detection of unseen classes	Grounding DINO / OWLv2
Video tracking too	YOLO plus ByteTrack, or SAM 2

When OpenMMLab fits

The mmdetection, mmsegmentation, and broader OpenMMLab ecosystem shines in research and experimentation. You can compare 50-plus detection models in one codebase and swap backbones with a config line. But the learning curve is steep, deployment is a separate toolchain, and for a production-first model Ultralytics ships much faster.

5. Segmentation and SAM 2

Segmentation is "where, per pixel." In 2026 SAM 2 is the starting point for almost every case.

SAM 2 — promptable segmentation for both images and video

Released by Meta in July 2024, SAM 2 takes a click, box, or mask prompt and "segments any object, then auto-tracks it across video frames." The core ideas:

Unified image + video — one model for both.
Memory attention — past-frame representations are stored to track objects over time.
Promptable — segmentation is driven by prompts (clicks, boxes, masks) rather than a fixed list of classes.

from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor
import numpy as np
from PIL import Image

sam2 = build_sam2('configs/sam2.1/sam2.1_hiera_l.yaml', 'sam2.1_hiera_large.pt')
predictor = SAM2ImagePredictor(sam2)

img = np.array(Image.open('part.jpg').convert('RGB'))  # SAM 2 expects an RGB array
predictor.set_image(img)

# One click to a mask
point_coords = np.array([[450, 320]])
point_labels = np.array([1])  # 1 = foreground
masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,
)

How to actually use SAM 2 — "SAM segments, something else labels"

SAM 2 does not tell you what something is. It says "this mask is object 1, that mask is object 2." For semantic labels you need a two-stage pipeline.

SAM 2 produces masks.
Crop each mask and classify with CLIP or SigLIP.

This is the 2026 standard recipe for "zero-shot segmentation." Works without any labeled data.
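The glue between the two stages is small: turn each mask into a tight box and crop the image before handing it to the classifier. Below is a dependency-free sketch using nested lists in place of arrays. The helper names are hypothetical, and a real pipeline would do the same with NumPy arrays from the SAM 2 predictor.

```python
def mask_to_bbox(mask):
    """Tight (x0, y0, x1, y1) box of a binary mask, end-exclusive; None if empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    if not rows or not cols:
        return None
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


def crop(image, bbox):
    """Cut the box out of a row-major image (list of rows)."""
    x0, y0, x1, y1 = bbox
    return [row[x0:x1] for row in image[y0:y1]]


mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
image = [[r * 4 + c for c in range(4)] for r in range(4)]  # stand-in pixel values
box = mask_to_bbox(mask)
print(box, crop(image, box))
```

Each crop then goes to CLIP or SigLIP with the candidate class names as text prompts. Padding the box slightly, or blanking pixels outside the mask, are common variations you would need to validate on your own data.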

Where HQ-SAM, MobileSAM, and EfficientSAM fit

HQ-SAM — sharper boundaries for medical and satellite imagery.
MobileSAM / EfficientSAM — lightweight for edge devices.
SAM 2.1 — improved video tracking precision.

6. VLMs — The Era of Talking to Images in Natural Language

A Vision-Language Model "turns images into input tokens for an LLM." Users ask in natural language; the model answers in natural language.

The major VLMs in 2026

Model	Strengths	Weaknesses
Claude Sonnet/Opus Vision	Charts, diagrams, document OCR, reasoning	API-only, price
Gemini 2.5 Pro Vision	Long video, multi-image, multilingual OCR	API-only
GPT-5 Vision	General reasoning, code integration	API-only
Qwen2.5-VL 72B / 7B	Open weights, GUI understanding, video	You host it
LLaVA-OneVision / LLaVA-1.6	Public recipe, research-friendly	Behind the frontier
InternVL3 78B / 8B	Multi-image, documents, open-weight leader	Heavy VRAM needs
Molmo	Strong pointing capability, transparent data	Average overall accuracy
Pixtral 12B	Mistral's open VLM	Weak OCR

Using a VLM "just by prompting"

When you barely have labels, when classes change weekly, or when "explain why" is required — sending a photo and a system prompt to a VLM gives the highest ROI.

import anthropic
import base64

client = anthropic.Anthropic()

with open('part.jpg', 'rb') as f:
    img_b64 = base64.standard_b64encode(f.read()).decode()

resp = client.messages.create(
    model='claude-opus-4-7',
    max_tokens=512,
    system=(
        'You are an automotive parts inspector. Look at the photo and answer ONLY '
        'with this JSON. schema: {"defect": one of [scratch, dent, discoloration, '
        'missing_screw, none], "severity": one of [low, medium, high], '
        '"reasoning": short string}'
    ),
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'image', 'source': {'type': 'base64', 'media_type': 'image/jpeg', 'data': img_b64}},
            {'type': 'text', 'text': 'Inspect this part.'},
        ],
    }],
)
print(resp.content[0].text)
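The reply in the example above is a raw string, so it needs validating before anything downstream trusts it. Here is a minimal sketch using only the standard library. The allowed values mirror the system prompt above; the function name and the choice to return `None` on any violation are assumptions, and a real pipeline would typically route a `None` to a retry or a human review queue.

```python
import json

ALLOWED_DEFECTS = {"scratch", "dent", "discoloration", "missing_screw", "none"}
ALLOWED_SEVERITIES = {"low", "medium", "high"}


def parse_inspection(reply_text):
    """Return the validated dict, or None if the reply breaks the schema."""
    try:
        data = json.loads(reply_text)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, dict):
        return None
    defect, severity = data.get("defect"), data.get("severity")
    if not isinstance(defect, str) or defect not in ALLOWED_DEFECTS:
        return None
    if not isinstance(severity, str) or severity not in ALLOWED_SEVERITIES:
        return None
    if not isinstance(data.get("reasoning"), str):
        return None
    return data


good = '{"defect": "dent", "severity": "low", "reasoning": "shallow dent near edge"}'
print(parse_inspection(good))
print(parse_inspection("The part looks fine to me."))  # not JSON, so None
```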

VLM limits — they are not always the right answer

Cost — 1 to 5 cents per image. 300k images/day means 90k–450k USD/month. Compare that to training a custom model.
Latency — 200 ms to 2 s. Useless at 30 fps on edge.
Non-determinism — slightly different answers on the same photo. Threshold-based decisions need calibration.
Ignoring JSON — must be enforced via response_format or tool use.
Data governance — can the photo legally go to an external API? Loop in legal before you ship.
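The cost bullet is simple arithmetic, and writing it down makes it easy to rerun with your own volume. The per-image prices below are the article's rough range, not quoted API prices, and the 30-day month is an assumption.

```python
def monthly_api_cost_usd(images_per_day, cents_per_image, days_per_month=30):
    """Monthly API spend in USD for a flat per-image price given in cents."""
    return images_per_day * cents_per_image * days_per_month / 100


low = monthly_api_cost_usd(300_000, cents_per_image=1)
high = monthly_api_cost_usd(300_000, cents_per_image=5)
print(f"${low:,.0f} to ${high:,.0f} per month")
```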

7. Which Task to Which Architecture — The Matrix

The six chapters above, compressed into one table. When picking the first model, this table alone gets you 80% of the way there.

Task	Under 1k data	10k–100k data	100k+ data	Almost no labels
Image classification	CLIP zero-shot or ConvNeXt-T fine-tune	ConvNeXt-Base, ViT-B	EVA-02, DINOv2-L pretrain plus head	CLIP/SigLIP zero-shot
Object detection	YOLO11n plus heavy augmentation	YOLO11x or RT-DETR	DETR variants plus custom backbone pretrain	Grounding DINO, OWLv2
Segmentation	SAM 2 plus CLIP labeling	SAM 2 fine-tune or Mask2Former	Mask2Former plus Swin v2	SAM 2 zero-shot
OCR	TrOCR fine-tune	TrOCR or PaddleOCR	Custom training plus synthetic data	Claude/Gemini Vision
Captioning	BLIP-2 prompt	BLIP-2 or LLaVA fine-tune	InternVL3 fine-tune	VLM direct call
Visual QA	Direct VLM API call	LLaVA-OneVision LoRA	Qwen2.5-VL 72B fine-tune	Direct VLM API call
Anomaly detection	PatchCore, PaDiM	EfficientAD	Custom training plus synthetic defects	DINOv2 embeddings plus kNN

One-line rule: Small data, large pretrained backbone plus a small head. Large data, custom training. No labels, VLM or embeddings.

8. Data — How Much, How, From Where

The cliche "data matters more than the model" is still true in 2026.

How much data you actually need (rough numbers)

Task	Minimum per class	Recommended	Comfortable
Classification (strong pretraining)	50	500	5,000
Object detection	200 boxes	2,000 boxes	20,000 boxes
Segmentation (less with SAM 2)	50 masks	500 masks	5,000 masks
OCR (line-level)	1,000 lines	10,000 lines	100,000 lines
VLM fine-tuning (LoRA)	200 examples	2,000 examples	20,000 examples

Public datasets still alive in 2026

ImageNet-22k / 1k — the unchanging benchmark for classification pretraining.
COCO 2017 — detection, keypoints, captions; still the standard.
Open Images V7 — 9M+ images with weak labels.
LAION-5B / DataComp — large image-text pairs for CLIP-style training (check copyright).
LVIS — 1,200+ classes, long-tail detection.
ADE20K, Cityscapes, Mapillary — segmentation.
DocVQA, ChartQA, InfographicVQA — document and chart VQA.
OpenX-Embodiment, Ego4D — robotics and first-person video.
SA-1B — SAM training data, 1.1B masks.

Labeling tools — a practical 2026 comparison

Tool	Strengths	Weaknesses	Price
Label Studio	Open source, every task	UI feels heavy	Free / Enterprise paid
CVAT	Best for video detection and segmentation, open source	Self-hosting burden	Free / Cloud paid
Roboflow	Fast start, auto-labeling, SAM integration	Cloud-dependent	Free/Team/Enterprise tiers
V7 Darwin	Medical, complex workflows	Pricing	Paid
Encord	Video, LLM/VLM evaluation	Pricing	Paid
Scale AI / Surge AI	Outsourced human annotation	Per-hour or per-label cost	Service

The 2026 labeling secret: auto-label with SAM 2 plus Grounding DINO first, then have humans correct in Roboflow or CVAT. Labeling time drops 5–10x.

9. Train vs Fine-Tune vs Prompt — The Cost Model

Strategy	Data needed	Training cost	Inference cost	Switching cost	Accuracy ceiling
Train from scratch	100M+ images	$100k+	Lowest	Very high	Very high
Full fine-tune	10k–1M	$100–$10k	Low	Low	High
LoRA/QLoRA fine-tune	200–50k	$10–$1k	Low	Very low	Medium–high
VLM prompting	0 to dozens of examples	$0	Highest	$0	Whatever the model can do
Embeddings plus kNN	50–5k	$0–$100	Low	Low	Medium

Rule of thumb: as data grows, training becomes economical; as inference traffic grows, owning the model becomes economical. You have to balance both axes.
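The two axes meet at a break-even volume: the number of images after which a one-off training cost is paid back by cheaper inference. The sketch below uses hypothetical numbers: a $10k fine-tune, 1 cent per image through an API, and 0.01 cent per image on your own hardware. It ignores labeling and maintenance cost, so treat the result as a lower bound.

```python
import math


def break_even_images(fixed_cost_usd, api_cost_per_image, own_cost_per_image):
    """Images needed before owning the model is cheaper than calling an API."""
    saving = api_cost_per_image - own_cost_per_image
    if saving <= 0:
        return None  # the API is never more expensive per image
    return math.ceil(fixed_cost_usd / saving)


n = break_even_images(10_000, api_cost_per_image=0.01, own_cost_per_image=0.0001)
print(n, "images, or about", round(n / 300_000, 1), "days at 300k images/day")
```

At the article's 300k images per day the payback is a matter of days, while at a few hundred images per day the same fine-tune takes years to pay off. This is the point of looking at both axes together.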

LoRA / QLoRA — the practical standard for VLM fine-tuning

Full fine-tuning a VLM is painful even with 4x 80 GB VRAM. LoRA is the answer.

from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
import torch

model_id = 'Qwen/Qwen2.5-VL-7B-Instruct'
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map='auto',
)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # usually 0.1–1% trainable

cfg = SFTConfig(
    output_dir='runs/qwen-vl-defect',
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    lr_scheduler_type='cosine',
    warmup_ratio=0.03,
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=20,
    save_strategy='epoch',
)

# train_ds (defined elsewhere) is a sequence of {"image": PIL.Image, "messages": [...]}
trainer = SFTTrainer(model=model, args=cfg, train_dataset=train_ds, processing_class=processor)
trainer.train()

QLoRA (4-bit quantization plus LoRA) lets you fine-tune a 7B VLM on a single 24 GB GPU. A 70B-class model fits on a single 80 GB card.
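A back-of-the-envelope check makes these sizes plausible. The sketch below estimates memory for the frozen base weights only. It deliberately ignores activations, LoRA adapter weights, optimizer state, and quantization constants, so real usage is higher. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Base weights only; training overhead comes on top of these numbers.
for name, n_params in [("7B", 7e9), ("70B", 70e9)]:
    bf16 = weight_memory_gb(n_params, 16)
    nf4 = weight_memory_gb(n_params, 4)
    print(f"{name}: bf16 weights ~{bf16:.1f} GB, 4-bit weights ~{nf4:.1f} GB")
```

At 4 bits the 7B weights take roughly 3.5 GB and the 70B weights roughly 35 GB, which is what leaves room for activations and adapters on 24 GB and 80 GB cards respectively.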

10. Deployment — ONNX, TensorRT, Core ML, TFLite

Training is the end of the beginning. A bad deployment choice can multiply your inference cost tenfold.

Recommended formats by target

Target	Recommended format	Notes
NVIDIA GPU servers	TensorRT, or ONNX with TensorRT EP	FP16/INT8 quantization, dynamic batch
CPU servers	ONNX Runtime, OpenVINO	INT8 essential, lean on AVX-512
Jetson (edge GPU)	TensorRT	Per-model engine build, JetPack version must match
iOS	Core ML	Convert via `coremltools`, target ANE
Android	LiteRT (formerly TFLite), ONNX Runtime Mobile	GPU delegate, or NNAPI (deprecated as of Android 15)
Web browser	ONNX Runtime Web, WebGPU	1–50 MB model size
Local desktop LLM/VLM	llama.cpp (GGUF), Ollama, MLX	Apple Silicon excels

PyTorch to ONNX to TensorRT flow

from ultralytics import YOLO

model = YOLO('runs/factory/best.pt')

# 1) ONNX
model.export(format='onnx', dynamic=True, simplify=True, opset=17)

# 2) TensorRT (NVIDIA GPU) — directly supported by Ultralytics
model.export(format='engine', half=True, workspace=4)

For manual conversion, `trtexec --onnx=model.onnx --saveEngine=model.engine --fp16` is the simplest path. INT8 quantization needs a calibration dataset.

Core ML and TFLite — mobile

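# Core ML
import torch
import coremltools as ct

# `model` is assumed to be your trained torch.nn.Module (not the YOLO wrapper)
example = torch.rand(1, 3, 224, 224)  # dummy input used only for tracing
traced_model = torch.jit.trace(model.eval(), example)
mlmodel = ct.convert(
    traced_model,
    inputs=[ct.ImageType(shape=(1, 3, 224, 224))],
    convert_to='mlprogram',  # .mlpackage requires the ML Program format
)
mlmodel.save('Model.mlpackage')

# TFLite (PyTorch to TF, ai-edge-torch recommended)
import ai_edge_torch
sample_inputs = (example,)  # tuple of example tensors matching forward()
edge_model = ai_edge_torch.convert(model.eval(), sample_inputs)
edge_model.export('model.tflite')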

11. Failure Modes — The Real Problems You Hit in Production

Symptom	Root cause	Remedy
99% val accuracy, 70% in production	Domain shift	Add production samples or apply domain adaptation
Only one class predicted well	Class imbalance	Focal loss, class-balanced sampler, oversampling
Inference memory explodes	Dynamic shapes, batch-1 only training	Export with dynamic axes, fix a max batch
Different result on the same image	Non-determinism, half precision	Fix seeds, validate in FP32, set `torch.backends.cudnn.deterministic = True`
Small objects missed	Input too small, anchor mismatch	Increase resolution, slice-and-merge (SAHI)
VLM ignores JSON	Weak prompt	Force tool use or `response_format=json_schema`
Label noise	Poor annotation quality	Measure inter-annotator agreement (IAA), use confident learning
Fails to generalize after training	Data leak, val set overlaps train	Hash-based or time-based splits
GPU utilization stuck at 30%	Data loader bottleneck	More workers, persistent workers, NVIDIA DALI
Training diverges	LR too high, gradient explosion	Longer warmup, gradient clipping, disable mixed precision for debugging
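The "hash-based splits" remedy in the table is worth spelling out, because a random per-image split is the usual source of train/val leakage. A minimal sketch, assuming each image carries a group key such as a camera id plus capture date (the key format and the function name `assign_split` are hypothetical): hashing the group key sends every image from the same group to the same split, and the assignment stays stable when new data arrives.

```python
import hashlib


def assign_split(group_key: str, val_fraction: float = 0.1,
                 salt: str = "v1") -> str:
    """Deterministically map a group key to 'train' or 'val'.

    All images sharing a group_key land in the same split, so near-duplicate
    frames from one source cannot straddle train and val.
    """
    digest = hashlib.sha256(f"{salt}:{group_key}".encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash as a uniform number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "val" if bucket < val_fraction else "train"


# Example: every frame from the same camera-day shares one assignment.
frames = ["cam03/2026-05-04/0001.jpg", "cam03/2026-05-04/0002.jpg"]
splits = [assign_split(path.rsplit("/", 1)[0]) for path in frames]
assert splits[0] == splits[1]
```

Changing `salt` reshuffles the whole assignment, which is useful when you want a fresh split without touching the data.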

The field aphorism: "If accuracy isn't moving, don't swap the model — go look at the data. Four times out of five, it's the data."

12. The Decision Tree — Decide in 30 Seconds

When a new vision problem lands on your desk, ask these in order.

"Can a VLM solve this in one prompt?" — Hand-feed Claude/Gemini Vision ten photos. If you get 90% accuracy, that's your baseline.
"Do you have labels?" — No, then VLM or zero-shot (CLIP, Grounding DINO, SAM 2).
"Do you have 10k+ labels?" — Yes, train your own. No, pretrain plus fine-tune.
"Does it run at the edge?" — Yes, CNN or a small YOLO. No, ViT is fair game.
"30 fps or higher?" — Yes, YOLO plus TensorRT/Core ML. No, DETR-class is fine.
"Do you need to explain the decision?" — Yes, VLM or attention visualization. No, a normal model.
"Are errors expensive?" (medical, autonomous driving) — Yes, ensembles, calibration, human-in-the-loop.

Seven questions handle 80% of the first-model decision.
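The seven questions above can be sketched as a small function, which is handy for writing the decision down in a design doc or a ticket template. This is an illustrative encoding only: the `Problem` fields and the `recommend` function are hypothetical names, and the returned strings simply restate the answers in the list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Problem:
    vlm_one_prompt_ok: bool   # Q1: did a hand-fed VLM test reach ~90%?
    num_labels: int           # Q2 and Q3: how many labels exist
    runs_on_edge: bool        # Q4
    needs_30fps: bool         # Q5
    needs_explanation: bool   # Q6
    errors_expensive: bool    # Q7


def recommend(p: Problem) -> List[str]:
    """Walk the seven questions in order and collect the suggested choices."""
    notes: List[str] = []
    if p.vlm_one_prompt_ok:
        notes.append("baseline: one-prompt VLM")
    if p.num_labels == 0:
        notes.append("VLM or zero-shot (CLIP, Grounding DINO, SAM 2)")
    elif p.num_labels >= 10_000:
        notes.append("train your own model")
    else:
        notes.append("pretrained model plus fine-tune")
    notes.append("CNN or small YOLO" if p.runs_on_edge else "ViT is fair game")
    notes.append("YOLO plus TensorRT/Core ML" if p.needs_30fps
                 else "DETR-class is fine")
    if p.needs_explanation:
        notes.append("VLM or attention visualization")
    if p.errors_expensive:
        notes.append("ensembles, calibration, human-in-the-loop")
    return notes


# A factory-line case: some labels, edge device, 30 fps required.
factory = Problem(False, 5_000, True, True, False, False)
print(recommend(factory))
```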

Epilogue — Checklist, Anti-Patterns, Coming Next

Pre-launch checklist for your first vision model

Have you looked at label distribution? (Class imbalance must be known before training.)
Are train and val splits free of time- or source-overlap?
Do you have a baseline? (Simplest model, or human accuracy, or majority class.)
Have you measured one-prompt VLM accuracy and cost?
Have you compared training cost against inference cost?
Do you know the memory and latency limits of your deployment target?
Have you measured inter-annotator agreement (IAA)?
Is there a hand-curated set of 100 corner cases as a separate eval?
Monitoring — how do you detect production drift?
Rollback — can you go back to the previous model when this one breaks?
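For the inter-annotator agreement item, one common measure for two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain-Python version for image-level class labels. It is one possible IAA measure, not the only one, and returning 1.0 when both annotators use a single identical label everywhere is a convention chosen here for simplicity.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty set")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's own label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both used one identical label everywhere (convention)
    return (observed - expected) / (1 - expected)


annotator_1 = ["ok", "ok", "bad", "bad"]
annotator_2 = ["ok", "ok", "bad", "ok"]
print(cohens_kappa(annotator_1, annotator_2))
```

A low kappa on a sample of a few hundred images is a signal to tighten the labeling guidelines before spending more on annotation or training.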

Common anti-patterns

"What's the latest SOTA?" as the first question — go look at your data first.
Starting from a ViT with no pretraining — almost always fails below 10k labels.
Tuning while peeking at the val set repeatedly — that's training data. Hold out a real test set.
Shipping on mAP alone — also check per-class PR curves, small-object metrics, and inference latency.
Trusting VLM output without post-processing — JSON-schema validation, fallbacks, caching.
Forcing one model to handle every class — split frequent classes from the long tail.
Treating augmentation as an afterthought — often augmentation matters more than model choice.
Train-eval transform mismatch — the number-one debug time sink.
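The VLM post-processing anti-pattern is the easiest one to guard against in code. As a minimal sketch using only the standard library, assuming the prompt asks for a JSON object with a `label` and a `confidence` field (both field names and the `ALLOWED_LABELS` set are hypothetical), a validator can reject anything malformed so the caller can fall back to a retry, a default, or a human queue.

```python
import json
from typing import Any, Dict, Optional

# Hypothetical label set for illustration; replace with your own classes.
ALLOWED_LABELS = {"ok", "scratch", "dent"}


def parse_vlm_reply(text: str) -> Optional[Dict[str, Any]]:
    """Extract and validate a JSON object from a free-form VLM reply.

    Returns None on any problem so the caller can trigger a fallback.
    """
    # Models often wrap JSON in prose; keep only the outermost braces.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    label = data.get("label")
    conf = data.get("confidence")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        return None
    if not 0.0 <= conf <= 1.0:
        return None
    return {"label": label, "confidence": float(conf)}


reply = 'Sure! {"label": "dent", "confidence": 0.9}'
result = parse_vlm_reply(reply)
print(result if result is not None else "fallback: send to human review")
```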

Coming next

"Operating vision models — drift detection, A/B, canary, and active learning loops."
"Auto-labeling with VLMs — a Claude/Gemini/Qwen-VL pipeline that automates 90% of annotation."
"Edge vision inference — running the same model on Jetson, Coral, iPhone, and Android."

References

Architecture papers

ViT — "An Image is Worth 16x16 Words" — https://arxiv.org/abs/2010.11929
Swin Transformer v2 — https://arxiv.org/abs/2111.09883
ConvNeXt v2 — https://arxiv.org/abs/2301.00808
DINOv2 — https://arxiv.org/abs/2304.07193
DETR — "End-to-End Object Detection with Transformers" — https://arxiv.org/abs/2005.12872
RT-DETR — https://arxiv.org/abs/2304.08069
SAM — https://arxiv.org/abs/2304.02643
SAM 2 — https://arxiv.org/abs/2408.00714
LLaVA — https://arxiv.org/abs/2304.08485
Qwen2.5-VL — https://arxiv.org/abs/2502.13923
InternVL — https://arxiv.org/abs/2312.14238
Grounding DINO — https://arxiv.org/abs/2303.05499
Florence-2 — https://arxiv.org/abs/2311.06242
LoRA — https://arxiv.org/abs/2106.09685
QLoRA — https://arxiv.org/abs/2305.14314

Tools and libraries

timm (PyTorch Image Models) — https://github.com/huggingface/pytorch-image-models
Hugging Face transformers — https://huggingface.co/docs/transformers
Ultralytics YOLO — https://docs.ultralytics.com/
OpenMMLab MMDetection — https://github.com/open-mmlab/mmdetection
Segment Anything 2 — https://github.com/facebookresearch/sam2
PEFT (LoRA/QLoRA) — https://github.com/huggingface/peft
TRL (SFT) — https://github.com/huggingface/trl

Labeling tools

Label Studio — https://labelstud.io/
CVAT — https://www.cvat.ai/
Roboflow — https://roboflow.com/
V7 Darwin — https://www.v7labs.com/
Encord — https://encord.com/

Datasets

ImageNet — https://www.image-net.org/
COCO — https://cocodataset.org/
Open Images V7 — https://storage.googleapis.com/openimages/web/index.html
LAION — https://laion.ai/
LVIS — https://www.lvisdataset.org/
ADE20K — https://groups.csail.mit.edu/vision/datasets/ADE20K/
SA-1B — https://ai.meta.com/datasets/segment-anything/

Deployment / inference

ONNX Runtime — https://onnxruntime.ai/
NVIDIA TensorRT — https://developer.nvidia.com/tensorrt
Apple Core ML Tools — https://github.com/apple/coremltools
ai-edge-torch (PyTorch to TFLite) — https://github.com/google-ai-edge/ai-edge-torch
OpenVINO — https://docs.openvino.ai/

VLM APIs

Anthropic Claude Vision — https://docs.anthropic.com/en/docs/build-with-claude/vision
Google Gemini Vision — https://ai.google.dev/gemini-api/docs/vision
OpenAI Vision — https://platform.openai.com/docs/guides/vision