Prologue - Computer Vision Became "Asking" Instead of "Seeing"

Computer vision in the 2010s had a clear shape. SIFT, HOG, and Haar extracted features, SVMs and random forests classified them, and OpenCV tied it all together. The early 2020s belonged to ResNet, EfficientNet, and Mask R-CNN - 90% of the job was collecting datasets, training models, and squeezing out mAP.

The landscape in 2026 looks different. A single sentence like "estimate the pose of the person in a red helmet in this photo" maps to a three-line pipeline: Grounding DINO catches the box, SAM 2 makes the mask, MMPose extracts the keypoints. We barely train anything. Instead, we design "which model to call, in what order".

This article walks through the 2026 computer vision stack in one breath. From the basics of OpenCV through SAM 2 and VLMs to DINOv3, DUSt3R, and mobile inference - the criteria for choosing the right tool, packed into a single page.

Chapter 1 - The 2026 CV Stack Map

Before diving into individual tools, let me draw the overall map. The 2026 computer vision world splits into five layers.

[Layer 5] Vision Language Model (VLM)

GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash

Qwen2-VL, InternVL 2.5, Pixtral 12B

[Layer 4] Open-Vocabulary Foundation

Grounding DINO 1.6, Florence-2, YOLO-World

SAM 2, DINOv3, CLIP, SigLIP

[Layer 3] Task-Specific Model

YOLO v11, Detectron3, MMDetection

MMPose, DWPose, ByteTrack, Depth Anything v3

[Layer 2] Inference Runtime

ONNX Runtime, TensorRT, OpenVINO

CoreML, TFLite, NCNN, MNN

[Layer 1] Image I/O and Primitives

OpenCV 4.10, Pillow-SIMD

FFmpeg, GStreamer

The higher the layer, the more "intelligent" the system gets, but latency rises with it. **VLMs run at one or two frames per second, YOLO v11 runs at over 100 fps.** The job of a 2026 vision engineer is composing these two ends.

One-line summary: **"Ask the question with a VLM, draw the answer with YOLO."**
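
To make the trade-off concrete, here is a minimal sketch of the scheduling arithmetic. It assumes the slow VLM is called only every N-th frame and that calls are sequential (blocking); the function names and the cadence parameter are hypothetical, and a real system would more likely run the VLM asynchronously.

```python
def plan_calls(num_frames, vlm_every):
    """Label each frame with the models that run on it."""
    if vlm_every < 1:
        raise ValueError('vlm_every must be at least 1')
    plan = []
    for frame_index in range(num_frames):
        if frame_index % vlm_every == 0:
            plan.append('vlm+detector')  # refresh "what to look for"
        else:
            plan.append('detector')      # only "where it is"
    return plan


def effective_fps(detector_fps, vlm_fps, vlm_every):
    """Average throughput when every vlm_every-th frame also waits for the VLM."""
    detector_ms = 1000.0 / detector_fps
    vlm_ms = 1000.0 / vlm_fps
    average_ms = detector_ms + vlm_ms / vlm_every
    return 1000.0 / average_ms


plan = plan_calls(num_frames=6, vlm_every=3)
fps_sparse = effective_fps(detector_fps=100, vlm_fps=2, vlm_every=100)
fps_dense = effective_fps(detector_fps=100, vlm_fps=2, vlm_every=1)
```

With the figures quoted above (a 100 fps detector and a 2 fps VLM), asking the VLM once per 100 frames keeps the average near 67 fps, while asking on every frame drops it below 2 fps.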

Chapter 2 - OpenCV 4.10 / 5.x - Still the Starting Point for Everything

OpenCV did not die. It got even stronger in 2026. The reason is simple - reading images, cropping, converting color spaces, decoding video frames are all required by any deep learning pipeline.

As of May 2026 OpenCV 4.10 is the LTS, and 5.0 beta is in active development. Three key changes stand out.

First, **the DNN module became the default for ONNX inference.** You can call YOLO, ResNet, or ViT in one line via `cv2.dnn.readNetFromONNX()` without going through PyTorch or TensorFlow.

Second, **G-API (Graph API) is stable.** It expresses input-to-output as a graph and runs on OpenCL, CUDA, or Vulkan backends. Especially powerful on mobile and embedded.

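Third, **CUDA and OpenCL acceleration are part of the library.** The `cv2.cuda` module runs Gaussian blur, optical flow, and image warping directly on the GPU, provided OpenCV was built with CUDA support (the stock `opencv-python` pip wheels are CPU-only).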

import cv2

# 1) Read image - channel order is BGR, and imread returns None on failure
img = cv2.imread('input.jpg')
if img is None:
    raise FileNotFoundError('could not read input.jpg')
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # RGB copy for PIL/PyTorch code

# 2) Resize - INTER_AREA is best for shrinking
small = cv2.resize(img, (640, 640), interpolation=cv2.INTER_AREA)

# 3) DNN inference - load ONNX model directly
net = cv2.dnn.readNetFromONNX('yolov11n.onnx')
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)  # needs a CUDA-enabled OpenCV build
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
# swapRB=True converts BGR to RGB, so pass the BGR image here, not img_rgb
blob = cv2.dnn.blobFromImage(small, 1/255.0, (640, 640), swapRB=True)
net.setInput(blob)
outputs = net.forward()

Two things to remember: **OpenCV stores color images in BGR channel order** (PIL and PyTorch pipelines expect RGB), and **`imread` returns None on failure** (it does not raise). These two facts cost someone an hour of debugging every week in 2026.
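
To show what the `blobFromImage` step does to the pixels, here is a pure-Python sketch of the same transformation on a tiny nested-list image: swap the first and third channels, scale to 0-1, and reorder from height-width-channel to batch-channel-height-width. The helper `to_blob` is hypothetical, and it skips resizing and mean subtraction, which the real function also handles.

```python
def to_blob(image_bgr, scale=1 / 255.0, swap_rb=True):
    """Convert an HxWx3 nested list (BGR, 0-255) into a 1x3xHxW nested list."""
    height = len(image_bgr)
    width = len(image_bgr[0])
    # With swap_rb the output planes are ordered R, G, B.
    order = (2, 1, 0) if swap_rb else (0, 1, 2)
    planes = []
    for channel in order:
        plane = [
            [image_bgr[y][x][channel] * scale for x in range(width)]
            for y in range(height)
        ]
        planes.append(plane)
    return [planes]  # leading list is the batch dimension


# A 2x2 image given as (B, G, R) tuples: pure blue, pure green, pure red, mixed.
pixel_image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (51, 102, 204)],
]
blob = to_blob(pixel_image)
```

The first pixel is pure blue in BGR, so it ends up with 0 in the first (red) plane and 1 in the third (blue) plane. Feeding an already-converted RGB image into a call with `swapRB=True` would silently undo the conversion.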

Chapter 3 - MediaPipe 0.10 / MediaPipe Studio - The New Standard for Mobile Real-Time

Google's MediaPipe went through a major shift in 2023. The older legacy "MediaPipe Solutions" API gave way to the **MediaPipe Tasks API**, and the no-code train/deploy tool **MediaPipe Studio** appeared.

As of 2026, MediaPipe offers the following solutions through one-line APIs:

- **Hand Landmarker** - 21 hand keypoints

- **Pose Landmarker** - 33 body keypoints plus segmentation mask

- **Face Landmarker** - 478 facial mesh points plus blendshapes

- **Image Embedder** - MobileNet-V3 embeddings

- **Object Detector** - EfficientDet-Lite

- **Image Segmenter** - Selfie segmentation, hair segmentation

- **Gesture Recognizer** - 7 pre-trained gestures

- **Image Classifier** - EfficientNet-Lite

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Pose Landmarker - configure once, then create the instance
options = vision.PoseLandmarkerOptions(
    base_options=python.BaseOptions(model_asset_path='pose_landmarker.task'),
    running_mode=vision.RunningMode.VIDEO,
    num_poses=2,
    min_pose_detection_confidence=0.5,
)
landmarker = vision.PoseLandmarker.create_from_options(options)

# Inference per frame - rgb_frame is an RGB uint8 array supplied by the caller,
# and timestamp_ms must increase from frame to frame
mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb_frame)
result = landmarker.detect_for_video(mp_image, timestamp_ms)
for pose in result.pose_landmarks:
    for lm in pose:
        print(lm.x, lm.y, lm.z, lm.visibility)
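
The `x` and `y` values printed above are normalized to the image size, so drawing them requires a conversion to pixels. A minimal sketch follows, using a stand-in `Landmark` dataclass (hypothetical, only so the example runs without MediaPipe) and an assumed visibility cutoff of 0.5.

```python
from dataclasses import dataclass


@dataclass
class Landmark:
    """Stand-in for a MediaPipe landmark (hypothetical, for illustration)."""
    x: float
    y: float
    visibility: float


def to_pixels(landmarks, width, height, min_visibility=0.5):
    """Return (index, px, py) for landmarks visible enough to trust."""
    points = []
    for index, lm in enumerate(landmarks):
        if lm.visibility < min_visibility:
            continue  # skip occluded or uncertain keypoints
        # Clamp because landmarks can fall slightly outside the frame.
        px = min(max(int(lm.x * width), 0), width - 1)
        py = min(max(int(lm.y * height), 0), height - 1)
        points.append((index, px, py))
    return points


pose = [
    Landmark(0.5, 0.25, 0.99),
    Landmark(0.1, 0.9, 0.2),
    Landmark(1.0, 1.0, 0.8),
]
pixels = to_pixels(pose, width=640, height=480)
```

The second landmark is dropped for low visibility, and the third is clamped to the last valid pixel rather than landing one past the image edge.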

MediaPipe's real value is **real-time speed on mobile, typically 30 to 60 FPS on recent phones**. The same task run through an unoptimized PyTorch model on a phone is often several times slower. CPU/GPU/NPU auto-dispatch, TFLite optimization, and the XNNPACK backend are bundled together.

The limit is also clear - if the task is not predefined, you cannot use it, and training your own model requires the MediaPipe Model Maker detour. MediaPipe owns the "do a fixed job, fast" niche.

Chapter 4 - Detectron2 / Detectron3 - Meta's Orthodox Detection Toolkit

Meta AI Research's Detectron2 has been the de facto academic standard since its 2019 release. **Detectron3** entered beta in late 2025, and as of 2026 the two coexist.

The differences:

| Item | Detectron2 | Detectron3 |

| --- | --- | --- |

| Default backbones | ResNet, ViT | ConvNeXt v2, DINOv3, SAM2 encoder |

| Detection heads | Mask R-CNN, Cascade R-CNN | Mask R-CNN, Mask2Former, ViTDet |

| Training framework | PyTorch 1.x/2.x | PyTorch 2.5+, torch.compile by default |

| Config system | YACS (yaml) | LazyConfig (pythonic) |

| Distributed training | DDP | FSDP plus activation checkpoint |

A Detectron2 code snippet:

import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
)
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5

predictor = DefaultPredictor(cfg)
image = cv2.imread('input.jpg')  # DefaultPredictor expects BGR by default
outputs = predictor(image)
instances = outputs["instances"]
# instances.pred_boxes, instances.pred_masks, instances.pred_classes

Detectron3's LazyConfig (already available in Detectron2 alongside YACS) replaces yaml with Python objects. IDE autocompletion and type checking work, and conditional logic stays clean.

**When to use Detectron?** When reproducing papers, comparing COCO/LVIS benchmarks, or when you need a "standard" Mask R-CNN baseline. In production many teams migrate to YOLO or MMDetection.

Chapter 5 - The YOLO Family - From v8 to v12

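The YOLO line moved fast: YOLOv8 in January 2023, YOLOv9 in February 2024, YOLOv10 in May 2024, YOLO11 in September 2024, and YOLOv12 in February 2025. Ultralytics itself shipped v8 and 11; v9, v10, and v12 came from academic groups and run through the same Ultralytics package. No other vision framework matches this pace of major releases.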

Summary:

| --- | --- | --- | --- |

YOLO's appeal is summarized in a single block:

from ultralytics import YOLO

# 1) Load - five tasks share one API (uncomment the variant you need)
model = YOLO('yolo11n.pt')          # detection (nano)
# model = YOLO('yolo11n-seg.pt')    # segmentation
# model = YOLO('yolo11n-pose.pt')   # pose
# model = YOLO('yolo11n-obb.pt')    # oriented bounding box
# model = YOLO('yolo11n-cls.pt')    # classification

# 2) Inference
results = model('input.jpg')
for r in results:
    print(r.boxes.xyxy)  # coordinates
    print(r.boxes.conf)  # confidence
    print(r.boxes.cls)   # class

# 3) Training
model.train(data='coco.yaml', epochs=100, imgsz=640)

# 4) Export - ONNX, TensorRT, CoreML, TFLite, each in one line
model.export(format='onnx')
model.export(format='engine')  # TensorRT
model.export(format='coreml')

**Warning about AGPL.** Ultralytics YOLO code and weights are licensed under AGPL-3.0, so using them in a SaaS or web service triggers source disclosure obligations. Closed-source commercial use requires the Ultralytics Enterprise license. Some companies use RT-DETR, DAMO-YOLO, or D-FINE under Apache-2.0 to avoid this.

Chapter 6 - MMDetection / MMCV / OpenMMLab - The Widest Catalog

OpenMMLab, run by the Shanghai AI Lab, owns the widest model catalog in vision. It has more than 10 sub-projects including **MMDetection** (detection), **MMSegmentation** (segmentation), **MMPose** (pose), **MMTracking** (tracking), **MMDetection3D** (3D detection), and **MMYOLO** (unified YOLO).

Two distinguishing traits:

First, **all models share one config system.** Swapping a Mask R-CNN backbone to ConvNeXt, FPN to BiFPN, or the head to DETR is a few lines in a Python config file.

Second, **benchmark reproducibility is strong.** Reaching paper results within plus or minus 0.3 mAP is normal. It is the framework closest to the academic standard.

from mmdet.apis import init_detector, inference_detector

config = 'configs/yolox/yolox_s_8xb8-300e_coco.py'
checkpoint = 'yolox_s.pth'
model = init_detector(config, checkpoint, device='cuda:0')

result = inference_detector(model, 'demo.jpg')
pred = result.pred_instances
# pred.bboxes, pred.scores, pred.labels

The downside of MMDetection is **the steep learning curve.** You have to understand the config system, the Registry pattern, and the Hook system before plugging in your own model. Better suited for "going deep" than "starting fast".

Chapter 7 - Roboflow Universe Plus Supervision - From Annotation to Training

Roboflow plays the role of GitHub for vision data. As of 2026 **Roboflow Universe** hosts over 300,000 public datasets and over 50,000 pretrained models.

Two key tools:

**Roboflow Annotate** - Web-based annotation. Boxes, polygons, keypoints, and OBB are all supported. The auto-label feature calls SAM and Grounding DINO to draft proposals.

**Supervision** - The vision utility kit Roboflow open-sourced. Visualization, filtering, metrics, and trackers all live in one package.

import cv2
import supervision as sv
from ultralytics import YOLO

model = YOLO('yolo11n.pt')
image = cv2.imread('input.jpg')
results = model(image)[0]

# Convert to Roboflow Supervision's Detections object
detections = sv.Detections.from_ultralytics(results)

# Visualize - box plus label, one line each
box_annotator = sv.BoxAnnotator()
label_annotator = sv.LabelAnnotator()
annotated = box_annotator.annotate(scene=image.copy(), detections=detections)
annotated = label_annotator.annotate(scene=annotated, detections=detections)

# Tracking - ByteTrack in one line
tracker = sv.ByteTrack()
detections = tracker.update_with_detections(detections)

The strength of Supervision is that it **separates model inference from visualization and metrics**. YOLO, Detectron, and MMDetection outputs all unify under the same `Detections` object.

Chapter 8 - HuggingFace Transformers Vision - From ViT to DETR

HuggingFace Transformers is not just for NLP. As of 2026 more than 200 vision models are registered.

Representative catalog:

- **ViT (Vision Transformer)** - the classification standard

- **DETR / Deformable DETR / DINO** - transformer-based detection

- **Mask2Former / OneFormer** - unified segmentation

- **OWL-ViT / OWLv2** - open-vocabulary detection

- **CLIP / SigLIP / SigLIP 2** - image-text embeddings

- **DINOv2 / DINOv3** - self-supervised backbones

- **SAM / SAM 2** - segmentation

- **Depth Anything v2 / v3** - depth estimation

from transformers import pipeline

# Classification - one line
classifier = pipeline('image-classification', model='google/vit-base-patch16-224')
print(classifier('input.jpg'))

# Detection
detector = pipeline('object-detection', model='facebook/detr-resnet-50')

# Open-vocabulary detection
detector = pipeline('zero-shot-object-detection', model='google/owlv2-base-patch16-ensemble')
print(detector('input.jpg', candidate_labels=['cat', 'dog', 'person']))

# Segmentation
segmenter = pipeline('image-segmentation', model='facebook/mask2former-swin-large-coco-panoptic')

HuggingFace's appeal is one-line inference via `pipeline()`. Swapping models is just swapping a string. Training is better done elsewhere though - HF Trainer feels awkward for vision compared to PyTorch Lightning or MMDetection.

Chapter 9 - Segment Anything 2 (SAM 2) - The New Standard for Video Masks

Meta's SAM (Segment Anything Model) first appeared in April 2023. **SAM 2** released in July 2024 added **memory attention for video**, not just images. Catch a mask in one frame, and the rest are tracked automatically.

As of 2026, the SAM family:

| Model | Year | Parameters | Notes |

| --- | --- | --- | --- |

| SAM | 2023 | 91M-636M | Image segmentation |

| SAM 2 | 2024 | 39M-224M | Video plus image |

| FastSAM | 2023 | 68M | YOLOv8-seg backbone, 50x faster |

| MobileSAM | 2023 | 9.8M | Lightweight for mobile |

| EfficientSAM | 2023 | 26M | KD-compressed |

SAM 2 usage:

from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    'configs/sam2_hiera_l.yaml',
    'checkpoints/sam2_hiera_large.pt'
)

# Init video
state = predictor.init_state(video_path='video.mp4')

# One click in first frame - mask auto-tracked
predictor.add_new_points(
    inference_state=state,
    frame_idx=0,
    obj_id=1,
    points=[[210, 350]],
    labels=[1]  # 1 = positive click, 0 = negative click
)

# Track masks across all frames
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    # mask_logits holds one logit map per object; threshold at 0 for binary masks
    masks = mask_logits > 0.0

SAM 2's value is that **once you teach it, it follows forever**. Labeling cost no longer scales with video length. CVAT, Label Studio, and Roboflow all adopted SAM 2 integration as a built-in feature.

Chapter 10 - Grounding DINO 1.5 / 1.6 - Drawing Boxes from Text

IDEA Research's **Grounding DINO** is the model that made "open-vocabulary detection" the standard. The 1.0 came in 2023, 1.5 Pro/Edge in 2024, and 1.6 in late 2024.

Traditional YOLO and Detectron only detect the classes seen during training (80 for COCO, 1,203 for LVIS). Grounding DINO is different - it draws boxes from natural language prompts like **"red car"**, **"door handle"**, or **"person holding an umbrella"**.

from groundingdino.util.inference import load_model, load_image, predict

model = load_model('GroundingDINO_SwinT_OGC.py', 'groundingdino_swint_ogc.pth')
image_source, image = load_image('input.jpg')

# Natural language prompt - separate noun phrases with periods
TEXT_PROMPT = 'red car. person holding umbrella. door handle.'
BOX_THRESHOLD = 0.35
TEXT_THRESHOLD = 0.25

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_THRESHOLD,
    text_threshold=TEXT_THRESHOLD,
)
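
Since the prompt format matters (noun phrases separated by periods), a small helper that builds the caption from a list of labels can prevent formatting slips. This is a minimal sketch; `build_caption` is a hypothetical helper, and the lowercasing and de-duplication are assumptions made for consistency rather than documented requirements.

```python
def build_caption(phrases):
    """Join noun phrases into one period-separated, lowercased prompt."""
    cleaned = []
    for phrase in phrases:
        text = phrase.strip().strip('.').strip().lower()
        if text and text not in cleaned:  # drop empties and duplicates
            cleaned.append(text)
    return ' '.join(f'{text}.' for text in cleaned)


caption = build_caption(
    ['Red car', 'person holding umbrella ', 'door handle.', 'red car']
)
```

The result is the same string used as `TEXT_PROMPT` above, regardless of stray whitespace, trailing periods, or repeated labels in the input list.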

**Grounding SAM** - the pipeline of Grounding DINO catching boxes plus SAM (or SAM 2) generating masks inside those boxes. The de facto starting point for labeling in 2026.

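import torch
from torchvision.ops import box_convert
# 1) Grounding DINO for boxes - returned as normalized (cx, cy, w, h)
boxes, _, _ = predict(model_gdino, image, 'cat. dog.', 0.35, 0.25)
h, w, _ = image_source.shape
boxes_xyxy = box_convert(boxes * torch.tensor([w, h, w, h]), 'cxcywh', 'xyxy')
# 2) SAM for masks - box prompts are pixel (x1, y1, x2, y2)
sam_predictor.set_image(image_source)
sam_boxes = sam_predictor.transform.apply_boxes_torch(boxes_xyxy, image_source.shape[:2]).to(sam_predictor.device)
masks, _, _ = sam_predictor.predict_torch(None, None, boxes=sam_boxes, multimask_output=False)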

These two steps are the entirety of "object segmentation without a dataset". Drawing boxes on 10,000 images by hand collapses into a 30-minute script.
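
One detail that is easy to miss when gluing the two models together: Grounding DINO's `predict` helper returns boxes as normalized (center x, center y, width, height), while SAM's box prompt expects pixel (x1, y1, x2, y2). A pure-Python sketch of that conversion follows; the function name is hypothetical and the clamping to image bounds is an added safety assumption.

```python
def cxcywh_norm_to_xyxy_pixels(box, image_width, image_height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    x1 = (cx - w / 2.0) * image_width
    y1 = (cy - h / 2.0) * image_height
    x2 = (cx + w / 2.0) * image_width
    y2 = (cy + h / 2.0) * image_height
    # Clamp so boxes that spill past the border stay inside the image.
    x1 = min(max(x1, 0.0), float(image_width))
    y1 = min(max(y1, 0.0), float(image_height))
    x2 = min(max(x2, 0.0), float(image_width))
    y2 = min(max(y2, 0.0), float(image_height))
    return (x1, y1, x2, y2)


centered = cxcywh_norm_to_xyxy_pixels((0.5, 0.5, 0.2, 0.4), 640, 480)
at_edge = cxcywh_norm_to_xyxy_pixels((0.95, 0.5, 0.2, 0.2), 640, 480)
```

Passing the raw normalized boxes straight to SAM produces prompts in the top-left pixel of the image, which is a common source of empty or nonsensical masks.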

Chapter 11 - Florence-2 and YOLO-World - Other Open-Vocabulary Contenders

There are two more strong open-vocabulary players besides Grounding DINO.

**Florence-2** (Microsoft, 2024) - handles classification, captioning, detection, segmentation, and OCR with one model. Very compact at 0.23B and 0.77B parameters, but the quality is high. Instead of natural language prompts it uses **task tokens** like `<OD>` (detection), `<DENSE_REGION_CAPTION>`, and `<REFERRING_EXPRESSION_SEGMENTATION>`.

from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained('microsoft/Florence-2-large', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained('microsoft/Florence-2-large', trust_remote_code=True)

image = Image.open('input.jpg').convert('RGB')
prompt = '<OD>'  # Object Detection task token
inputs = processor(text=prompt, images=image, return_tensors='pt')
generated_ids = model.generate(**inputs, max_new_tokens=1024, num_beams=3)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# image_size is (width, height) of the input image
parsed = processor.post_process_generation(generated_text, task='<OD>', image_size=(image.width, image.height))
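
The parsed result is a plain dictionary keyed by the task token. As a sketch of how to consume it, the helper below flattens an `<OD>` result into per-object records. The dictionary layout (`bboxes` as pixel x1, y1, x2, y2 lists next to a parallel `labels` list) follows the examples on the Florence-2 model card as far as I can tell, so treat it as an assumption and check it against your own output; the sample data is made up.

```python
def od_to_records(parsed, task='<OD>'):
    """Flatten a Florence-2 style detection result into one dict per object."""
    result = parsed[task]
    records = []
    for box, label in zip(result['bboxes'], result['labels']):
        x1, y1, x2, y2 = box
        area = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        records.append({'label': label, 'xyxy': (x1, y1, x2, y2), 'area': area})
    return records


# Made-up sample in the assumed layout, not real model output.
sample = {
    '<OD>': {
        'bboxes': [[10.0, 20.0, 110.0, 220.0], [0.0, 0.0, 50.0, 40.0]],
        'labels': ['car', 'wheel'],
    }
}
records = od_to_records(sample)
largest = max(records, key=lambda record: record['area'])
```

Records in this shape are easy to filter by label or area before handing the boxes to a segmenter or tracker.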

**YOLO-World** (Tencent, 2024) - YOLO speed with open-vocabulary on top. 10-20x faster than Grounding DINO, detects from natural language prompts without training.

from ultralytics import YOLOWorld

model = YOLOWorld('yolov8x-worldv2.pt')

model.set_classes(['red car', 'door handle', 'person holding umbrella'])

results = model.predict('input.jpg')

When to pick what. **Quality first - Grounding DINO 1.6**, **speed first - YOLO-World**, **multiple tasks in one model - Florence-2**. These three form the 2026 open-vocabulary triangle.

Chapter 12 - VLM (Vision Language Model) - "Ask the Image"

After GPT-4V appeared in late 2023, VLMs became a new layer in computer vision. As of May 2026 the major VLMs are:

**Closed source (API)**

- **GPT-4o** / **GPT-4.5-vision** (OpenAI)

- **Claude 3.5 Sonnet** / **Claude 4 Opus** (Anthropic)

- **Gemini 2.0 Flash** / **Gemini 2.0 Pro** (Google)

**Open source**

- **Qwen2-VL** / **Qwen2.5-VL** (Alibaba) - 2B/7B/72B

- **InternVL 2.5** (OpenGVLab) - 1B/2B/4B/8B/26B/40B/76B

- **LLaVA-OneVision** (ByteDance/UW) - 0.5B/7B/72B

- **CogVLM2** (Zhipu) - 19B

- **Phi-3.5-vision** (Microsoft) - 4.2B

- **Pixtral 12B** (Mistral) - 12B

VLM usage splits into two patterns.

**(a) Vision QA** - "How many people are in this photo?"

import base64
from anthropic import Anthropic

client = Anthropic()

with open('input.jpg', 'rb') as f:
    img_b64 = base64.standard_b64encode(f.read()).decode('utf-8')

message = client.messages.create(
    model='claude-3-5-sonnet-20241022',
    max_tokens=1024,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'image', 'source': {'type': 'base64', 'media_type': 'image/jpeg', 'data': img_b64}},
            {'type': 'text', 'text': 'How many people are in this image?'}
        ]
    }]
)
print(message.content[0].text)

**(b) Structured Output** - "Extract items and prices from this receipt as JSON"

# Forcing a JSON schema ends OCR plus parsing in one call
schema = {
    'type': 'object',
    'properties': {
        'items': {'type': 'array', 'items': {
            'type': 'object',
            'properties': {
                'name': {'type': 'string'},
                'price': {'type': 'number'},
                'quantity': {'type': 'integer'}
            }
        }},
        'total': {'type': 'number'}
    }
}

VLM limits are also clear. **(1) Weak coordinates** - "third from the top-left" is fine, but exact pixel coordinates are inaccurate. **(2) Expensive and slow** - 10,000 images on GPT-4o costs 10-50 dollars, while YOLO is free and finishes in a minute. **(3) Not deterministic** - the same question can produce different answers.

Hence the pattern: **ask "what to look for" with a VLM, find "where it is" with a traditional model.**
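
Because VLM output is not deterministic, it pays to validate a structured reply before trusting it. The sketch below checks a receipt reply against the structure described above and cross-checks the arithmetic. It assumes `price` is a unit price and that the reply arrives as a JSON string; `validate_receipt` and the sample reply are hypothetical.

```python
import json


def validate_receipt(raw_json, tolerance=0.01):
    """Parse a VLM reply and check its structure and arithmetic."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError('reply must be a JSON object')
    items = data.get('items')
    if not isinstance(items, list) or not items:
        raise ValueError('items must be a non-empty list')
    computed_total = 0.0
    for item in items:
        if not isinstance(item, dict) or not isinstance(item.get('name'), str):
            raise ValueError('each item needs a string name')
        price = item.get('price')
        quantity = item.get('quantity', 1)
        if isinstance(price, bool) or not isinstance(price, (int, float)):
            raise ValueError('price must be a number')
        if isinstance(quantity, bool) or not isinstance(quantity, int):
            raise ValueError('quantity must be an integer')
        computed_total += price * quantity
    total = data.get('total')
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        raise ValueError('total must be a number')
    if abs(computed_total - total) > tolerance:
        raise ValueError(f'items sum to {computed_total}, reply says {total}')
    return data


reply = (
    '{"items": [{"name": "coffee", "price": 4.5, "quantity": 2},'
    ' {"name": "bagel", "price": 3.0, "quantity": 1}], "total": 12.0}'
)
receipt = validate_receipt(reply)
```

A failed check is a cheap signal to retry the call or route the image to a human, which is usually better than silently storing a wrong total.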

Chapter 13 - 3D Vision - DUSt3R, MASt3R, VGGT

The biggest change in 3D vision in 2024-2025 was that **3D reconstruction from just two photos** became possible.

**DUSt3R** (Naver Labs Europe, 2024) - takes two images and directly regresses pixel-wise 3D pointmaps. Works without knowing camera intrinsics. Compressed the complex SfM and MVS pipelines into one model.

**MASt3R** (Naver Labs, 2024) - DUSt3R plus matching. Outputs pixel correspondences between two images. Directly usable for SLAM and localization.

**VGGT** (Meta, 2025) - Visual Geometry Grounded Transformer. Takes multiple images at once and estimates camera poses, depth maps, and pointmaps simultaneously. Overcomes the pairwise limit of DUSt3R.

**Spann3R** (2024) - sequential 3D reconstruction via memory tokens. Closer to video SLAM results.

from dust3r.inference import inference
from dust3r.model import AsymmetricCroCo3DStereo
from dust3r.utils.image import load_images
from dust3r.image_pairs import make_pairs

model = AsymmetricCroCo3DStereo.from_pretrained('naver/DUSt3R_ViTLarge_BaseDecoder_512_dpt').to('cuda')
images = load_images(['img1.jpg', 'img2.jpg'], size=512)
pairs = make_pairs(images, scene_graph='complete', prefilter=None, symmetrize=True)
output = inference(pairs, model, device='cuda', batch_size=1)
# extract pointmap, confidence, depth from output

This is a field where Korean contributions are large. The Grenoble team at Naver Labs Europe made DUSt3R, MASt3R, and CroCo.

Chapter 14 - Depth Anything v2/v3, Marigold, DepthPro

Monocular depth estimation exploded in 2024-2025.

**Depth Anything v2** (HKU, 2024) - a strong depth model trained on 62 million unlabeled images. Four sizes: Small (24M), Base (97M), Large (335M), Giant (1.3B). **Depth Anything v3** (2025) strengthened video consistency and metric depth.

**Marigold** (ETH Zürich, 2024) - Stable Diffusion fine-tuned for depth. Diffusion-based means good detail, but slow.

**DepthPro** (Apple, 2024) - estimates metric depth from one image in about 0.3 seconds on a standard GPU, without needing camera intrinsics. A natural fit for metric depth on devices without LiDAR.

from transformers import pipeline

pipe = pipeline(task='depth-estimation', model='depth-anything/Depth-Anything-V2-Large-hf')

depth = pipe('input.jpg')['depth'] # PIL Image

When to pick what. **Relative depth is enough** - Depth Anything, **metric depth needed** - DepthPro, **detail first** - Marigold.

Chapter 15 - Pose Estimation - MMPose, OpenPose, AlphaPose, DWPose

Pose estimation finds keypoints. The 2026 standard catalog:

| Tool | Keypoints | Notes |

| --- | --- | --- |

| MediaPipe Pose | 33 | Mobile real-time |

| OpenPose | 25 (BODY_25) | Multi-person, older standard |

| AlphaPose | 17 (COCO) | Top-down accuracy |

| MMPose | 17-133 | Widest model catalog |

| DWPose | 133 (full body, face, hands) | The standard for ControlNet pose conditioning |

| RTMPose | 17 | Mobile real-time, under MMPose |

**DWPose** became the de facto pose standard in 2024-2026. The reason is simple - **ControlNet, AnimateDiff, and Stable Video Diffusion all accept DWPose keypoints as conditions.** Pose conditioning in generative AI is mostly DWPose.

from mmpose.apis import inference_topdown, init_model

config = 'configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-x_8xb320-270e_cocktail14-384x288.py'
checkpoint = 'rtmw-x_simcc-cocktail14_pt-ucoco_270e-384x288.pth'
model = init_model(config, checkpoint, device='cuda')

# person_boxes: xyxy person boxes from any detector, supplied by the caller
results = inference_topdown(model, 'input.jpg', bboxes=person_boxes)
# results[i].pred_instances.keypoints, results[i].pred_instances.keypoint_scores
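
Keypoints become useful once they are turned into measurements. As a minimal sketch, the code below computes the left elbow angle from keypoints in COCO order (5 = left shoulder, 7 = left elbow, 9 = left wrist) and skips joints whose score falls below an assumed threshold of 0.3. The function names and the sample pose are hypothetical.

```python
import math

LEFT_SHOULDER, LEFT_ELBOW, LEFT_WRIST = 5, 7, 9  # COCO keypoint indices


def joint_angle(a, b, c):
    """Angle in degrees at point b, formed by the segments b-a and b-c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1 = math.hypot(v1[0], v1[1])
    n2 = math.hypot(v2[0], v2[1])
    if n1 == 0 or n2 == 0:
        raise ValueError('degenerate joint: two keypoints coincide')
    cos_theta = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    cos_theta = min(1.0, max(-1.0, cos_theta))  # guard against rounding
    return math.degrees(math.acos(cos_theta))


def left_elbow_angle(keypoints, scores, min_score=0.3):
    """Return the left elbow angle, or None if any joint is unreliable."""
    joints = (LEFT_SHOULDER, LEFT_ELBOW, LEFT_WRIST)
    if any(scores[j] < min_score for j in joints):
        return None
    return joint_angle(*(keypoints[j] for j in joints))


# Made-up pose: upper arm pointing down, forearm pointing right.
keypoints = [(0.0, 0.0)] * 17
keypoints[LEFT_SHOULDER] = (100.0, 100.0)
keypoints[LEFT_ELBOW] = (100.0, 150.0)
keypoints[LEFT_WRIST] = (150.0, 150.0)
scores = [1.0] * 17
angle = left_elbow_angle(keypoints, scores)
```

Whole-body models such as the one loaded above keep the 17 COCO body points at the front of their keypoint list, so the same indices should apply there, though that is worth confirming for the specific checkpoint.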

Chapter 16 - Object Tracking - ByteTrack, BoT-SORT, OC-SORT, DEVA

Tracking assigns the same ID to the same object across video frames. The 2026 standard narrows to four:

| Tracker | Input | Notes |

| --- | --- | --- |

| ByteTrack (2022) | Box plus confidence | Two-stage matching for low-confidence boxes. Most widely used |

| BoT-SORT (2022) | Box plus ReID embedding | Camera motion compensation |

| OC-SORT (2023) | Box | Observation-centric, robust to occlusion |

| DEVA (2023) | Box plus mask | Pairs with SAM, video segmentation tracking |
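
The "two-stage matching" idea in the ByteTrack row can be sketched in a few lines: match existing tracks to high-confidence boxes first, then give the still-unmatched tracks a second chance with the low-confidence boxes. This is a deliberate simplification with assumed thresholds: it uses greedy IoU matching on raw boxes, whereas the real tracker uses Kalman-predicted boxes and Hungarian assignment.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def two_stage_associate(tracks, detections, high=0.5, low=0.1, min_iou=0.3):
    """tracks: {track_id: box}; detections: list of (box, score).

    Returns (matches, new_candidates): track_id -> detection index, plus the
    unmatched high-confidence detections that could start new tracks.
    """
    high_idx = [i for i, (_, s) in enumerate(detections) if s >= high]
    low_idx = [i for i, (_, s) in enumerate(detections) if low <= s < high]
    matches = {}
    for group in (high_idx, low_idx):  # stage 1, then stage 2
        pairs = sorted(
            ((iou(box, detections[i][0]), tid, i)
             for tid, box in tracks.items() if tid not in matches
             for i in group),
            reverse=True,
        )
        used = set()
        for overlap, tid, i in pairs:
            if overlap < min_iou:
                break  # remaining pairs overlap even less
            if tid in matches or i in used:
                continue
            matches[tid] = i
            used.add(i)
    matched = set(matches.values())
    new_candidates = [i for i in high_idx if i not in matched]
    return matches, new_candidates


tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [
    ((1, 1, 11, 11), 0.9),          # confident, overlaps track 1
    ((21, 21, 31, 31), 0.2),        # weak (e.g. occluded), overlaps track 2
    ((100, 100, 110, 110), 0.8),    # confident, overlaps nothing
]
matches, new_candidates = two_stage_associate(tracks, detections)
```

Track 2 survives only because of the second stage: a detector-side threshold of 0.5 would have discarded its box and the ID would have been lost during the occlusion.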

ByteTrack is integrated into Roboflow Supervision, so it is a one-liner:

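import supervision as sv

# Recent Supervision releases renamed track_thresh / track_buffer to the names below
tracker = sv.ByteTrack(track_activation_threshold=0.5, lost_track_buffer=30)
for frame, detections in stream():  # stream() is a placeholder for your frame source
    detections = tracker.update_with_detections(detections)
    # detections.tracker_id holds IDs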

Remember also that **SAM 2 itself acts as a tracker**. One click in one frame, and the mask propagates across the whole video. Tracking and segmentation are now merged into one model.

Chapter 17 - Diffusion-Based Vision - ControlNet, IP-Adapter

Generative vision is no longer "another field". ControlNet and IP-Adapter take detection or segmentation outputs as input conditions and generate new images.

- **ControlNet** - takes Canny edge, depth, pose (DWPose), segmentation, normal as conditions

- **IP-Adapter** - takes an image itself as style or content condition

- **T2I-Adapter** - a lighter alternative to ControlNet

import torch
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained('thibaud/controlnet-openpose-sdxl-1.0', torch_dtype=torch.float16)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    controlnet=controlnet,
    torch_dtype=torch.float16
).to('cuda')

# compute_dwpose is a placeholder for your DWPose step; it returns the rendered keypoint image
pose_image = compute_dwpose(input_image)
image = pipe('a person dancing in the rain', image=pose_image).images[0]

This snippet is the standard **"vision recognition -> vision generation"** pipeline. Detection and generation now live in the same toolchain.

Chapter 18 - Embedding Models - CLIP, SigLIP, DINOv2, DINOv3

Models that turn images into vectors are the most-called computer vision models in 2026. Image search, dedup, clustering, and zero-shot classification all run on embeddings.

| --- | --- | --- | --- |

When to pick which embedding:

- **Text-to-image search** -> CLIP or SigLIP 2

- **Image-to-image search** -> DINOv3

- **Downstream classification head training** -> DINOv3

- **OCR or document work** -> SigLIP 2

import torch
from PIL import Image
from transformers import AutoModel, AutoImageProcessor

model = AutoModel.from_pretrained('facebook/dinov3-vitl16-pretrain-lvd1689m')
processor = AutoImageProcessor.from_pretrained('facebook/dinov3-vitl16-pretrain-lvd1689m')

image = Image.open('input.jpg').convert('RGB')
inputs = processor(images=image, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
embedding = outputs.last_hidden_state.mean(dim=1)  # (1, 1024)
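
Once every image has a vector, search and dedup reduce to comparing vectors. The sketch below ranks a toy index by cosine similarity in pure Python. The three-dimensional vectors and file names are made up for illustration; real embeddings have hundreds or thousands of dimensions and would normally live in a vector index rather than a dict.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the k most similar (name, score) pairs, best first."""
    scored = [(cosine_similarity(query, vector), name) for name, vector in index.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:k]]


# Toy index: made-up 3-d vectors standing in for image embeddings.
index = {
    'cat_1.jpg': [0.9, 0.1, 0.0],
    'cat_2.jpg': [0.8, 0.3, 0.1],
    'car.jpg': [0.0, 0.2, 0.95],
}
query = [1.0, 0.0, 0.0]
hits = top_k(query, index, k=2)
```

The same function covers near-duplicate detection: pairs whose similarity exceeds a chosen threshold are candidates for removal.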

Chapter 19 - Annotation Tools - CVAT, Label Studio, Roboflow, VIA

Labeling tools in 2026 share one common trait - **AI-assist is the default**. SAM, Grounding DINO, and YOLO are called directly inside the tool.

| Tool | License | Strength |

| --- | --- | --- |

| CVAT | MIT, open source | Widest format support, strong on video |

| Label Studio | Apache 2.0 | Unified NLP/audio/image with ML backends |

| Roboflow Annotate | Commercial SaaS | SAM and Grounding DINO integration, collaboration |

| VIA | BSD, open source | Lightweight single HTML file |

**Team-scale work** - Roboflow or CVAT, **model integration** important - Label Studio, **start within 5 minutes** - VIA.

Chapter 20 - Inference Runtimes - ONNX Runtime, TensorRT, OpenVINO

Running trained models fast is what an inference runtime does. The 2026 options:

| Runtime | Target | Strength |

| --- | --- | --- |

| ONNX Runtime | Cross-platform | CPU/GPU/NPU, most standard |

| TensorRT | NVIDIA GPU | Top speed, INT8 and FP8 quantization |

| OpenVINO | Intel CPU/iGPU/NPU | Best on x86 PC |

| CoreML | Apple Silicon | iOS/macOS, leverages ANE |

| TFLite | Mobile (Android) | XNNPACK, Hexagon |

| NCNN | Mobile/embedded | Tencent, ARM-optimized |

| MNN | Mobile/embedded | Alibaba, strong OpenCL |

YOLO can export all of them in one line:

from ultralytics import YOLO

model = YOLO('yolo11n.pt')

model.export(format='onnx', dynamic=True, simplify=True)

model.export(format='engine', half=True) # TensorRT FP16

model.export(format='openvino', int8=True) # OpenVINO INT8

model.export(format='coreml', nms=True) # CoreML

model.export(format='tflite', int8=True) # TFLite INT8

model.export(format='ncnn') # NCNN

**Selection rules.** NVIDIA GPU server -> TensorRT. Intel PC -> OpenVINO. iOS/macOS -> CoreML. Android -> TFLite. Not sure -> ONNX Runtime.
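
The selection rules fit in a lookup table, which is a convenient way to keep deployment scripts consistent. A minimal sketch follows: the target keys (`nvidia_gpu`, `intel_pc`, and so on) are hypothetical labels, while the export format strings are the ones used in the export calls above.

```python
RUNTIME_BY_TARGET = {
    'nvidia_gpu': 'TensorRT',
    'intel_pc': 'OpenVINO',
    'ios': 'CoreML',
    'macos': 'CoreML',
    'android': 'TFLite',
}

EXPORT_FORMAT_BY_RUNTIME = {
    'TensorRT': 'engine',
    'OpenVINO': 'openvino',
    'CoreML': 'coreml',
    'TFLite': 'tflite',
    'ONNX Runtime': 'onnx',
}


def pick_runtime(target):
    """Map a deployment target to a runtime, defaulting to ONNX Runtime."""
    return RUNTIME_BY_TARGET.get(target.lower(), 'ONNX Runtime')


def export_format_for(target):
    """Return the export format string matching the chosen runtime."""
    return EXPORT_FORMAT_BY_RUNTIME[pick_runtime(target)]


server_runtime = pick_runtime('nvidia_gpu')
unknown_runtime = pick_runtime('raspberry_pi')
phone_format = export_format_for('ios')
```

Keeping the fallback explicit mirrors the "not sure" rule: anything unrecognized goes to ONNX Runtime rather than failing.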

Chapter 21 - Mobile Vision - ML Kit, Vision Framework, MNN

When you ship vision in a mobile app, native frameworks are the safest bet.

**Google ML Kit** (Android/iOS) - faces, barcodes, text, landmarks, translation, pose. On-device or server options.

**Apple Vision Framework** (iOS/macOS) - over 100 vision requests like VNDetectFaceRectanglesRequest and VNDetectHumanBodyPoseRequest. VNGenerateForegroundInstanceMaskRequest, introduced with iOS 17 in 2023, plays the role of an on-device SAM.

**MNN** (Alibaba) - the de facto standard in the Chinese mobile ecosystem. Embedded in Alibaba, Pinduoduo, and ByteDance apps.

// Swift / Apple Vision
import Vision

let request = VNDetectHumanBodyPoseRequest { request, error in
    guard let obs = request.results as? [VNHumanBodyPoseObservation] else { return }
    for observation in obs {
        let points = try? observation.recognizedPoints(.all)
        // points holds keypoints
    }
}

// cgImage is the CGImage to analyze, supplied by the caller
let handler = VNImageRequestHandler(cgImage: cgImage)
try? handler.perform([request])

**Rule.** Standard ML features (faces, text, barcodes) - native, fast and free. Custom models - CoreML (iOS) and TFLite (Android), ship your own model.

Chapter 22 - The Korean Vision Ecosystem

Korea's computer vision contributions are heavy on both the academic and industrial sides.

**Naver Labs Europe** - the Grenoble team behind DUSt3R, MASt3R, and CroCo. Frontier in 3D vision worldwide.

**KAIST CVLab** - groups of Professors In So Kweon, Yong Man Ro, and Eunbyung Park. Regular publications at CVPR and ICCV.

**Lunit** - medical imaging AI. Holds many FDA and CE clearances for chest X-ray, mammography, and digital pathology. INSIGHT CXR, MMG, BCC are its flagship products.

**Seerslab** - AR/VR/vision. Naver Z subsidiary, runs face recognition and filter engines for ZEPETO and SNOW.

**VUNO** - medical imaging AI. DeepASR, DeepCT, DeepBrain.

**MakinaRocks** - industrial vision anomaly detection. Semiconductor and display inspection.

**Riiid / Trinity / Innospace** - education and defense.

Korean academia is strong in **faces, OCR, autonomous driving, and medical imaging**. Korea sits in the world's top 4-5 for CVPR publications.

Chapter 23 - The Japanese Vision Ecosystem

**Preferred Networks (PFN)** - Japan's largest AI company. PFN-Vision library, vision for chemistry, materials, and robotics.

**ABEJA** - Tokyo-based vision SaaS. Store analytics for retail and manufacturing.

**Recruit Holdings** - vision recruiting systems via subsidiaries like Indeed and JOBSRU.

**ALBERT (now Accenture Japan)** - industrial AI and vision consulting.

**Nikon AI** - medical imaging and industrial inspection. A camera company expanded into vision AI.

**LeapMind** - strong embedded vision via the Blueoil quantized inference engine.

**SoftBank Robotics** - vision systems on Pepper and NAO.

**Fast Retailing (UNIQLO)** - in-house vision library for store camera analytics.

Japanese academia is strong in **OCR, document vision, and robotics vision**. UTokyo IIS, Kyoto U, Nagoya U, and Tokyo Tech regularly publish at CVPR.

Chapter 24 - The Posture of a 2026 Computer Vision Engineer

Boiling down 24 chapters into a single page:

**(1) Avoid training whenever possible.** A combination of pretrained models, open-vocabulary models, and VLMs covers 80%. Refining a Grounding DINO prompt takes less time than collecting 10,000 images.

**(2) Design pipelines, do not train models.** The job in 2026 is "which model to call, in what order", not "which loss to train on".

**(3) Divide labor between VLMs and traditional models.** Ask the VLM "what to find", ask the traditional model "where it is". This is the standard shape of a 2026 vision system.

**(4) Do not underestimate the inference runtime.** A model that runs at 30 fps in PyTorch can reach 200 fps in TensorRT. That is nearly a 7x gap (about 33 ms versus 5 ms per frame), and it makes for a different product in user experience.

**(5) Check the license.** Ultralytics YOLO is AGPL-3.0. Dropping it straight into a company product triggers source disclosure. Know RT-DETR, D-FINE, and MMDetection as alternatives.

**(6) Leave "pixel work" to OpenCV.** Color spaces, resizing, decoding, encoding - OpenCV is still the fastest and most stable.

**(7) Labeling tools come before datasets.** Start with CVAT or Roboflow with SAM 2 integration. The era of drawing boxes by hand is over.

The future of a vision engineer is **not "the person who makes the model" but "the person who composes the models"**. There are many tools. Pick the right one for each job.

References

- OpenCV - https://opencv.org/

- OpenCV 5.x roadmap - https://github.com/opencv/opencv/wiki/OpenCV-5

- MediaPipe - https://developers.google.com/mediapipe

- MediaPipe Tasks API - https://developers.google.com/mediapipe/solutions/tasks

- Detectron2 - https://github.com/facebookresearch/detectron2

- Detectron2 Model Zoo - https://github.com/facebookresearch/detectron2/blob/main/MODEL_ZOO.md

- Ultralytics YOLO - https://github.com/ultralytics/ultralytics

- Ultralytics docs - https://docs.ultralytics.com/

- MMDetection - https://github.com/open-mmlab/mmdetection

- MMPose - https://github.com/open-mmlab/mmpose

- Roboflow - https://roboflow.com/

- Roboflow Supervision - https://github.com/roboflow/supervision

- HuggingFace Transformers Vision - https://huggingface.co/docs/transformers/main/en/tasks/object_detection

- Segment Anything 2 - https://github.com/facebookresearch/sam2

- SAM 2 paper - https://ai.meta.com/research/publications/sam-2-segment-anything-in-images-and-videos/

- Grounding DINO - https://github.com/IDEA-Research/GroundingDINO

- Grounding SAM - https://github.com/IDEA-Research/Grounded-Segment-Anything

- Florence-2 - https://huggingface.co/microsoft/Florence-2-large

- YOLO-World - https://github.com/AILab-CVC/YOLO-World

- DUSt3R - https://github.com/naver/dust3r

- MASt3R - https://github.com/naver/mast3r

- VGGT - https://github.com/facebookresearch/vggt

- Depth Anything v2 - https://github.com/DepthAnything/Depth-Anything-V2

- Marigold - https://github.com/prs-eth/Marigold

- Apple DepthPro - https://github.com/apple/ml-depth-pro

- DINOv3 - https://github.com/facebookresearch/dinov3

- CLIP - https://github.com/openai/CLIP

- SigLIP 2 - https://huggingface.co/collections/google/siglip2

- DWPose - https://github.com/IDEA-Research/DWPose

- ByteTrack - https://github.com/ifzhang/ByteTrack

- ONNX Runtime - https://onnxruntime.ai/

- TensorRT - https://developer.nvidia.com/tensorrt

- OpenVINO - https://docs.openvino.ai/

- Apple CoreML - https://developer.apple.com/documentation/coreml

- TFLite - https://www.tensorflow.org/lite

- NCNN - https://github.com/Tencent/ncnn

- MNN - https://github.com/alibaba/MNN

- CVAT - https://github.com/cvat-ai/cvat

- Label Studio - https://labelstud.io/

- Naver Labs Europe - https://europe.naverlabs.com/

- Lunit - https://www.lunit.io/

- Preferred Networks - https://www.preferred.jp/en/