Computer Vision Frameworks 2026 - OpenCV 4, MediaPipe, Detectron2, YOLO v11, MMDetection, SAM 2, Grounding DINO Deep Dive

Prologue - Computer Vision Became "Asking" Instead of "Seeing"

Computer vision in the early 2010s had a clear shape. SIFT, HOG, and Haar extracted features, SVMs and random forests classified them, and OpenCV tied it all together. The deep learning wave that followed belonged to ResNet, EfficientNet, and Mask R-CNN - most of the job was collecting datasets, training models, and squeezing out mAP.

The landscape in 2026 looks different. A single sentence like "estimate the pose of the person in a red helmet in this photo" maps to a three-line pipeline: Grounding DINO catches the box, SAM 2 makes the mask, MMPose extracts the keypoints. We barely train anything. Instead, we design "which model to call, in what order".

This article walks through the 2026 computer vision stack in a single pass: from the basics of OpenCV through SAM 2 and VLMs to DINOv3, DUSt3R, and mobile inference, with the criteria for choosing the right tool along the way.

Chapter 1 - The 2026 CV Stack Map

Before diving into individual tools, let me draw the overall map. The 2026 computer vision world splits into five layers.

[Layer 5] Vision Language Model (VLM)
            GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash
            Qwen2-VL, InternVL 2.5, Pixtral 12B
                       |
[Layer 4] Open-Vocabulary Foundation
            Grounding DINO 1.6, Florence-2, YOLO-World
            SAM 2, DINOv3, CLIP, SigLIP
                       |
[Layer 3] Task-Specific Model
            YOLO v11, Detectron3, MMDetection
            MMPose, DWPose, ByteTrack, Depth Anything v3
                       |
[Layer 2] Inference Runtime
            ONNX Runtime, TensorRT, OpenVINO
            CoreML, TFLite, NCNN, MNN
                       |
[Layer 1] Image I/O and Primitives
            OpenCV 4.10, Pillow-SIMD
            FFmpeg, GStreamer

The higher the layer, the more "intelligent" the system gets, but latency rises with it. VLMs typically answer at one or two frames per second, while YOLO v11 can exceed 100 fps on a modern GPU. The job of a 2026 vision engineer is composing these two ends.

One-line summary: "Ask the question with a VLM, draw the answer with YOLO."
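To make the "compose the two ends" idea concrete, here is a minimal sketch of the frame budget involved. The function name is hypothetical and the rates are the rough figures quoted above, not measurements.

```python
import math

def vlm_stride(detector_fps: float, vlm_fps: float) -> int:
    """Return N such that sending every N-th frame to the VLM keeps it
    in step with a detector that processes every frame."""
    if detector_fps <= 0 or vlm_fps <= 0:
        raise ValueError("frame rates must be positive")
    return max(1, math.ceil(detector_fps / vlm_fps))

# With a 100 fps detector and a 2 fps VLM, the VLM sees every 50th frame.
print(vlm_stride(100, 2))  # 50
```

In this arrangement the detector handles every frame and the VLM is consulted only on sampled frames, or when the detector reports something worth asking about.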

Chapter 2 - OpenCV 4.10 / 5.x - Still the Starting Point for Everything

OpenCV did not die. It got even stronger in 2026. The reason is simple - reading images, cropping, converting color spaces, decoding video frames are all required by any deep learning pipeline.

As of May 2026 OpenCV 4.10 is the LTS, and 5.0 beta is in active development. Three key changes stand out.

First, the DNN module became the default for ONNX inference. You can call YOLO, ResNet, or ViT in one line via cv2.dnn.readNetFromONNX() without going through PyTorch or TensorFlow.

Second, G-API (Graph API) is stable. It expresses input-to-output as a graph and runs on OpenCL, CUDA, or Vulkan backends. Especially powerful on mobile and embedded.

Third, CUDA and OpenCL acceleration are built in. The cv2.cuda module runs Gaussian blur, optical flow, and image warping directly on the GPU.

import cv2

# 1) Read image - color order is BGR (careful!); imread returns None on failure
img = cv2.imread('input.jpg')
if img is None:
    raise FileNotFoundError('input.jpg could not be read')
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# 2) Resize - INTER_AREA is best for shrinking
small = cv2.resize(img_rgb, (640, 640), interpolation=cv2.INTER_AREA)

# 3) DNN inference - load ONNX model directly
net = cv2.dnn.readNetFromONNX('yolo11n.onnx')
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)  # needs an OpenCV build with CUDA
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

# small is already RGB, so swapRB stays False (True would flip it back to BGR)
blob = cv2.dnn.blobFromImage(small, 1/255.0, (640, 640), swapRB=False)
net.setInput(blob)
outputs = net.forward()

Two things to remember: OpenCV uses the BGR color space (different from PIL/PyTorch), and imread returns None on failure (it does not raise). These two facts cost someone an hour of debugging every week in 2026.
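Both pitfalls can be guarded against with a few lines. The sketch below uses plain Python lists so it runs without OpenCV installed; `require_image` and `bgr_to_rgb` are hypothetical helper names, and in real code `cv2.cvtColor` does the channel swap.

```python
def require_image(img, path):
    """Raise instead of silently passing None down the pipeline."""
    if img is None:
        raise FileNotFoundError(f"could not read image: {path}")
    return img

def bgr_to_rgb(pixels):
    """Reverse channel order for an image stored as rows of [B, G, R] lists."""
    return [[px[::-1] for px in row] for row in pixels]

# A 1x2 "image": one pure-blue and one pure-red pixel, in BGR order.
bgr = [[[255, 0, 0], [0, 0, 255]]]
print(bgr_to_rgb(bgr))  # [[[0, 0, 255], [255, 0, 0]]]
```

With OpenCV available, the first helper would wrap the read call, for example `img = require_image(cv2.imread(path), path)`.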

Chapter 3 - MediaPipe 0.10 / MediaPipe Studio - The New Standard for Mobile Real-Time

Google's MediaPipe went through a major shift in 2023. The older "MediaPipe Solutions API" (now labeled legacy) gave way to the MediaPipe Tasks API, and the browser-based tool MediaPipe Studio appeared for evaluating and customizing models without code.

As of 2026, MediaPipe offers the following solutions through one-line APIs:

Hand Landmarker - 21 hand keypoints
Pose Landmarker - 33 body keypoints plus segmentation mask
Face Landmarker - 478 facial mesh points plus blendshapes
Image Embedder - MobileNet-V3 embeddings
Object Detector - EfficientDet-Lite
Image Segmenter - Selfie segmentation, hair segmentation
Gesture Recognizer - 7 pre-trained gestures
Image Classifier - EfficientNet-Lite

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Pose Landmarker - one-line instance creation
options = vision.PoseLandmarkerOptions(
    base_options=python.BaseOptions(model_asset_path='pose_landmarker.task'),
    running_mode=vision.RunningMode.VIDEO,
    num_poses=2,
    min_pose_detection_confidence=0.5,
)
landmarker = vision.PoseLandmarker.create_from_options(options)

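# Inference per frame - mp_image is an mp.Image built from an RGB frame,
# and timestamp_ms must increase monotonically in VIDEO mode
result = landmarker.detect_for_video(mp_image, timestamp_ms)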
for pose in result.pose_landmarks:
    for lm in pose:
        print(lm.x, lm.y, lm.z, lm.visibility)
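The `x` and `y` values printed above are normalized to the image size, so drawing them requires a conversion to pixels. A minimal sketch, using a namedtuple as a stand-in for the landmark object (the `Landmark` type, `to_pixels`, and the 0.5 visibility cutoff are assumptions for illustration):

```python
from collections import namedtuple

# Stand-in for a MediaPipe landmark; only the fields used here.
Landmark = namedtuple("Landmark", ["x", "y", "visibility"])

def to_pixels(landmarks, width, height, min_visibility=0.5):
    """Convert normalized landmarks to integer pixel coordinates,
    skipping points below the visibility threshold."""
    points = []
    for lm in landmarks:
        if lm.visibility < min_visibility:
            continue
        # Clamp so slightly out-of-frame estimates stay inside the image.
        px = min(max(int(lm.x * width), 0), width - 1)
        py = min(max(int(lm.y * height), 0), height - 1)
        points.append((px, py))
    return points

pose = [Landmark(0.5, 0.25, 0.9), Landmark(0.1, 0.1, 0.2), Landmark(1.2, 0.5, 0.8)]
print(to_pixels(pose, 640, 480))  # [(320, 120), (639, 240)]
```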

MediaPipe's real value is real-time speed on mobile: its solutions are built to run at roughly 30 to 60 FPS on recent phones, where an unoptimized PyTorch port of the same task often struggles to reach 5 FPS. CPU/GPU/NPU auto-dispatch, TFLite optimization, and the XNNPACK backend are bundled together.

The limit is also clear - if the task is not predefined, you cannot use it, and training your own model requires the MediaPipe Model Maker detour. MediaPipe owns the "do a fixed job, fast" niche.

Chapter 4 - Detectron2 / Detectron3 - Meta's Orthodox Detection Toolkit

Meta AI Research's Detectron2 has been the de facto academic standard since its 2019 release. Detectron3 entered beta in late 2025, and as of 2026 the two coexist.

The differences:

Item	Detectron2	Detectron3
Default backbones	ResNet, ViT	ConvNeXt v2, DINOv3, SAM2 encoder
Detection heads	Mask R-CNN, Cascade R-CNN	Mask R-CNN, Mask2Former, ViTDet
Training framework	PyTorch 1.x/2.x	PyTorch 2.5+, torch.compile by default
Config system	YACS (yaml), LazyConfig optional	LazyConfig (pythonic)
Distributed training	DDP	FSDP plus activation checkpoint

A Detectron2 code snippet:

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
)
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5

predictor = DefaultPredictor(cfg)
outputs = predictor(image)
# outputs["instances"].pred_boxes, pred_masks, pred_classes

LazyConfig, which Detectron2 already offers alongside YACS and Detectron3 adopts as its config system, replaces yaml with Python objects. IDE autocompletion and type checking work, and conditional logic stays clean.

When to use Detectron? When reproducing papers, comparing COCO/LVIS benchmarks, or when you need a "standard" Mask R-CNN baseline. In production many teams migrate to YOLO or MMDetection.

Chapter 5 - The YOLO Family - From v8 to v12

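The YOLO line moves fast: YOLOv8 arrived in 2023, YOLOv9 and YOLOv10 in 2024, YOLO11 in late 2024, and YOLOv12 in early 2025. Ultralytics develops v8 and 11 and ships implementations of the others, while v9, v10, and v12 originated in separate research groups. No other vision framework matches this pace of major releases.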

Summary:

Version	Release	Key change	License
YOLOv8	2023	Anchor-free, unified classification/segmentation	AGPL-3.0
YOLOv9	2024	PGI (Programmable Gradient Information), GELAN	AGPL-3.0
YOLOv10	2024	NMS-free head, end-to-end training	AGPL-3.0
YOLOv11	2024	C3k2 block, SPPF plus C2PSA, fewer parameters	AGPL-3.0
YOLOv12	2025	Attention-centric architecture, FlashAttention-based	AGPL-3.0
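The "NMS-free head" entry in the table is easier to appreciate with the step it removes in view. Below is a minimal pure-Python sketch of greedy non-maximum suppression, the post-processing that most earlier YOLO versions rely on; the function names and the 0.5 threshold are illustrative choices, not any library's API.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only
    if it does not overlap an already kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

An NMS-free head is trained to emit one box per object directly, so this loop, and its threshold, disappear from the deployment path.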

YOLO's appeal is summarized in a single block:

from ultralytics import YOLO

# 1) Load - five tasks share one API (detect, segment, pose, OBB, classify)
model = YOLO('yolo11n.pt')         # nano
# model = YOLO('yolo11n-seg.pt')   # segmentation
# model = YOLO('yolo11n-pose.pt')  # pose
# model = YOLO('yolo11n-obb.pt')   # oriented bounding box
# model = YOLO('yolo11n-cls.pt')   # classification

# 2) Inference
results = model('input.jpg')
for r in results:
    print(r.boxes.xyxy)    # coordinates
    print(r.boxes.conf)    # confidence
    print(r.boxes.cls)     # class

# 3) Training
model.train(data='coco.yaml', epochs=100, imgsz=640)

# 4) Export - ONNX, TensorRT, CoreML, TFLite all in one line
model.export(format='onnx')
model.export(format='engine')      # TensorRT
model.export(format='coreml')

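A warning about AGPL. Ultralytics YOLO code and weights are AGPL-3.0, so serving them in a SaaS or web service obliges you to release the source of that service under the same license. Teams that cannot do that buy the Ultralytics Enterprise license, or switch to Apache-2.0 detectors such as RT-DETR, DAMO-YOLO, or D-FINE (in their original repositories, not the Ultralytics ports).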

Chapter 6 - MMDetection / MMCV / OpenMMLab - The Widest Catalog

OpenMMLab, run by the Shanghai AI Lab, owns the widest model catalog in vision. It has more than 10 sub-projects including MMDetection (detection), MMSegmentation (segmentation), MMPose (pose), MMTracking (tracking), MMDetection3D (3D detection), and MMYOLO (unified YOLO).

Two distinguishing traits:

First, all models share one config system. Swapping a Mask R-CNN backbone to ConvNeXt, FPN to BiFPN, or the head to DETR is a few lines in a Python config file.

Second, benchmark reproducibility is strong. Reaching paper results within plus or minus 0.3 mAP is normal. It is the framework closest to the academic standard.

from mmdet.apis import init_detector, inference_detector

config = 'configs/yolox/yolox_s_8xb8-300e_coco.py'
checkpoint = 'yolox_s.pth'

model = init_detector(config, checkpoint, device='cuda:0')
result = inference_detector(model, 'demo.jpg')
# result.pred_instances.bboxes, scores, labels

The downside of MMDetection is the steep learning curve. You have to understand the config system, the Registry pattern, and the Hook system before plugging in your own model. Better suited for "going deep" than "starting fast".

Chapter 7 - Roboflow Universe Plus Supervision - From Annotation to Training

Roboflow plays the role for vision data that GitHub plays for code. As of 2026 Roboflow Universe hosts over 300,000 public datasets and over 50,000 pretrained models.

Two key tools:

Roboflow Annotate - Web-based annotation. Boxes, polygons, keypoints, and OBB are all supported. The auto-label feature calls SAM and Grounding DINO to draft proposals.

Supervision - The vision utility kit Roboflow open-sourced. Visualization, filtering, metrics, and trackers all live in one package.

import cv2
import supervision as sv
from ultralytics import YOLO

model = YOLO('yolo11n.pt')
image = cv2.imread('input.jpg')   # BGR array, reused below as the drawing canvas
results = model(image)[0]

# Convert to Roboflow Supervision's Detection object
detections = sv.Detections.from_ultralytics(results)

# Visualize - box plus label in one line
box_annotator = sv.BoxAnnotator()
label_annotator = sv.LabelAnnotator()

annotated = box_annotator.annotate(scene=image, detections=detections)
annotated = label_annotator.annotate(scene=annotated, detections=detections)

# Tracking - ByteTrack in one line
tracker = sv.ByteTrack()
detections = tracker.update_with_detections(detections)

The strength of Supervision is that it separates model inference from visualization and metrics. YOLO, Detectron, and MMDetection outputs all unify under the same Detections object.
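The idea of a shared detection container can be sketched in a few lines. This is not Supervision's implementation: the `Detections` class, the `from_xywh_dicts` adapter, and the input format below are all made up to show the pattern of "one adapter per model, one object downstream".

```python
from dataclasses import dataclass

@dataclass
class Detections:
    """Hypothetical framework-neutral detection container."""
    xyxy: list        # [(x1, y1, x2, y2), ...] in pixels
    confidence: list  # one score per box
    class_id: list    # one integer label per box

    def filter(self, min_conf):
        """Return a new Detections with only boxes at or above min_conf."""
        idx = [i for i, c in enumerate(self.confidence) if c >= min_conf]
        return Detections(
            [self.xyxy[i] for i in idx],
            [self.confidence[i] for i in idx],
            [self.class_id[i] for i in idx],
        )

def from_xywh_dicts(items):
    """Adapter for an invented model output: dicts with x, y, w, h, score, label."""
    return Detections(
        [(d["x"], d["y"], d["x"] + d["w"], d["y"] + d["h"]) for d in items],
        [d["score"] for d in items],
        [d["label"] for d in items],
    )

raw = [
    {"x": 10, "y": 20, "w": 30, "h": 40, "score": 0.9, "label": 0},
    {"x": 0, "y": 0, "w": 5, "h": 5, "score": 0.2, "label": 1},
]
dets = from_xywh_dicts(raw).filter(0.5)
print(dets.xyxy, dets.class_id)  # [(10, 20, 40, 60)] [0]
```

Everything after the adapter (filtering, drawing, metrics, tracking) only ever sees `Detections`, which is why swapping the model does not ripple through the rest of the code.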

Chapter 8 - HuggingFace Transformers Vision - From ViT to DETR

HuggingFace Transformers is not just for NLP. As of 2026 more than 200 vision models are registered.

Representative catalog:

ViT (Vision Transformer) - the classification standard
DETR / Deformable DETR / DINO - transformer-based detection
Mask2Former / OneFormer - unified segmentation
OWL-ViT / OWLv2 - open-vocabulary detection
CLIP / SigLIP / SigLIP 2 - image-text embeddings
DINOv2 / DINOv3 - self-supervised backbones
SAM / SAM 2 - segmentation
Depth Anything v2 / v3 - depth estimation

from transformers import pipeline

# Classification - one line
classifier = pipeline('image-classification', model='google/vit-base-patch16-224')
print(classifier('input.jpg'))

# Detection
detector = pipeline('object-detection', model='facebook/detr-resnet-50')

# Open-vocabulary detection
detector = pipeline('zero-shot-object-detection', model='google/owlv2-base-patch16-ensemble')
print(detector('input.jpg', candidate_labels=['cat', 'dog', 'person']))

# Segmentation
segmenter = pipeline('image-segmentation', model='facebook/mask2former-swin-large-coco-panoptic')

HuggingFace's appeal is one-line inference via pipeline(). Swapping models is just swapping a string. Training is better done elsewhere though - HF Trainer feels awkward for vision compared to PyTorch Lightning or MMDetection.

Chapter 9 - Segment Anything 2 (SAM 2) - The New Standard for Video Masks

Meta's SAM (Segment Anything Model) first appeared in April 2023. SAM 2 released in July 2024 added memory attention for video, not just images. Catch a mask in one frame, and the rest are tracked automatically.

As of 2026, the SAM family:

Model	Release	Parameters	Notes
SAM	2023	91M-636M	Image segmentation
SAM 2	2024	39M-224M	Video plus image
SAM 2.1	late 2024	same	Better on small objects and occlusion
SAMURAI	2024	same	Kalman-filter-based tracking boost
FastSAM	2023	68M	YOLOv8-seg backbone, 50x faster
MobileSAM	2023	9.8M	Lightweight for mobile
EfficientSAM	2023	26M	KD-compressed

SAM 2 usage:

import numpy as np
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    'configs/sam2_hiera_l.yaml',
    'checkpoints/sam2_hiera_large.pt'
)

# Init video
state = predictor.init_state(video_path='video.mp4')

# One click in first frame - mask auto-tracked
predictor.add_new_points(
    inference_state=state,
    frame_idx=0,
    obj_id=1,
    points=np.array([[210, 350]], dtype=np.float32),  # (x, y) in pixels
    labels=np.array([1], dtype=np.int32),             # 1 = positive click
)

# Track masks across all frames
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    # mask_logits shape: (num_objects, 1, H, W); values above 0 are foreground
    masks = (mask_logits > 0.0).cpu().numpy()

SAM 2's value is that one prompt is enough: mark the object once and the model keeps following it through the clip. Labeling cost no longer scales with video length. CVAT, Label Studio, and Roboflow all adopted SAM 2 integration as a built-in feature.

Chapter 10 - Grounding DINO 1.5 / 1.6 - Drawing Boxes from Text

IDEA Research's Grounding DINO is the model that made "open-vocabulary detection" the standard. The 1.0 came in 2023, 1.5 Pro/Edge in 2024, and 1.6 in late 2024.

Traditional YOLO and Detectron only detect classes (80 or 1203) seen during training. Grounding DINO is different - it draws boxes from natural language prompts like "red car", "door handle", or "person holding an umbrella".

from groundingdino.util.inference import load_model, load_image, predict

model = load_model('GroundingDINO_SwinT_OGC.cfg.py', 'groundingdino_swint_ogc.pth')
image_source, image = load_image('input.jpg')

# Natural language prompt - separate noun phrases with periods
TEXT_PROMPT = 'red car. person holding umbrella. door handle.'
BOX_THRESHOLD = 0.35
TEXT_THRESHOLD = 0.25

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_THRESHOLD,
    text_threshold=TEXT_THRESHOLD,
)
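One detail that trips people up: the `boxes` returned by this helper are normalized `(cx, cy, w, h)` values, not pixel corners. A minimal conversion sketch in plain Python (the function name is hypothetical; with PyTorch installed, `torchvision.ops.box_convert` does the same format change on tensors):

```python
def cxcywh_norm_to_xyxy(box, width, height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    x1 = (cx - w / 2) * width
    y1 = (cy - h / 2) * height
    x2 = (cx + w / 2) * width
    y2 = (cy + h / 2) * height
    return (x1, y1, x2, y2)

# A box centered in a 640x480 image, a quarter as wide and half as tall.
print(cxcywh_norm_to_xyxy((0.5, 0.5, 0.25, 0.5), 640, 480))
# (240.0, 120.0, 400.0, 360.0)
```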

Grounded SAM (often written Grounding SAM) is the pipeline in which Grounding DINO produces boxes and SAM (or SAM 2) generates masks inside those boxes. It is the de facto starting point for labeling in 2026.

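import torch
from torchvision.ops import box_convert

# 1) Grounding DINO for boxes - returned as normalized (cx, cy, w, h)
boxes, _, _ = predict(model_gdino, image, 'cat. dog.', 0.35, 0.25)
h, w = image_source.shape[:2]
boxes_xyxy = box_convert(boxes * torch.tensor([w, h, w, h]), 'cxcywh', 'xyxy').numpy()

# 2) SAM for masks - SamPredictor.predict takes one pixel-space xyxy box per call
sam_predictor.set_image(image_source)
masks = [sam_predictor.predict(box=b, multimask_output=False)[0] for b in boxes_xyxy]

These two steps are the entirety of "object segmentation without a dataset". Drawing boxes on 10,000 images by hand collapses into a 30-minute script.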

Chapter 11 - Florence-2 and YOLO-World - Other Open-Vocabulary Contenders

There are two more strong open-vocabulary players besides Grounding DINO.

Florence-2 (Microsoft, 2024) - handles classification, captioning, detection, segmentation, and OCR with one model. Very compact at 0.23B and 0.77B parameters, but the quality is high. Instead of natural language prompts it uses task tokens like <OD> (detection), <DENSE_REGION_CAPTION>, and <REFERRING_EXPRESSION_SEGMENTATION>.

from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained('microsoft/Florence-2-large', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained('microsoft/Florence-2-large', trust_remote_code=True)

image = Image.open('input.jpg').convert('RGB')
prompt = '<OD>'  # Object Detection task token
inputs = processor(text=prompt, images=image, return_tensors='pt')
generated_ids = model.generate(**inputs, max_new_tokens=1024, num_beams=3)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# image_size is (width, height) of the input image
parsed = processor.post_process_generation(generated_text, task='<OD>', image_size=(image.width, image.height))

YOLO-World (Tencent, 2024) - YOLO speed with open-vocabulary on top. 10-20x faster than Grounding DINO, detects from natural language prompts without training.

from ultralytics import YOLOWorld

model = YOLOWorld('yolov8x-worldv2.pt')
model.set_classes(['red car', 'door handle', 'person holding umbrella'])
results = model.predict('input.jpg')

When to pick what. Quality first - Grounding DINO 1.6, speed first - YOLO-World, multiple tasks in one model - Florence-2. These three form the 2026 open-vocabulary triangle.

Chapter 12 - VLM (Vision Language Model) - "Ask the Image"

After GPT-4V appeared in 2023, VLMs became a new layer in computer vision. As of May 2026 the major VLMs are:

Closed source (API)

GPT-4o / GPT-4.5 (OpenAI)
Claude 3.5 Sonnet / Claude Opus 4 (Anthropic)
Gemini 2.0 Flash / Gemini 2.0 Pro (Google)

Open source

Qwen2-VL / Qwen2.5-VL (Alibaba) - 2B/7B/72B
InternVL 2.5 (OpenGVLab) - 1B/2B/4B/8B/26B/38B/78B
LLaVA-OneVision (ByteDance and university collaborators) - 0.5B/7B/72B
CogVLM2 (Zhipu) - 19B
Phi-3.5-vision (Microsoft) - 4.2B
Pixtral 12B (Mistral) - 12B

VLM usage splits into two patterns.

(a) Vision QA - "How many people are in this photo?"

from anthropic import Anthropic
import base64

client = Anthropic()
with open('input.jpg', 'rb') as f:
    img_b64 = base64.standard_b64encode(f.read()).decode('utf-8')

message = client.messages.create(
    model='claude-3-5-sonnet-20241022',
    max_tokens=1024,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'image', 'source': {'type': 'base64', 'media_type': 'image/jpeg', 'data': img_b64}},
            {'type': 'text', 'text': 'How many people are in this image?'}
        ]
    }]
)

(b) Structured Output - "Extract items and prices from this receipt as JSON"

# A JSON schema for the reply lets one call cover both OCR and parsing
schema = {
    'type': 'object',
    'properties': {
        'items': {'type': 'array', 'items': {
            'type': 'object',
            'properties': {
                'name': {'type': 'string'},
                'price': {'type': 'number'},
                'quantity': {'type': 'integer'}
            }
        }},
        'total': {'type': 'number'}
    }
}
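Because VLM output is not guaranteed to follow the schema, it is worth checking the reply before trusting it. The sketch below uses only the standard library; `check_receipt` is a hypothetical helper, and it assumes `price` is a unit price so that the total can be cross-checked against the items.

```python
import json

def check_receipt(raw_json, tolerance=0.01):
    """Parse a receipt JSON string and run basic structural and arithmetic checks."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    if not isinstance(data.get("items"), list):
        raise ValueError("'items' must be a list")
    if not isinstance(data.get("total"), (int, float)):
        raise ValueError("'total' must be a number")
    computed = 0.0
    for item in data["items"]:
        if not isinstance(item.get("name"), str):
            raise ValueError("item 'name' must be a string")
        if not isinstance(item.get("price"), (int, float)):
            raise ValueError("item 'price' must be a number")
        if not isinstance(item.get("quantity"), int):
            raise ValueError("item 'quantity' must be an integer")
        computed += item["price"] * item["quantity"]
    # Flag, rather than reject, replies whose total disagrees with the items.
    data["total_matches"] = abs(computed - data["total"]) <= tolerance
    return data

reply = '{"items": [{"name": "coffee", "price": 4.5, "quantity": 2}], "total": 9.0}'
print(check_receipt(reply)["total_matches"])  # True
```

A reply that fails these checks is a natural candidate for a retry or for human review.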

VLM limits are also clear. (1) Weak coordinates - "third from the top-left" is fine, but exact pixel coordinates are unreliable. (2) Expensive and slow - 10,000 images on GPT-4o cost roughly 10-50 dollars in API fees, while a local YOLO model has no per-image fee and, at 100 fps, finishes in under two minutes. (3) Not deterministic - the same question can produce different answers.

Hence the pattern: ask "what to look for" with a VLM, find "where it is" with a traditional model.
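The cost argument is simple arithmetic, sketched here with the rough figures quoted above (10 to 50 dollars per 10,000 images for the VLM, about 100 fps for the detector). The helper names are illustrative and the numbers are estimates, not benchmarks.

```python
def per_image_cost(total_dollars, num_images):
    """Average API cost per image."""
    return total_dollars / num_images

def batch_seconds(num_images, fps):
    """Wall-clock time to process a batch at a steady frame rate."""
    return num_images / fps

low = per_image_cost(10, 10_000)
high = per_image_cost(50, 10_000)
print(f"VLM: ${low:.3f} to ${high:.3f} per image")
# VLM: $0.001 to $0.005 per image
print(f"Detector at 100 fps: {batch_seconds(10_000, 100):.0f} s")
# Detector at 100 fps: 100 s
```

Fractions of a cent per image sounds small, but it scales linearly with volume, which is why the VLM is usually reserved for the frames that need a question answered.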

Chapter 13 - 3D Vision - DUSt3R, MASt3R, VGGT

The biggest change in 3D vision in 2024-2025 was that 3D reconstruction from just two photos became possible.

DUSt3R (Naver Labs Europe, 2024) - takes two images and directly regresses pixel-wise 3D pointmaps. Works without knowing camera intrinsics. Compressed the complex SfM and MVS pipelines into one model.

MASt3R (Naver Labs Europe, 2024) - DUSt3R plus matching. Outputs pixel correspondences between two images. Directly usable for SLAM and localization.

VGGT (Meta, 2025) - Visual Geometry Grounded Transformer. Takes multiple images at once and estimates camera poses, depth maps, and pointmaps simultaneously. Overcomes the pairwise limit of DUSt3R.

Spann3R (2024) - sequential 3D reconstruction via memory tokens. Closer to video SLAM results.

from dust3r.inference import inference
from dust3r.model import AsymmetricCroCo3DStereo
from dust3r.utils.image import load_images
from dust3r.image_pairs import make_pairs

model = AsymmetricCroCo3DStereo.from_pretrained('naver/DUSt3R_ViTLarge_BaseDecoder_512_dpt').to('cuda')

images = load_images(['img1.jpg', 'img2.jpg'], size=512)
pairs = make_pairs(images, scene_graph='complete', prefilter=None, symmetrize=True)
output = inference(pairs, model, device='cuda', batch_size=1)
# extract pointmap, confidence, depth from output

This is a field where Korean contributions are large. The Grenoble team at Naver Labs Europe made DUSt3R, MASt3R, and CroCo.

Chapter 14 - Depth Anything v2/v3, Marigold, DepthPro

Monocular depth estimation exploded in 2024-2025.

Depth Anything v2 (HKU, 2024) - a strong depth model trained on 62 million unlabeled images. Four sizes: Small (24M), Base (97M), Large (335M), Giant (1.3B). Depth Anything v3 (2025) strengthened video consistency and metric depth.

Marigold (ETH Zürich, 2024) - Stable Diffusion fine-tuned for depth. Diffusion-based means good detail, but slow.

DepthPro (Apple, 2024) - estimates metric depth from one image in about 0.3 seconds on a GPU, without needing camera intrinsics. A natural fit for depth on devices without LiDAR.

from transformers import pipeline

pipe = pipeline(task='depth-estimation', model='depth-anything/Depth-Anything-V2-Large-hf')
depth = pipe('input.jpg')['depth']  # PIL Image

When to pick what. Relative depth is enough - Depth Anything, metric depth needed - DepthPro, detail first - Marigold.
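The gap between "relative" and "metric" depth is smaller than it sounds when a few reference measurements are available. A common trick is to fit a scale and shift by least squares. The sketch below assumes the relative prediction is affinely related to the metric quantity (many models predict inverse depth, in which case the fit would be done in inverse-depth space); `fit_scale_shift` is a hypothetical helper name.

```python
def fit_scale_shift(relative, metric):
    """Least-squares fit of metric ~= scale * relative + shift."""
    n = len(relative)
    if n != len(metric) or n < 2:
        raise ValueError("need at least two paired samples")
    sum_r = sum(relative)
    sum_m = sum(metric)
    sum_rr = sum(r * r for r in relative)
    sum_rm = sum(r * m for r, m in zip(relative, metric))
    denom = n * sum_rr - sum_r ** 2
    if denom == 0:
        raise ValueError("relative values are constant; scale is undefined")
    scale = (n * sum_rm - sum_r * sum_m) / denom
    shift = (sum_m - scale * sum_r) / n
    return scale, shift

rel = [0.0, 0.5, 1.0]   # model output at three pixels
met = [1.0, 2.0, 3.0]   # known distances in meters at the same pixels
scale, shift = fit_scale_shift(rel, met)
print(round(scale, 6), round(shift, 6))  # 2.0 1.0
```

When no reference measurements exist at all, this alignment is not possible, and that is the case where a metric model earns its place.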

Chapter 15 - Pose Estimation - MMPose, OpenPose, AlphaPose, DWPose

Pose estimation finds keypoints. The 2026 standard catalog:

Tool	Keypoints	Notes
MediaPipe Pose	33	Mobile real-time
OpenPose	25 (BODY_25)	Multi-person, older standard
AlphaPose	17 (COCO)	Top-down accuracy
MMPose	17-133	Widest model catalog
DWPose	133 (full body, face, hands)	The standard for ControlNet pose conditioning
RTMPose	17	Mobile real-time, under MMPose

DWPose became the de facto pose standard in 2024-2026. The reason is simple - ControlNet-based pipelines, including those built around AnimateDiff and Stable Video Diffusion, commonly take rendered DWPose keypoints as conditions. Pose conditioning in generative AI is mostly DWPose.

from mmpose.apis import inference_topdown, init_model

config = 'configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-x_8xb320-270e_cocktail14-384x288.py'
checkpoint = 'rtmw-x_simcc-cocktail14_pt-ucoco_270e-384x288.pth'

model = init_model(config, checkpoint, device='cuda')
results = inference_topdown(model, 'input.jpg', bboxes=person_boxes)
# results[i].pred_instances.keypoints, keypoint_scores

Chapter 16 - Object Tracking - ByteTrack, BoT-SORT, OC-SORT, DEVA

Tracking assigns the same ID to the same object across video frames. The 2026 standard narrows to four:

Tracker	Input	Notes
ByteTrack (2022)	Box plus confidence	Two-stage matching for low-confidence boxes. Most widely used
BoT-SORT (2022)	Box plus ReID embedding	Camera motion compensation
OC-SORT (2023)	Box	Observation-centric, robust to occlusion
DEVA (2023)	Box plus mask	Pairs with SAM, video segmentation tracking

ByteTrack is integrated into Roboflow Supervision, so it is a one-liner:

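import supervision as sv

# Parameter names follow recent supervision releases (older ones used track_thresh / track_buffer)
tracker = sv.ByteTrack(track_activation_threshold=0.5, lost_track_buffer=30)

for frame, detections in stream():  # stream() is a placeholder for your frame source
    detections = tracker.update_with_detections(detections)
    # detections.tracker_id holds IDs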

Remember also that SAM 2 itself acts as a tracker. One click in one frame, and the mask propagates across the whole video. Tracking and segmentation are now merged into one model.
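The "two-stage matching" that the table credits to ByteTrack can be illustrated with a deliberately simplified sketch. Real ByteTrack matches against Kalman-predicted boxes with Hungarian assignment and also creates and retires tracks; the version below uses greedy IoU matching on fixed boxes, and every name and threshold in it is an assumption made for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def greedy_match(tracks, dets, det_ids, min_iou):
    """Pair track ids with detection indices, best IoU first."""
    pairs = sorted(
        ((iou(box, dets[j]["box"]), tid, j)
         for tid, box in tracks.items() for j in det_ids),
        reverse=True,
    )
    matches, used_tracks, used_dets = {}, set(), set()
    for score, tid, j in pairs:
        if score < min_iou:
            break
        if tid in used_tracks or j in used_dets:
            continue
        matches[tid] = j
        used_tracks.add(tid)
        used_dets.add(j)
    return matches

def associate(tracks, dets, high_thresh=0.5, min_iou=0.3):
    """Stage 1: match high-confidence detections. Stage 2: offer the
    low-confidence ones to tracks that are still unmatched."""
    high = [j for j, d in enumerate(dets) if d["score"] >= high_thresh]
    low = [j for j, d in enumerate(dets) if d["score"] < high_thresh]
    first = greedy_match(tracks, dets, high, min_iou)
    remaining = {t: b for t, b in tracks.items() if t not in first}
    second = greedy_match(remaining, dets, low, min_iou)
    return {**first, **second}

tracks = {1: (0, 0, 10, 10), 2: (50, 50, 60, 60)}
dets = [
    {"box": (1, 1, 11, 11), "score": 0.9},    # clear view of track 1
    {"box": (51, 51, 61, 61), "score": 0.2},  # partly occluded track 2
]
print(associate(tracks, dets))  # {1: 0, 2: 1}
```

The second stage is the point: a detector that drops to 0.2 confidence under occlusion would normally be discarded, and the track would be lost. Keeping those boxes for a second pass preserves the ID.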

Chapter 17 - Diffusion-Based Vision - ControlNet, IP-Adapter

Generative vision is no longer "another field". ControlNet and IP-Adapter take detection or segmentation outputs as input conditions and generate new images.

ControlNet - takes Canny edge, depth, pose (DWPose), segmentation, normal as conditions
IP-Adapter - takes an image itself as style or content condition
T2I-Adapter - a lighter alternative to ControlNet

from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
import torch

controlnet = ControlNetModel.from_pretrained('thibaud/controlnet-openpose-sdxl-1.0', torch_dtype=torch.float16)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    controlnet=controlnet,
    torch_dtype=torch.float16
).to('cuda')

pose_image = compute_dwpose(input_image)  # placeholder helper: renders DWPose keypoints as an image
image = pipe('a person dancing in the rain', image=pose_image).images[0]

This snippet is the standard "vision recognition -> vision generation" pipeline. Detection and generation now live in the same toolchain.

Chapter 18 - Embedding Models - CLIP, SigLIP, DINOv2, DINOv3

Models that turn images into vectors are the most-called computer vision models in 2026. Image search, dedup, clustering, and zero-shot classification all run on embeddings.

Model	Training signal	Dim	Strength
CLIP (2021)	Image-text pairs	512/768	Aligned with text
OpenCLIP	Same, larger data	Same	Stronger baseline
SigLIP (2023)	Sigmoid loss	Same	More efficient than CLIP
SigLIP 2 (2025)	Multi-task	Same	Strong on OCR, documents
DINOv2 (2023)	Self-supervised (SSL)	768/1024/1536	Text-independent, strong features
DINOv3 (2025)	Self-supervised, larger	Same	Successor to DINOv2

When to pick which embedding:

Text-to-image search -> CLIP or SigLIP 2
Image-to-image search -> DINOv3
Downstream classification head training -> DINOv3
OCR or document work -> SigLIP 2

from PIL import Image
from transformers import AutoModel, AutoImageProcessor
import torch

model_id = 'facebook/dinov3-vitl16-pretrain-lvd1689m'
model = AutoModel.from_pretrained(model_id)
processor = AutoImageProcessor.from_pretrained(model_id)

image = Image.open('input.jpg').convert('RGB')
inputs = processor(images=image, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# Mean over all output tokens (CLS, register, and patch tokens) -> (1, 1024)
embedding = outputs.last_hidden_state.mean(dim=1)
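Once images are vectors, search and dedup reduce to comparing vectors. A minimal cosine-similarity lookup in plain Python is sketched below; the three-dimensional vectors and names are toy values, and a real index would hold the 1024-dimensional embeddings and use a vector library or database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, index, k=2):
    """Return the k names in index whose vectors are most similar to query."""
    ranked = sorted(index, key=lambda name: cosine(query, index[name]), reverse=True)
    return ranked[:k]

index = {
    "cat_1": [1.0, 0.0, 0.0],
    "cat_2": [0.9, 0.1, 0.0],
    "car_1": [0.0, 0.0, 1.0],
}
print(top_k([1.0, 0.05, 0.0], index))  # ['cat_1', 'cat_2']
```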

Chapter 19 - Annotation Tools - CVAT, Label Studio, Roboflow, VIA

Labeling tools in 2026 share one common trait - AI-assist is the default. SAM, Grounding DINO, and YOLO are called directly inside the tool.

Tool	License	Strength
CVAT	MIT, open source	Widest format support, strong on video
Label Studio	Apache 2.0	Unified NLP/audio/image with ML backends
Roboflow Annotate	Commercial SaaS	SAM and Grounding DINO integration, collaboration
VIA	BSD, open source	Lightweight single HTML file

Team-scale work - Roboflow or CVAT, model integration important - Label Studio, start within 5 minutes - VIA.

Chapter 20 - Inference Runtimes - ONNX Runtime, TensorRT, OpenVINO

Running trained models fast is what an inference runtime does. The 2026 options:

Runtime	Target	Strength
ONNX Runtime	Cross-platform	CPU/GPU/NPU, most standard
TensorRT	NVIDIA GPU	Top speed, INT8 and FP8 quantization
OpenVINO	Intel CPU/iGPU/NPU	Best on x86 PC
CoreML	Apple Silicon	iOS/macOS, leverages ANE
TFLite	Mobile (Android)	XNNPACK, Hexagon
NCNN	Mobile/embedded	Tencent, ARM-optimized
MNN	Mobile/embedded	Alibaba, strong OpenCL

Ultralytics YOLO exports to most of these with one line per format:

from ultralytics import YOLO
model = YOLO('yolo11n.pt')

model.export(format='onnx', dynamic=True, simplify=True)
model.export(format='engine', half=True)        # TensorRT FP16
model.export(format='openvino', int8=True)      # OpenVINO INT8
model.export(format='coreml', nms=True)         # CoreML
model.export(format='tflite', int8=True)        # TFLite INT8
model.export(format='ncnn')                     # NCNN

Selection rules. NVIDIA GPU server -> TensorRT. Intel PC -> OpenVINO. iOS/macOS -> CoreML. Android -> TFLite. Not sure -> ONNX Runtime.
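The selection rules are small enough to encode as a lookup, which is handy in a deployment script. The target keys below are made-up labels, not identifiers from any tool.

```python
# Mapping from a deployment target label (hypothetical names) to a runtime.
RUNTIME_BY_TARGET = {
    "nvidia_gpu": "TensorRT",
    "intel_pc": "OpenVINO",
    "ios": "CoreML",
    "macos": "CoreML",
    "android": "TFLite",
}

def pick_runtime(target: str) -> str:
    """Return the suggested runtime, falling back to ONNX Runtime."""
    return RUNTIME_BY_TARGET.get(target.lower(), "ONNX Runtime")

print(pick_runtime("android"))       # TFLite
print(pick_runtime("raspberry_pi"))  # ONNX Runtime
```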

Chapter 21 - Mobile Vision - ML Kit, Vision Framework, MNN

When you ship vision in a mobile app, native frameworks are the safest bet.

Google ML Kit (Android/iOS) - faces, barcodes, text, landmarks, translation, pose. On-device or server options.

Apple Vision Framework (iOS/macOS) - a large catalog of vision requests such as VNDetectFaceRectanglesRequest and VNDetectHumanBodyPoseRequest. VNGenerateForegroundInstanceMaskRequest, added in iOS 17 (2023), fills a role similar to a mobile SAM.

MNN (Alibaba) - the de facto standard in the Chinese mobile ecosystem. Embedded in Alibaba, Pinduoduo, and ByteDance apps.

// Swift / Apple Vision
import Vision

let request = VNDetectHumanBodyPoseRequest { request, error in
    guard let obs = request.results as? [VNHumanBodyPoseObservation] else { return }
    for observation in obs {
        let points = try? observation.recognizedPoints(.all)
        // points holds keypoints
    }
}
let handler = VNImageRequestHandler(cgImage: cgImage)
try? handler.perform([request])

Rule. Standard ML features (faces, text, barcodes) - native, fast and free. Custom models - CoreML (iOS) and TFLite (Android), ship your own model.

Chapter 22 - The Korean Vision Ecosystem

Korea's computer vision contributions are heavy on both the academic and industrial sides.

Naver Labs Europe - the Grenoble team behind DUSt3R, MASt3R, and CroCo. Frontier in 3D vision worldwide.

KAIST CVLab - groups of Professors In So Kweon, Yong Man Ro, and Eunbyung Park. Regular publications at CVPR and ICCV.

Lunit - medical imaging AI. Holds many FDA and CE clearances for chest X-ray, mammography, and digital pathology. INSIGHT CXR, MMG, BCC are its flagship products.

Seerslab - AR/VR/vision. Naver Z subsidiary, runs face recognition and filter engines for ZEPETO and SNOW.

VUNO - medical imaging AI. DeepASR, DeepCT, DeepBrain.

MakinaRocks - industrial vision anomaly detection. Semiconductor and display inspection.

Riiid / Trinity / Innospace - education and defense.

Korean academia is strong in faces, OCR, autonomous driving, and medical imaging. Korea sits in the world's top 4-5 for CVPR publications.

Chapter 23 - The Japanese Vision Ecosystem

Preferred Networks (PFN) - Japan's largest AI company. PFN-Vision library, vision for chemistry, materials, and robotics.

ABEJA - Tokyo-based vision SaaS. Store analytics for retail and manufacturing.

Recruit Holdings - vision recruiting systems via subsidiaries like Indeed and JOBSRU.

ALBERT (now Accenture Japan) - industrial AI and vision consulting.

Nikon AI - medical imaging and industrial inspection. A camera company expanded into vision AI.

LeapMind - strong embedded vision via the Blueoil quantized inference engine.

SoftBank Robotics - vision systems on Pepper and NAO.

Fast Retailing (UNIQLO) - in-house vision library for store camera analytics.

Japanese academia is strong in OCR, document vision, and robotics vision. UTokyo IIS, Kyoto U, Nagoya U, and Tokyo Tech regularly publish at CVPR.

Chapter 24 - The Posture of a 2026 Computer Vision Engineer

Boiling down the preceding chapters into a single page:

(1) Avoid training whenever possible. A combination of pretrained models, open-vocabulary models, and VLMs covers 80%. Refining a Grounding DINO prompt takes less time than collecting 10,000 images.

(2) Design pipelines, do not train models. The job in 2026 is "which model to call, in what order", not "which loss to train on".

(3) Divide labor between VLMs and traditional models. Ask the VLM "what to find", ask the traditional model "where it is". This is the standard shape of a 2026 vision system.

(4) Do not underestimate the inference runtime. A model that runs at 30 fps in PyTorch can reach 200 fps in TensorRT. A gap of more than 6x makes for a different product in user experience.

(5) Check the license. The YOLO family is AGPL. Dropping it straight into a company product triggers source disclosure. Know RT-DETR, D-FINE, and MMDetection as alternatives.

(6) Leave "pixel work" to OpenCV. Color spaces, resizing, decoding, encoding - OpenCV is still the fastest and most stable.

(7) Labeling tools come before datasets. Start with CVAT or Roboflow with SAM 2 integration. The era of drawing boxes by hand is over.
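Point (4) is only actionable if the gap is measured on your own hardware. A minimal timing harness is sketched below: `measure_fps` times any zero-argument callable (here you would wrap a PyTorch or TensorRT inference call), and `speedup` is the ratio quoted in the text. Both names and the warm-up and iteration counts are arbitrary choices.

```python
import time

def measure_fps(infer, warmup=5, iterations=50):
    """Call infer() repeatedly and return the average calls per second."""
    for _ in range(warmup):  # warm-up runs are excluded from timing
        infer()
    start = time.perf_counter()
    for _ in range(iterations):
        infer()
    elapsed = time.perf_counter() - start
    return iterations / elapsed if elapsed > 0 else float("inf")

def speedup(baseline_fps, optimized_fps):
    """How many times faster the optimized path is."""
    return optimized_fps / baseline_fps

print(f"{speedup(30, 200):.1f}x")  # 6.7x
```

For GPU runtimes, the wrapped callable should synchronize with the device before returning; otherwise the loop measures only how fast work is queued.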

The future of a vision engineer is not "the person who makes the model" but "the person who composes the models". There are many tools. Pick the right ones and put them in the right order.

References

OpenCV - https://opencv.org/
OpenCV 5.x roadmap - https://github.com/opencv/opencv/wiki/OpenCV-5
MediaPipe - https://developers.google.com/mediapipe
MediaPipe Tasks API - https://developers.google.com/mediapipe/solutions/tasks
Detectron2 - https://github.com/facebookresearch/detectron2
Detectron2 Model Zoo - https://github.com/facebookresearch/detectron2/blob/main/MODEL_ZOO.md
Ultralytics YOLO - https://github.com/ultralytics/ultralytics
Ultralytics docs - https://docs.ultralytics.com/
MMDetection - https://github.com/open-mmlab/mmdetection
MMPose - https://github.com/open-mmlab/mmpose
Roboflow - https://roboflow.com/
Roboflow Supervision - https://github.com/roboflow/supervision
HuggingFace Transformers Vision - https://huggingface.co/docs/transformers/main/en/tasks/object_detection
Segment Anything 2 - https://github.com/facebookresearch/sam2
SAM 2 paper - https://ai.meta.com/research/publications/sam-2-segment-anything-in-images-and-videos/
Grounding DINO - https://github.com/IDEA-Research/GroundingDINO
Grounding SAM - https://github.com/IDEA-Research/Grounded-Segment-Anything
Florence-2 - https://huggingface.co/microsoft/Florence-2-large
YOLO-World - https://github.com/AILab-CVC/YOLO-World
DUSt3R - https://github.com/naver/dust3r
MASt3R - https://github.com/naver/mast3r
VGGT - https://github.com/facebookresearch/vggt
Depth Anything v2 - https://github.com/DepthAnything/Depth-Anything-V2
Marigold - https://github.com/prs-eth/Marigold
Apple DepthPro - https://github.com/apple/ml-depth-pro
DINOv3 - https://github.com/facebookresearch/dinov3
CLIP - https://github.com/openai/CLIP
SigLIP 2 - https://huggingface.co/collections/google/siglip2
DWPose - https://github.com/IDEA-Research/DWPose
ByteTrack - https://github.com/ifzhang/ByteTrack
ONNX Runtime - https://onnxruntime.ai/
TensorRT - https://developer.nvidia.com/tensorrt
OpenVINO - https://docs.openvino.ai/
Apple CoreML - https://developer.apple.com/documentation/coreml
TFLite - https://www.tensorflow.org/lite
NCNN - https://github.com/Tencent/ncnn
MNN - https://github.com/alibaba/MNN
CVAT - https://github.com/cvat-ai/cvat
Label Studio - https://labelstud.io/
Naver Labs Europe - https://europe.naverlabs.com/
Lunit - https://www.lunit.io/
Preferred Networks - https://www.preferred.jp/en/