Computer Vision Frameworks 2026 - OpenCV 4, MediaPipe, Detectron2, YOLO v11, MMDetection, SAM 2, Grounding DINO Deep Dive

Prologue - Computer Vision Became "Asking" Instead of "Seeing"

Computer vision in the 2010s had a clear shape. SIFT, HOG, and Haar extracted features, SVMs and random forests classified them, and OpenCV tied it all together. The early 2020s belonged to ResNet, EfficientNet, and Mask R-CNN - 90% of the job was collecting datasets, training models, and squeezing out mAP.

The landscape in 2026 looks different. A single sentence like "estimate the pose of the person in a red helmet in this photo" maps to a three-line pipeline: Grounding DINO catches the box, SAM 2 makes the mask, MMPose extracts the keypoints. We barely train anything. Instead, we design "which model to call, in what order".

This article walks through the 2026 computer vision stack in one breath. From the basics of OpenCV through SAM 2 and VLMs to DINOv3, DUSt3R, and mobile inference - the criteria for choosing the right tool, packed into a single page.


Chapter 1 - The 2026 CV Stack Map

Before diving into individual tools, let me draw the overall map. The 2026 computer vision world splits into five layers.

[Layer 5] Vision Language Model (VLM)
            GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash
            Qwen2-VL, InternVL 2.5, Pixtral 12B
                       |
[Layer 4] Open-Vocabulary Foundation
            Grounding DINO 1.6, Florence-2, YOLO-World
            SAM 2, DINOv3, CLIP, SigLIP
                       |
[Layer 3] Task-Specific Model
            YOLO v11, Detectron3, MMDetection
            MMPose, DWPose, ByteTrack, Depth Anything v3
                       |
[Layer 2] Inference Runtime
            ONNX Runtime, TensorRT, OpenVINO
            CoreML, TFLite, NCNN, MNN
                       |
[Layer 1] Image I/O and Primitives
            OpenCV 4.10, Pillow-SIMD
            FFmpeg, GStreamer

The higher the layer, the more "intelligent" the system gets, but latency rises with it. VLMs typically run at one or two frames per second, while YOLO v11 runs at over 100 fps on a modern GPU. The job of a 2026 vision engineer is combining these two extremes.

One-line summary: "Ask the question with a VLM, draw the answer with YOLO."
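The split between a slow "asking" layer and a fast "drawing" layer can be sketched as a simple cascade. This is a minimal illustration, not a reference design: `ask_vlm`, `detect`, and the `vlm_every` interval are hypothetical names, and the two models are replaced by stubs so the sketch runs without any vision library.

```python
from typing import Callable, Iterable, List


def run_cascade(frames: Iterable[object],
                ask_vlm: Callable[[object], List[str]],
                detect: Callable[[object, List[str]], list],
                vlm_every: int = 30) -> list:
    """Call the slow model every `vlm_every` frames and the fast one on every frame."""
    labels: List[str] = []
    outputs = []
    for i, frame in enumerate(frames):
        if i % vlm_every == 0:
            labels = ask_vlm(frame)             # slow: decides WHAT to look for
        outputs.append(detect(frame, labels))   # fast: finds WHERE it is
    return outputs


# Stub models (assumptions) standing in for a real VLM and a real detector.
vlm_calls = []


def fake_vlm(frame):
    vlm_calls.append(frame)
    return ["person", "helmet"]


def fake_detector(frame, labels):
    return [(label, frame) for label in labels]


results = run_cascade(range(90), fake_vlm, fake_detector, vlm_every=30)
```

With 90 frames and an interval of 30, the slow model is consulted three times while the fast model sees every frame, which is the latency trade the layer map describes.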


Chapter 2 - OpenCV 4.10 / 5.x - Still the Starting Point for Everything

OpenCV did not die. It got even stronger in 2026. The reason is simple - reading images, cropping, converting color spaces, decoding video frames are all required by any deep learning pipeline.

As of May 2026 OpenCV 4.10 is the LTS, and 5.0 beta is in active development. Three key changes stand out.

First, the DNN module became the default for ONNX inference. You can call YOLO, ResNet, or ViT in one line via cv2.dnn.readNetFromONNX() without going through PyTorch or TensorFlow.

Second, G-API (Graph API) is stable. It expresses input-to-output as a graph and runs on OpenCL, CUDA, or Vulkan backends. Especially powerful on mobile and embedded.

Third, CUDA and OpenCL acceleration are part of the library. On a CUDA-enabled build, the cv2.cuda module runs Gaussian blur, optical flow, and image warping directly on the GPU.

import cv2

# 1) Read image - color order is BGR (careful!)
img = cv2.imread('input.jpg')
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# 2) Resize - INTER_AREA is best for shrinking
small = cv2.resize(img_rgb, (640, 640), interpolation=cv2.INTER_AREA)

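# 3) DNN inference - load ONNX model directly
net = cv2.dnn.readNetFromONNX('yolov11n.onnx')
# The CUDA backend needs an OpenCV build compiled with CUDA support
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

# small is already RGB, so do not swap channels a second time
blob = cv2.dnn.blobFromImage(small, 1/255.0, (640, 640), swapRB=False)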
net.setInput(blob)
outputs = net.forward()

Two things to remember: OpenCV uses the BGR color space (different from PIL/PyTorch), and imread returns None on failure (it does not raise). These two facts cost someone an hour of debugging every week in 2026.
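Both pitfalls are easy to guard against with a thin wrapper. The sketch below is an assumption-laden illustration: `read_or_raise` and `bgr_to_rgb_pixel` are hypothetical helper names, and the reader is passed in as a parameter so the example runs even without OpenCV installed.

```python
def read_or_raise(path, reader):
    """Wrap a reader that signals failure by returning None, as cv2.imread does."""
    img = reader(path)
    if img is None:
        raise FileNotFoundError(f"could not read image: {path}")
    return img


def bgr_to_rgb_pixel(pixel):
    """Reverse the channel order of a single (B, G, R) pixel."""
    b, g, r = pixel
    return (r, g, b)


# A reader that always fails, standing in for cv2.imread on a bad path.
try:
    read_or_raise("missing.jpg", lambda p: None)
    failed_loudly = False
except FileNotFoundError:
    failed_loudly = True

pure_blue_as_rgb = bgr_to_rgb_pixel((255, 0, 0))
```

With OpenCV available, the same wrapper would be called as read_or_raise('input.jpg', cv2.imread), turning the silent None into an exception at the point of failure.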


Chapter 3 - MediaPipe 0.10 / MediaPipe Studio - The New Standard for Mobile Real-Time

Google's MediaPipe went through a major shift in 2023. The legacy "MediaPipe Solutions" APIs were superseded by the MediaPipe Tasks API, and the no-code train/deploy tool MediaPipe Studio appeared.

As of 2026, MediaPipe offers the following solutions through one-line APIs:

  • Hand Landmarker - 21 hand keypoints
  • Pose Landmarker - 33 body keypoints plus segmentation mask
  • Face Landmarker - 478 facial mesh points plus blendshapes
  • Image Embedder - MobileNet-V3 embeddings
  • Object Detector - EfficientDet-Lite
  • Image Segmenter - Selfie segmentation, hair segmentation
  • Gesture Recognizer - 7 pre-trained gestures
  • Image Classifier - EfficientNet-Lite

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Pose Landmarker - one-line instance creation
options = vision.PoseLandmarkerOptions(
    base_options=python.BaseOptions(model_asset_path='pose_landmarker.task'),
    running_mode=vision.RunningMode.VIDEO,
    num_poses=2,
    min_pose_detection_confidence=0.5,
)
landmarker = vision.PoseLandmarker.create_from_options(options)

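# Inference per frame - rgb_frame is an RGB numpy array; timestamps must keep increasing
mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb_frame)
result = landmarker.detect_for_video(mp_image, timestamp_ms)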
for pose in result.pose_landmarks:
    for lm in pose:
        print(lm.x, lm.y, lm.z, lm.visibility)

MediaPipe's real value is sustained real-time throughput, typically 30 to 60 FPS on recent phones. The same task in unoptimized PyTorch often struggles to reach 5 FPS on a phone. CPU/GPU/NPU auto-dispatch, TFLite optimization, and the XNNPACK backend are bundled together.

The limit is also clear - if the task is not predefined, you cannot use it, and training your own model requires the MediaPipe Model Maker detour. MediaPipe owns the "do a fixed job, fast" niche.
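The Pose Landmarker call above takes a timestamp in milliseconds and returns landmarks in normalized image coordinates. A minimal sketch of the two small conversions around it follows; the helper names are hypothetical, and it assumes a constant frame rate and landmark x/y values normalized to the image size.

```python
def frame_timestamp_ms(frame_index: int, fps: float) -> int:
    """Derive an increasing millisecond timestamp from a frame index."""
    return int(frame_index * 1000 / fps)


def to_pixels(x_norm: float, y_norm: float, width: int, height: int):
    """Map normalized landmark coordinates to pixel indices, clamped to the image."""
    px = min(max(int(x_norm * width), 0), width - 1)
    py = min(max(int(y_norm * height), 0), height - 1)
    return px, py


timestamps = [frame_timestamp_ms(i, fps=30.0) for i in range(4)]
center = to_pixels(0.5, 0.5, width=640, height=480)
clamped = to_pixels(1.0, 1.2, width=640, height=480)
```

Clamping matters because landmarks for body parts near the frame edge can fall slightly outside the image.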


Chapter 4 - Detectron2 / Detectron3 - Meta's Orthodox Detection Toolkit

Meta AI Research's Detectron2 has been the de facto academic standard since its 2019 release. Detectron3 entered beta in late 2025, and as of 2026 the two coexist.

The differences:

| Item | Detectron2 | Detectron3 |
| --- | --- | --- |
| Default backbones | ResNet, ViT | ConvNeXt v2, DINOv3, SAM2 encoder |
| Detection heads | Mask R-CNN, Cascade R-CNN | Mask R-CNN, Mask2Former, ViTDet |
| Training framework | PyTorch 1.x/2.x | PyTorch 2.5+, torch.compile by default |
| Config system | YACS (yaml) | LazyConfig (pythonic) |
| Distributed training | DDP | FSDP plus activation checkpoint |

A Detectron2 code snippet:

import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
)
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5

predictor = DefaultPredictor(cfg)
image = cv2.imread('input.jpg')  # DefaultPredictor expects a BGR array by default
outputs = predictor(image)
# outputs["instances"].pred_boxes, pred_masks, pred_classes

LazyConfig, which Detectron2 already ships as an optional alternative to YACS, is the default in Detectron3 and replaces yaml with Python objects. IDE autocompletion and type checking work, and conditional logic stays clean.

When to use Detectron? When reproducing papers, comparing COCO/LVIS benchmarks, or when you need a "standard" Mask R-CNN baseline. In production many teams migrate to YOLO or MMDetection.


Chapter 5 - The YOLO Family - From v8 to v12

The YOLO line, most of it usable through the Ultralytics package, shipped v8 in 2023, v9 and v10 in 2024, v11 in late 2024, and v12 in early 2025. No other vision framework matches this pace of major releases.

Summary:

| Version | Release | Key change | License |
| --- | --- | --- | --- |
| YOLOv8 | 2023 | Anchor-free, unified classification/segmentation | AGPL-3.0 |
| YOLOv9 | 2024 | PGI (Programmable Gradient Information), GELAN | AGPL-3.0 |
| YOLOv10 | 2024 | NMS-free head, end-to-end training | AGPL-3.0 |
| YOLOv11 | 2024 | C3k2 block, SPPF plus C2PSA, fewer parameters | AGPL-3.0 |
| YOLOv12 | 2025 | Attention-centric architecture, FlashAttention-based | AGPL-3.0 |
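To see what the "NMS-free head" entry removes, it helps to look at the post-processing step the other versions rely on. The following is a minimal pure-Python sketch of greedy non-maximum suppression over (x1, y1, x2, y2) boxes. It is an illustration of the general technique, not the Ultralytics implementation, and the 0.5 threshold is an assumed default.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box and drop lower-scoring boxes that overlap a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 260, 260)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # indices of surviving boxes
```

The first two boxes overlap heavily, so only the higher-scoring one survives. An end-to-end head is trained to emit one box per object directly, so this pass is not needed at inference time.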

YOLO's appeal is summarized in a single block:

from ultralytics import YOLO

# 1) Load - five tasks share one API
model = YOLO('yolo11n.pt')         # nano
# model = YOLO('yolo11n-seg.pt')   # segmentation
# model = YOLO('yolo11n-pose.pt')  # pose
# model = YOLO('yolo11n-obb.pt')   # oriented bounding box
# model = YOLO('yolo11n-cls.pt')   # classification

# 2) Inference
results = model('input.jpg')
for r in results:
    print(r.boxes.xyxy)    # coordinates
    print(r.boxes.conf)    # confidence
    print(r.boxes.cls)     # class

# 3) Training
model.train(data='coco.yaml', epochs=100, imgsz=640)

# 4) Export - ONNX, TensorRT, CoreML, TFLite all in one line
model.export(format='onnx')
model.export(format='engine')      # TensorRT
model.export(format='coreml')

Warning about AGPL. Serving Ultralytics YOLO models from a SaaS or web service triggers AGPL-3.0 source disclosure obligations for that service. Keeping your code closed requires the Ultralytics Enterprise license. Some companies avoid the issue by using Apache-2.0 detectors such as RT-DETR, DAMO-YOLO, or D-FINE.


Chapter 6 - MMDetection / MMCV / OpenMMLab - The Widest Catalog

OpenMMLab, run by the Shanghai AI Lab, owns the widest model catalog in vision. It has more than 10 sub-projects including MMDetection (detection), MMSegmentation (segmentation), MMPose (pose), MMTracking (tracking), MMDetection3D (3D detection), and MMYOLO (unified YOLO).

Two distinguishing traits:

First, all models share one config system. Swapping a Mask R-CNN backbone to ConvNeXt, FPN to BiFPN, or the head to DETR is a few lines in a Python config file.

Second, benchmark reproducibility is strong. Reaching paper results within plus or minus 0.3 mAP is normal. It is the framework closest to the academic standard.

from mmdet.apis import init_detector, inference_detector

config = 'configs/yolox/yolox_s_8xb8-300e_coco.py'
checkpoint = 'yolox_s.pth'

model = init_detector(config, checkpoint, device='cuda:0')
result = inference_detector(model, 'demo.jpg')
# result.pred_instances.bboxes, scores, labels

The downside of MMDetection is the steep learning curve. You have to understand the config system, the Registry pattern, and the Hook system before plugging in your own model. Better suited for "going deep" than "starting fast".
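The shared config system is easier to reason about once the inheritance rule is clear: a child config overrides only the keys it names. The sketch below is a simplified stand-in written with plain dictionaries, not MMEngine's actual code. The `merge_config` helper is hypothetical, while the `_delete_` key mirrors the OpenMMLab convention for replacing a section outright instead of merging into it.

```python
import copy


def merge_config(base: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and value.get("_delete_"):
            # Replace the whole section and drop inherited keys.
            merged[key] = {k: copy.deepcopy(v) for k, v in value.items() if k != "_delete_"}
        elif isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


base = {
    "model": {
        "type": "MaskRCNN",
        "backbone": {"type": "ResNet", "depth": 50},
        "neck": {"type": "FPN"},
    }
}
# Swap only the backbone; everything else is inherited from the base.
override = {"model": {"backbone": {"_delete_": True, "type": "ConvNeXt", "arch": "tiny"}}}
cfg = merge_config(base, override)
```

Without `_delete_`, the ResNet-specific `depth` key would leak into the ConvNeXt section, which is a common source of confusing errors when swapping components.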


Chapter 7 - Roboflow Universe Plus Supervision - From Annotation to Training

Roboflow plays the role of GitHub for vision data. As of 2026 Roboflow Universe hosts over 300,000 public datasets and over 50,000 pretrained models.

Two key tools:

Roboflow Annotate - Web-based annotation. Boxes, polygons, keypoints, and OBB are all supported. The auto-label feature calls SAM and Grounding DINO to draft proposals.

Supervision - The vision utility kit Roboflow open-sourced. Visualization, filtering, metrics, and trackers all live in one package.

import cv2
import supervision as sv
from ultralytics import YOLO

image = cv2.imread('input.jpg')
model = YOLO('yolo11n.pt')
results = model(image)[0]

# Convert to Roboflow Supervision's Detection object
detections = sv.Detections.from_ultralytics(results)

# Visualize - box plus label in one line
box_annotator = sv.BoxAnnotator()
label_annotator = sv.LabelAnnotator()

annotated = box_annotator.annotate(scene=image, detections=detections)
annotated = label_annotator.annotate(scene=annotated, detections=detections)

# Tracking - ByteTrack in one line
tracker = sv.ByteTrack()
detections = tracker.update_with_detections(detections)

The strength of Supervision is that it separates model inference from visualization and metrics. YOLO, Detectron, and MMDetection outputs all unify under the same Detections object.
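The idea of one container fed by per-model adapters can be sketched in a few lines. This is a simplified illustration and not Supervision's real class: `SimpleDetections`, `from_xywh_rows`, and `from_dicts` are hypothetical names, and the two input layouts are assumed examples of what different detectors might emit.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]


@dataclass
class SimpleDetections:
    xyxy: List[Box]
    confidence: List[float]
    class_id: List[int]

    def filter(self, min_confidence: float) -> "SimpleDetections":
        """Return a new container holding only confident detections."""
        keep = [i for i, c in enumerate(self.confidence) if c >= min_confidence]
        return SimpleDetections(
            [self.xyxy[i] for i in keep],
            [self.confidence[i] for i in keep],
            [self.class_id[i] for i in keep],
        )


def from_xywh_rows(rows: Sequence[Sequence[float]]) -> SimpleDetections:
    """Adapter for a model that emits (x, y, w, h, score, class_id) rows."""
    return SimpleDetections(
        [(x, y, x + w, y + h) for x, y, w, h, _, _ in rows],
        [score for *_, score, _ in rows],
        [int(cls) for *_, cls in rows],
    )


def from_dicts(items: Sequence[Dict]) -> SimpleDetections:
    """Adapter for a model that emits {'box': [x1, y1, x2, y2], 'score', 'label_id'}."""
    return SimpleDetections(
        [tuple(item["box"]) for item in items],
        [item["score"] for item in items],
        [item["label_id"] for item in items],
    )


a = from_xywh_rows([(10, 20, 30, 40, 0.9, 0), (0, 0, 5, 5, 0.2, 1)])
b = from_dicts([{"box": [1, 2, 3, 4], "score": 0.7, "label_id": 2}])
confident = a.filter(0.5)
```

Everything downstream (drawing, metrics, tracking) then only needs to understand one box format, regardless of which model produced the detections.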


Chapter 8 - HuggingFace Transformers Vision - From ViT to DETR

HuggingFace Transformers is not just for NLP. As of 2026 more than 200 vision models are registered.

Representative catalog:

  • ViT (Vision Transformer) - the classification standard
  • DETR / Deformable DETR / DINO - transformer-based detection
  • Mask2Former / OneFormer - unified segmentation
  • OWL-ViT / OWLv2 - open-vocabulary detection
  • CLIP / SigLIP / SigLIP 2 - image-text embeddings
  • DINOv2 / DINOv3 - self-supervised backbones
  • SAM / SAM 2 - segmentation
  • Depth Anything v2 / v3 - depth estimation

from transformers import pipeline

# Classification - one line
classifier = pipeline('image-classification', model='google/vit-base-patch16-224')
print(classifier('input.jpg'))

# Detection
detector = pipeline('object-detection', model='facebook/detr-resnet-50')

# Open-vocabulary detection
detector = pipeline('zero-shot-object-detection', model='google/owlv2-base-patch16-ensemble')
print(detector('input.jpg', candidate_labels=['cat', 'dog', 'person']))

# Segmentation
segmenter = pipeline('image-segmentation', model='facebook/mask2former-swin-large-coco-panoptic')

HuggingFace's appeal is one-line inference via pipeline(). Swapping models is just swapping a string. Training is better done elsewhere though - HF Trainer feels awkward for vision compared to PyTorch Lightning or MMDetection.


Chapter 9 - Segment Anything 2 (SAM 2) - The New Standard for Video Masks

Meta's SAM (Segment Anything Model) first appeared in April 2023. SAM 2, released in July 2024, added memory attention so that it handles video as well as images. Select a mask in one frame, and it is tracked through the rest automatically.

As of 2026, the SAM family:

| Model | Release | Parameters | Notes |
| --- | --- | --- | --- |
| SAM | 2023 | 91M-636M | Image segmentation |
| SAM 2 | 2024 | 39M-224M | Video plus image |
| SAM 2.1 | late 2024 | same | Better on small objects and occlusion |
| SAMURAI | 2024 | same | Kalman-filter-based tracking boost |
| FastSAM | 2023 | 68M | YOLOv8-seg backbone, 50x faster |
| MobileSAM | 2023 | 9.8M | Lightweight for mobile |
| EfficientSAM | 2023 | 26M | KD-compressed |

SAM 2 usage:

from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    'configs/sam2_hiera_l.yaml',
    'checkpoints/sam2_hiera_large.pt'
)

# Init video
state = predictor.init_state(video_path='video.mp4')

# One click in first frame - mask auto-tracked
predictor.add_new_points(
    inference_state=state,
    frame_idx=0,
    obj_id=1,
    points=[[210, 350]],
    labels=[1]
)

# Track masks across all frames
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    # mask_logits shape: (num_objects, 1, H, W); threshold at 0 for binary masks
    masks = mask_logits > 0.0

SAM 2's value is that once you teach it, it follows forever. Labeling cost no longer scales with video length. CVAT, Label Studio, and Roboflow all adopted SAM 2 integration as a built-in feature.
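When SAM 2 output feeds a labeling tool, the propagated masks usually need two small post-processing steps: thresholding the mask logits into a binary mask and deriving a bounding box from it. A minimal sketch on plain nested lists follows. The helper names are hypothetical, and a real pipeline would do this on tensors rather than lists.

```python
def binarize(logits, threshold=0.0):
    """Turn a 2D grid of mask logits into a boolean mask."""
    return [[value > threshold for value in row] for row in logits]


def mask_to_box(mask):
    """Return the tight (x1, y1, x2, y2) box around True pixels, or None if empty."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, on in enumerate(row) if on]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


# Tiny stand-in for one object's logits on one frame.
logits = [
    [-1.0, -1.0, -1.0, -1.0],
    [-1.0,  2.0,  3.0, -1.0],
    [-1.0,  1.0, -2.0, -1.0],
]
mask = binarize(logits)
box = mask_to_box(mask)
```

An empty mask returns None, which is a useful signal that the object left the frame or was fully occluded.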


Chapter 10 - Grounding DINO 1.5 / 1.6 - Drawing Boxes from Text

IDEA Research's Grounding DINO is the model that made "open-vocabulary detection" the standard. The 1.0 came in 2023, 1.5 Pro/Edge in 2024, and 1.6 in late 2024.

Traditional YOLO and Detectron only detect the classes seen during training (80 for COCO, 1,203 for LVIS). Grounding DINO is different - it draws boxes from natural language prompts like "red car", "door handle", or "person holding an umbrella".

from groundingdino.util.inference import load_model, load_image, predict

model = load_model('GroundingDINO_SwinT_OGC.cfg.py', 'groundingdino_swint_ogc.pth')
image_source, image = load_image('input.jpg')

# Natural language prompt - separate noun phrases with periods
TEXT_PROMPT = 'red car. person holding umbrella. door handle.'
BOX_THRESHOLD = 0.35
TEXT_THRESHOLD = 0.25

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_THRESHOLD,
    text_threshold=TEXT_THRESHOLD,
)
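Since the prompt format matters (noun phrases separated by periods), it is convenient to build the caption from a list instead of typing it by hand. The helper below is a hypothetical convenience function, not part of the Grounding DINO package. It assumes lowercase phrases and a trailing period are the desired form.

```python
def build_caption(phrases):
    """Join noun phrases into a period-separated, lowercase prompt string."""
    cleaned = [p.strip().lower().rstrip(".") for p in phrases if p.strip()]
    return ". ".join(cleaned) + "." if cleaned else ""


caption = build_caption(["Red car", " person holding umbrella ", "door handle."])
```

Keeping the class list as data also makes it easy to reuse the same phrases later, for example as label names in an annotation export.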

Grounding SAM - the pipeline of Grounding DINO catching boxes plus SAM (or SAM 2) generating masks inside those boxes. The de facto starting point for labeling in 2026.

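from torchvision.ops import box_convert

# 1) Grounding DINO for boxes - returned as normalized (cx, cy, w, h)
boxes, _, _ = predict(model_gdino, image, 'cat. dog.', 0.35, 0.25)

# 2) Scale to pixels and convert to (x1, y1, x2, y2), the format SAM expects
h, w = image_source.shape[:2]
boxes_xyxy = box_convert(boxes * boxes.new_tensor([w, h, w, h]), 'cxcywh', 'xyxy').numpy()

# 3) SAM for masks - the original SamPredictor takes one box per call
sam_predictor.set_image(image_source)
masks = [sam_predictor.predict(box=b, multimask_output=False)[0] for b in boxes_xyxy]

These few lines are the entirety of "object segmentation without a dataset". Drawing boxes on 10,000 images by hand collapses into a 30-minute script.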


Chapter 11 - Florence-2 and YOLO-World - Other Open-Vocabulary Contenders

There are two more strong open-vocabulary players besides Grounding DINO.

Florence-2 (Microsoft, 2024) - handles classification, captioning, detection, segmentation, and OCR with one model. Very compact at 0.23B and 0.77B parameters, but the quality is high. Instead of natural language prompts it uses task tokens like <OD> (detection), <DENSE_REGION_CAPTION>, and <REFERRING_EXPRESSION_SEGMENTATION>.

from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained('microsoft/Florence-2-large', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained('microsoft/Florence-2-large', trust_remote_code=True)

prompt = '<OD>'  # Object Detection task token
inputs = processor(text=prompt, images=image, return_tensors='pt')
generated_ids = model.generate(**inputs, max_new_tokens=1024, num_beams=3)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
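# image is a PIL.Image, so its size is available as image.width and image.height
parsed = processor.post_process_generation(generated_text, task='<OD>', image_size=(image.width, image.height))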
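For the detection task, the post-processed result is a dictionary keyed by the task token, holding parallel lists of boxes and labels. The sketch below flattens that structure into per-object records. The sample values are made up for illustration, and `od_to_records` is a hypothetical helper.

```python
# Hypothetical sample of a parsed '<OD>' result; the numbers are invented.
parsed_example = {
    "<OD>": {
        "bboxes": [[34.0, 50.5, 200.0, 310.0], [220.0, 40.0, 400.0, 300.0]],
        "labels": ["cat", "dog"],
    }
}


def od_to_records(parsed, task="<OD>"):
    """Pair each box with its label as one record per detected object."""
    result = parsed[task]
    return [
        {"label": label, "xyxy": tuple(box)}
        for box, label in zip(result["bboxes"], result["labels"])
    ]


records = od_to_records(parsed_example)
```

Note that this output carries no confidence scores, so any thresholding has to happen on other criteria such as box size or label.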

YOLO-World (Tencent, 2024) - YOLO speed with open-vocabulary on top. 10-20x faster than Grounding DINO, detects from natural language prompts without training.

from ultralytics import YOLOWorld

model = YOLOWorld('yolov8x-worldv2.pt')
model.set_classes(['red car', 'door handle', 'person holding umbrella'])
results = model.predict('input.jpg')

When to pick what. Quality first - Grounding DINO 1.6, speed first - YOLO-World, multiple tasks in one model - Florence-2. These three form the 2026 open-vocabulary triangle.


Chapter 12 - VLM (Vision Language Model) - "Ask the Image"

After GPT-4V appeared in 2023, VLMs became a new layer in computer vision. As of May 2026 the major VLMs are:

Closed source (API)

  • GPT-4o / GPT-4.5-vision (OpenAI)
  • Claude 3.5 Sonnet / Claude 4 Opus (Anthropic)
  • Gemini 2.0 Flash / Gemini 2.0 Pro (Google)

Open source

  • Qwen2-VL / Qwen2.5-VL (Alibaba) - 2B/7B/72B
  • InternVL 2.5 (OpenGVLab) - 1B/2B/4B/8B/26B/38B/78B
  • LLaVA-OneVision (Bytedance/UW) - 0.5B/7B/72B
  • CogVLM2 (Zhipu) - 19B
  • Phi-3.5-vision (Microsoft) - 4.2B
  • Pixtral 12B (Mistral) - 12B

VLM usage splits into two patterns.

(a) Vision QA - "How many people are in this photo?"

from anthropic import Anthropic
import base64

client = Anthropic()
with open('input.jpg', 'rb') as f:
    img_b64 = base64.standard_b64encode(f.read()).decode('utf-8')

message = client.messages.create(
    model='claude-3-5-sonnet-20241022',
    max_tokens=1024,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'image', 'source': {'type': 'base64', 'media_type': 'image/jpeg', 'data': img_b64}},
            {'type': 'text', 'text': 'How many people are in this image?'}
        ]
    }]
)

(b) Structured Output - "Extract items and prices from this receipt as JSON"

# Constraining the reply to a JSON schema folds OCR and parsing into one call
schema = {
    'type': 'object',
    'properties': {
        'items': {'type': 'array', 'items': {
            'type': 'object',
            'properties': {
                'name': {'type': 'string'},
                'price': {'type': 'number'},
                'quantity': {'type': 'integer'}
            }
        }},
        'total': {'type': 'number'}
    }
}
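Because VLM output is not deterministic, it is worth checking the reply against the schema before trusting it. The following is a minimal hand-rolled validator covering only the keywords used in the receipt schema (type, properties, items). It is a sketch under that assumption and not a full JSON Schema implementation; `validate` is a hypothetical helper.

```python
import json

TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "object": lambda v: isinstance(v, dict),
    "array": lambda v: isinstance(v, list),
}


def validate(value, schema, path="$"):
    """Return a list of human-readable type errors (empty when the value conforms)."""
    if not TYPE_CHECKS[schema["type"]](value):
        return [f"{path}: expected {schema['type']}"]
    errors = []
    if schema["type"] == "object":
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors += validate(value[key], sub, f"{path}.{key}")
    elif schema["type"] == "array":
        for i, item in enumerate(value):
            errors += validate(item, schema["items"], f"{path}[{i}]")
    return errors


receipt_schema = {
    "type": "object",
    "properties": {
        "items": {"type": "array", "items": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "number"},
                "quantity": {"type": "integer"},
            },
        }},
        "total": {"type": "number"},
    },
}

good = json.loads('{"items": [{"name": "Latte", "price": 4.5, "quantity": 2}], "total": 9.0}')
bad = json.loads('{"items": [{"name": "Latte", "price": "4.50", "quantity": 2}], "total": 9.0}')
good_errors = validate(good, receipt_schema)
bad_errors = validate(bad, receipt_schema)
```

A non-empty error list is a natural trigger for a retry, which is often cheaper than debugging a silently wrong price downstream.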

VLM limits are also clear. (1) Weak coordinates - "third from the top-left" is fine, but exact pixel coordinates are inaccurate. (2) Expensive and slow - 10,000 images on GPT-4o costs 10-50 dollars, while YOLO is free and finishes in a minute. (3) Not deterministic - the same question can produce different answers.

Hence the pattern: ask "what to look for" with a VLM, find "where it is" with a traditional model.


Chapter 13 - 3D Vision - DUSt3R, MASt3R, VGGT

The biggest change in 3D vision in 2024-2025 was that 3D reconstruction from just two photos became possible.

DUSt3R (Naver Labs Europe, 2024) - takes two images and directly regresses pixel-wise 3D pointmaps. Works without knowing camera intrinsics. Compressed the complex SfM and MVS pipelines into one model.

MASt3R (Naver Labs Europe, 2024) - DUSt3R plus matching. Outputs pixel correspondences between two images. Directly usable for SLAM and localization.

VGGT (Meta, 2025) - Visual Geometry Grounded Transformer. Takes multiple images at once and estimates camera poses, depth maps, and pointmaps simultaneously. Overcomes the pairwise limit of DUSt3R.

Spann3R (2024) - sequential 3D reconstruction via memory tokens. Closer to video SLAM results.

from dust3r.inference import inference
from dust3r.model import AsymmetricCroCo3DStereo
from dust3r.utils.image import load_images
from dust3r.image_pairs import make_pairs

model = AsymmetricCroCo3DStereo.from_pretrained('naver/DUSt3R_ViTLarge_BaseDecoder_512_dpt').to('cuda')

images = load_images(['img1.jpg', 'img2.jpg'], size=512)
pairs = make_pairs(images, scene_graph='complete', prefilter=None, symmetrize=True)
output = inference(pairs, model, device='cuda', batch_size=1)
# extract pointmap, confidence, depth from output

This is a field where Korean contributions are large. The Grenoble team at Naver Labs Europe made DUSt3R, MASt3R, and CroCo.


Chapter 14 - Depth Anything v2/v3, Marigold, DepthPro

Monocular depth estimation exploded in 2024-2025.

Depth Anything v2 (HKU, 2024) - a strong depth model trained on 62 million unlabeled images. Four sizes: Small (24M), Base (97M), Large (335M), Giant (1.3B). Depth Anything v3 (2025) strengthened video consistency and metric depth.

Marigold (ETH Zürich, 2024) - Stable Diffusion fine-tuned for depth. Diffusion-based means good detail, but slow.

DepthPro (Apple, 2024) - estimates metric depth from one image in 0.3 seconds. The basis for iPhone depth without LiDAR.

from transformers import pipeline

pipe = pipeline(task='depth-estimation', model='depth-anything/Depth-Anything-V2-Large-hf')
depth = pipe('input.jpg')['depth']  # PIL Image

When to pick what. Relative depth is enough - Depth Anything, metric depth needed - DepthPro, detail first - Marigold.
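The difference between relative and metric depth can be made concrete: a relative prediction is only defined up to an unknown scale and shift, which a few known distances can recover. The sketch below fits that scale and shift by ordinary least squares. It assumes the relationship is linear in the space the model predicts (many relative models predict inverse depth, in which case the fit belongs in that space), and `fit_scale_shift` is a hypothetical helper.

```python
def fit_scale_shift(relative, metric):
    """Least-squares fit of metric ~ scale * relative + shift."""
    n = len(relative)
    if n < 2 or n != len(metric):
        raise ValueError("need at least two paired samples")
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0:
        raise ValueError("relative values must not all be equal")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


# Three pixels where both the relative prediction and the true distance are known.
scale, shift = fit_scale_shift([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
aligned = [scale * r + shift for r in [0.5, 1.5]]
```

If no reference distances are available at all, this alignment is impossible, which is exactly the case where a metric model is the right choice.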


Chapter 15 - Pose Estimation - MMPose, OpenPose, AlphaPose, DWPose

Pose estimation finds keypoints. The 2026 standard catalog:

| Tool | Keypoints | Notes |
| --- | --- | --- |
| MediaPipe Pose | 33 | Mobile real-time |
| OpenPose | 25 (BODY_25) | Multi-person, older standard |
| AlphaPose | 17 (COCO) | Top-down accuracy |
| MMPose | 17-133 | Widest model catalog |
| DWPose | 133 (full body, face, hands) | The standard for ControlNet pose conditioning |
| RTMPose | 17 | Mobile real-time, under MMPose |

DWPose became the de facto pose standard in 2024-2026. The reason is simple - ControlNet, AnimateDiff, and Stable Video Diffusion all accept DWPose keypoints as conditions. Pose conditioning in generative AI is mostly DWPose.

from mmpose.apis import inference_topdown, init_model

config = 'configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-x_8xb320-270e_cocktail14-384x288.py'
checkpoint = 'rtmw-x_simcc-cocktail14_pt-ucoco_270e-384x288.pth'

model = init_model(config, checkpoint, device='cuda')
results = inference_topdown(model, 'input.jpg', bboxes=person_boxes)
# results[i].pred_instances.keypoints, keypoint_scores
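Keypoints are rarely the end product; most applications derive something from them, such as a joint angle. A minimal sketch of that step follows. The `joint_angle` helper and the sample coordinates are hypothetical, and a real pipeline would first discard keypoints with low scores.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at point b, formed by the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return None  # degenerate: two keypoints coincide
    cos = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


# Invented (x, y) pixel positions for shoulder, elbow, and wrist.
shoulder, elbow, wrist = (100, 100), (100, 200), (200, 200)
elbow_angle = joint_angle(shoulder, elbow, wrist)
straight_arm = joint_angle((0, 0), (50, 0), (100, 0))
```

Clamping the cosine to [-1, 1] guards against floating point drift that would otherwise make math.acos raise on nearly straight limbs.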

Chapter 16 - Object Tracking - ByteTrack, BoT-SORT, OC-SORT, DEVA

Tracking assigns the same ID to the same object across video frames. The 2026 standard narrows to four:

| Tracker | Input | Notes |
| --- | --- | --- |
| ByteTrack (2022) | Box plus confidence | Two-stage matching for low-confidence boxes. Most widely used |
| BoT-SORT (2022) | Box plus ReID embedding | Camera motion compensation |
| OC-SORT (2023) | Box | Observation-centric, robust to occlusion |
| DEVA (2023) | Box plus mask | Pairs with SAM, video segmentation tracking |

ByteTrack is integrated into Roboflow Supervision, so it is a one-liner:

import supervision as sv

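# Recent Supervision releases name these track_activation_threshold and lost_track_buffer
tracker = sv.ByteTrack(track_activation_threshold=0.5, lost_track_buffer=30)

for frame, detections in stream():  # stream() is a placeholder for your own frame source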
    detections = tracker.update_with_detections(detections)
    # detections.tracker_id holds IDs

Remember also that SAM 2 itself acts as a tracker. One click in one frame, and the mask propagates across the whole video. Tracking and segmentation are now merged into one model.
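The core idea behind box-based trackers, matching this frame's boxes to last frame's tracks by overlap, fits in a short sketch. This is a deliberately simplified greedy IoU matcher, not ByteTrack: it has no Kalman filter, no Hungarian assignment, no low-confidence second pass, and it drops a track as soon as it misses one frame. The class name and threshold are assumptions.

```python
from itertools import count


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class GreedyIoUTracker:
    def __init__(self, iou_thresh=0.3):
        self.iou_thresh = iou_thresh
        self.tracks = {}        # track id -> last seen box
        self._ids = count(1)

    def update(self, boxes):
        """Assign a track id to each box, reusing ids where the overlap is high enough."""
        unmatched = dict(self.tracks)   # previous-frame tracks not yet claimed
        assigned = []
        for box in boxes:
            best_id, best_iou = None, self.iou_thresh
            for tid, prev in unmatched.items():
                overlap = iou(box, prev)
                if overlap > best_iou:
                    best_id, best_iou = tid, overlap
            if best_id is None:
                best_id = next(self._ids)   # no match: start a new track
            else:
                del unmatched[best_id]
            self.tracks[best_id] = box
            assigned.append(best_id)
        for tid in unmatched:               # tracks that found no box this frame
            del self.tracks[tid]
        return assigned


tracker = GreedyIoUTracker()
ids_1 = tracker.update([(0, 0, 10, 10), (50, 50, 60, 60)])
ids_2 = tracker.update([(51, 51, 61, 61), (1, 1, 11, 11)])
ids_3 = tracker.update([(100, 100, 110, 110)])
```

The IDs follow the objects even when the detection order changes between frames. The pieces this sketch omits are what make production trackers robust to occlusion and missed detections.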


Chapter 17 - Diffusion-Based Vision - ControlNet, IP-Adapter

Generative vision is no longer "another field". ControlNet and IP-Adapter take detection or segmentation outputs as input conditions and generate new images.

  • ControlNet - takes Canny edge, depth, pose (DWPose), segmentation, normal as conditions
  • IP-Adapter - takes an image itself as style or content condition
  • T2I-Adapter - a lighter alternative to ControlNet

from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
import torch

controlnet = ControlNetModel.from_pretrained('thibaud/controlnet-openpose-sdxl-1.0', torch_dtype=torch.float16)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    controlnet=controlnet,
    torch_dtype=torch.float16
).to('cuda')

# compute_dwpose is a placeholder for your own DWPose step that renders a keypoint image
pose_image = compute_dwpose(input_image)
image = pipe('a person dancing in the rain', image=pose_image).images[0]

This snippet is the standard "vision recognition -> vision generation" pipeline. Detection and generation now live in the same toolchain.


Chapter 18 - Embedding Models - CLIP, SigLIP, DINOv2, DINOv3

Models that turn images into vectors are the most-called computer vision models in 2026. Image search, dedup, clustering, and zero-shot classification all run on embeddings.

| Model | Training signal | Dim | Strength |
| --- | --- | --- | --- |
| CLIP (2021) | Image-text pairs | 512/768 | Aligned with text |
| OpenCLIP | Same, larger data | Same | Stronger baseline |
| SigLIP (2023) | Sigmoid loss | Same | More efficient than CLIP |
| SigLIP 2 (2025) | Multi-task | Same | Strong on OCR, documents |
| DINOv2 (2023) | Self-supervised (SSL) | 768/1024/1536 | Text-independent, strong features |
| DINOv3 (2025) | Self-supervised, larger | Same | Successor to DINOv2 |

When to pick which embedding:

  • Text-to-image search -> CLIP or SigLIP 2
  • Image-to-image search -> DINOv3
  • Downstream classification head training -> DINOv3
  • OCR or document work -> SigLIP 2

from transformers import AutoModel, AutoProcessor
import torch

# Hub ids for the ViT checkpoints include the patch size (vitl16)
model = AutoModel.from_pretrained('facebook/dinov3-vitl16-pretrain-lvd1689m')
processor = AutoProcessor.from_pretrained('facebook/dinov3-vitl16-pretrain-lvd1689m')

inputs = processor(images=image, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

embedding = outputs.last_hidden_state.mean(dim=1)  # (1, 1024)
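Once images are vectors, search and dedup reduce to nearest-neighbor lookups under cosine similarity. A minimal pure-Python sketch follows, using tiny made-up 3-dimensional vectors in place of real 1024-dimensional embeddings. The function names and file names are hypothetical, and a real system would use a vector index instead of a linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def top_k(query, index, k=2):
    """Return the k most similar (name, score) pairs, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Invented embeddings; real ones would come from the model above.
index = {
    "cat_1.jpg": [0.9, 0.1, 0.0],
    "cat_2.jpg": [0.8, 0.2, 0.1],
    "car_1.jpg": [0.0, 0.1, 0.95],
}
matches = top_k([1.0, 0.0, 0.0], index, k=2)
```

The same function covers dedup as well: pairs whose similarity exceeds a high threshold are near-duplicate candidates.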

Chapter 19 - Annotation Tools - CVAT, Label Studio, Roboflow, VIA

Labeling tools in 2026 share one common trait - AI-assist is the default. SAM, Grounding DINO, and YOLO are called directly inside the tool.

| Tool | License | Strength |
| --- | --- | --- |
| CVAT | MIT, open source | Widest format support, strong on video |
| Label Studio | Apache 2.0 | Unified NLP/audio/image with ML backends |
| Roboflow Annotate | Commercial SaaS | SAM and Grounding DINO integration, collaboration |
| VIA | BSD, open source | Lightweight single HTML file |

Team-scale work - Roboflow or CVAT, model integration important - Label Studio, start within 5 minutes - VIA.


Chapter 20 - Inference Runtimes - ONNX Runtime, TensorRT, OpenVINO

Running trained models fast is what an inference runtime does. The 2026 options:

| Runtime | Target | Strength |
| --- | --- | --- |
| ONNX Runtime | Cross-platform | CPU/GPU/NPU, most standard |
| TensorRT | NVIDIA GPU | Top speed, INT8 and FP8 quantization |
| OpenVINO | Intel CPU/iGPU/NPU | Best on x86 PC |
| CoreML | Apple Silicon | iOS/macOS, leverages ANE |
| TFLite | Mobile (Android) | XNNPACK, Hexagon |
| NCNN | Mobile/embedded | Tencent, ARM-optimized |
| MNN | Mobile/embedded | Alibaba, strong OpenCL |

YOLO can export all of them in one line:

from ultralytics import YOLO
model = YOLO('yolo11n.pt')

model.export(format='onnx', dynamic=True, simplify=True)
model.export(format='engine', half=True)        # TensorRT FP16
model.export(format='openvino', int8=True)      # OpenVINO INT8
model.export(format='coreml', nms=True)         # CoreML
model.export(format='tflite', int8=True)        # TFLite INT8
model.export(format='ncnn')                     # NCNN

Selection rules. NVIDIA GPU server -> TensorRT. Intel PC -> OpenVINO. iOS/macOS -> CoreML. Android -> TFLite. Not sure -> ONNX Runtime.
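Selection rules are a starting point; the deciding number is throughput measured on the target device. A minimal timing harness is sketched below. The `measure_fps` helper is hypothetical, the inference call is passed in as a plain function, and the warmup count is an assumed default meant to absorb one-time costs such as engine building or cache fills.

```python
import time


def measure_fps(infer, sample, warmup=5, runs=50):
    """Time `runs` calls of infer(sample) after `warmup` untimed calls."""
    for _ in range(warmup):
        infer(sample)
    start = time.perf_counter()
    for _ in range(runs):
        infer(sample)
    elapsed = time.perf_counter() - start
    return runs / elapsed if elapsed > 0 else float("inf")


# Stand-in workload; in practice this would wrap a runtime session's inference call.
def fake_infer(x):
    return sum(i * i for i in range(1000))


fps = measure_fps(fake_infer, sample=None, warmup=2, runs=20)
```

Running the same harness against each exported format on the real hardware gives a like-for-like comparison, provided input size and batch size are kept identical across runs.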


Chapter 21 - Mobile Vision - ML Kit, Vision Framework, MNN

When you ship vision in a mobile app, native frameworks are the safest bet.

Google ML Kit (Android/iOS) - faces, barcodes, text, landmarks, translation, pose. On-device or server options.

Apple Vision Framework (iOS/macOS) - over 100 vision requests like VNDetectFaceRectanglesRequest and VNDetectHumanBodyPoseRequest. VNGenerateForegroundInstanceMaskRequest, added in iOS 17 (2023), is the closest on-device counterpart to SAM.

MNN (Alibaba) - the de facto standard in the Chinese mobile ecosystem. Embedded in Alibaba, Pinduoduo, and ByteDance apps.

// Swift / Apple Vision
import Vision

let request = VNDetectHumanBodyPoseRequest { request, error in
    guard let obs = request.results as? [VNHumanBodyPoseObservation] else { return }
    for observation in obs {
        let points = try? observation.recognizedPoints(.all)
        // points holds keypoints
    }
}
let handler = VNImageRequestHandler(cgImage: cgImage)
try? handler.perform([request])

Rule. Standard ML features (faces, text, barcodes) - native, fast and free. Custom models - CoreML (iOS) and TFLite (Android), ship your own model.


Chapter 22 - The Korean Vision Ecosystem

Korea's computer vision contributions are heavy on both the academic and industrial sides.

Naver Labs Europe - the Grenoble team behind DUSt3R, MASt3R, and CroCo. Frontier in 3D vision worldwide.

KAIST CVLab - groups of Professors In So Kweon, Yong Man Ro, and Eunbyung Park. Regular publications at CVPR and ICCV.

Lunit - medical imaging AI. Holds many FDA and CE clearances for chest X-ray, mammography, and digital pathology. INSIGHT CXR, MMG, BCC are its flagship products.

Seerslab - AR/VR/vision. Naver Z subsidiary, runs face recognition and filter engines for ZEPETO and SNOW.

VUNO - medical imaging AI. DeepASR, DeepCT, DeepBrain.

MakinaRocks - industrial vision anomaly detection. Semiconductor and display inspection.

Riiid / Trinity / Innospace - education and defense.

Korean academia is strong in faces, OCR, autonomous driving, and medical imaging. Korea sits in the world's top 4-5 for CVPR publications.


Chapter 23 - The Japanese Vision Ecosystem

Preferred Networks (PFN) - Japan's largest AI company. PFN-Vision library, vision for chemistry, materials, and robotics.

ABEJA - Tokyo-based vision SaaS. Store analytics for retail and manufacturing.

Recruit Holdings - vision recruiting systems via subsidiaries like Indeed and JOBSRU.

ALBERT (now Accenture Japan) - industrial AI and vision consulting.

Nikon AI - medical imaging and industrial inspection. A camera company expanded into vision AI.

LeapMind - strong embedded vision via the Blueoil quantized inference engine.

SoftBank Robotics - vision systems on Pepper and NAO.

Fast Retailing (UNIQLO) - in-house vision library for store camera analytics.

Japanese academia is strong in OCR, document vision, and robotics vision. UTokyo IIS, Kyoto U, Nagoya U, and Tokyo Tech regularly publish at CVPR.


Chapter 24 - The Posture of a 2026 Computer Vision Engineer

Boiling down the preceding 23 chapters into a single page:

(1) Avoid training whenever possible. A combination of pretrained models, open-vocabulary models, and VLMs covers 80%. Refining a Grounding DINO prompt takes less time than collecting 10,000 images.

(2) Design pipelines, do not train models. The job in 2026 is "which model to call, in what order", not "which loss to train on".

(3) Divide labor between VLMs and traditional models. Ask the VLM "what to find", ask the traditional model "where it is". This is the standard shape of a 2026 vision system.

(4) Do not underestimate the inference runtime. A model that runs at 30 fps in PyTorch can hit 200 fps in TensorRT. A gap of more than 6x is a different product in user experience.

(5) Check the license. The YOLO family is AGPL. Dropping it straight into a company product triggers source disclosure. Know RT-DETR, D-FINE, and MMDetection as alternatives.

(6) Leave "pixel work" to OpenCV. Color spaces, resizing, decoding, encoding - OpenCV is still the fastest and most stable.

(7) Labeling tools come before datasets. Start with CVAT or Roboflow with SAM 2 integration. The era of drawing boxes by hand is over.

The future of a vision engineer is not "the person who makes the model" but "the person who composes the models". There are many tools. Stand in the right line.

