Computer Vision Frameworks 2026 - OpenCV 4, MediaPipe, Detectron2, YOLO v11, MMDetection, SAM 2, Grounding DINO Deep Dive
Prologue - Computer Vision Became "Asking" Instead of "Seeing"
Computer vision in the 2010s had a clear shape. SIFT, HOG, and Haar extracted features, SVMs and random forests classified them, and OpenCV tied it all together. The early 2020s belonged to ResNet, EfficientNet, and Mask R-CNN - 90% of the job was collecting datasets, training models, and squeezing out mAP.
The landscape in 2026 looks different. A single sentence like "estimate the pose of the person in a red helmet in this photo" maps to a three-line pipeline: Grounding DINO catches the box, SAM 2 makes the mask, MMPose extracts the keypoints. We barely train anything. Instead, we design "which model to call, in what order".
This article walks through the 2026 computer vision stack in one breath. From the basics of OpenCV through SAM 2 and VLMs to DINOv3, DUSt3R, and mobile inference - the criteria for choosing the right tool, packed into a single page.
Chapter 1 - The 2026 CV Stack Map
Before diving into individual tools, let me draw the overall map. The 2026 computer vision world splits into five layers.
[Layer 5] Vision Language Model (VLM)
GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash
Qwen2-VL, InternVL 2.5, Pixtral 12B
|
[Layer 4] Open-Vocabulary Foundation
Grounding DINO 1.6, Florence-2, YOLO-World
SAM 2, DINOv3, CLIP, SigLIP
|
[Layer 3] Task-Specific Model
YOLO v11, Detectron3, MMDetection
MMPose, DWPose, ByteTrack, Depth Anything v3
|
[Layer 2] Inference Runtime
ONNX Runtime, TensorRT, OpenVINO
CoreML, TFLite, NCNN, MNN
|
[Layer 1] Image I/O and Primitives
OpenCV 4.10, Pillow-SIMD
FFmpeg, GStreamer
The higher the layer, the more "intelligent" the system gets, but latency rises with it. **VLMs run at one or two frames per second, YOLO v11 runs at over 100 fps.** The job of a 2026 vision engineer is composing these two ends.
One-line summary: **"Ask the question with a VLM, draw the answer with YOLO."**
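One way to compose the two ends is to run the fast detector on every frame and consult the VLM only on keyframes. The sketch below is a minimal illustration, assuming a 30 fps camera and a VLM that sustains about two calls per second. The names `vlm_stride` and `plan_calls` are hypothetical helpers, not part of any library.

```python
import math

def vlm_stride(video_fps: float, vlm_calls_per_sec: float) -> int:
    """Number of frames between VLM calls so the VLM keeps up with the stream."""
    if video_fps <= 0 or vlm_calls_per_sec <= 0:
        raise ValueError("rates must be positive")
    return max(1, math.ceil(video_fps / vlm_calls_per_sec))

def plan_calls(num_frames: int, stride: int) -> list[tuple[int, str]]:
    """Label each frame index with the models that should run on it."""
    plan = []
    for i in range(num_frames):
        # The detector runs on every frame; the VLM only on keyframes.
        plan.append((i, "detector+vlm" if i % stride == 0 else "detector"))
    return plan

stride = vlm_stride(video_fps=30, vlm_calls_per_sec=2)
print(stride)  # 15
print(plan_calls(4, stride=2))
```

In a real system the VLM call would usually run asynchronously so that a slow reply never blocks the per-frame detector.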
Chapter 2 - OpenCV 4.10 / 5.x - Still the Starting Point for Everything
OpenCV did not die. It got even stronger in 2026. The reason is simple - reading images, cropping, converting color spaces, decoding video frames are all required by any deep learning pipeline.
As of May 2026 OpenCV 4.10 is the LTS, and 5.0 beta is in active development. Three key changes stand out.
First, **the DNN module became the default for ONNX inference.** You can call YOLO, ResNet, or ViT in one line via `cv2.dnn.readNetFromONNX()` without going through PyTorch or TensorFlow.
Second, **G-API (Graph API) is stable.** It expresses input-to-output as a graph and runs on OpenCL, CUDA, or Vulkan backends. Especially powerful on mobile and embedded.
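Third, **CUDA and OpenCL acceleration are available.** The `cv2.cuda` module runs Gaussian blur, optical flow, and image warping directly on the GPU, but only in an OpenCV build compiled with CUDA support. The standard `opencv-python` wheels on PyPI are CPU-only.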
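```python
import cv2

# 1) Read image - color order is BGR (careful!)
img = cv2.imread('input.jpg')
if img is None:  # imread returns None on failure instead of raising
    raise FileNotFoundError('input.jpg')
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# 2) Resize - INTER_AREA is best for shrinking
small = cv2.resize(img_rgb, (640, 640), interpolation=cv2.INTER_AREA)

# 3) DNN inference - load ONNX model directly
net = cv2.dnn.readNetFromONNX('yolov11n.onnx')
# The CUDA backend and target need an OpenCV build compiled with CUDA
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
# `small` is already RGB, so do not swap the channels a second time
blob = cv2.dnn.blobFromImage(small, 1/255.0, (640, 640), swapRB=False)
net.setInput(blob)
outputs = net.forward()
```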
Two things to remember: **OpenCV uses the BGR color space** (different from PIL/PyTorch), and **`imread` returns None on failure** (it does not raise). These two facts cost someone an hour of debugging every week in 2026.
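Both pitfalls can be guarded in a few lines. The following is a minimal sketch with hypothetical helper names (`load_or_raise`, `bgr_to_rgb_pixel`); it uses a stand-in reader so it runs without OpenCV installed. In practice you would pass `cv2.imread` as the `reader` argument.

```python
from typing import Any, Callable

def load_or_raise(reader: Callable[[str], Any], path: str) -> Any:
    """Call an imread-style reader and turn a silent None into an exception."""
    img = reader(path)
    if img is None:
        raise FileNotFoundError(f"could not read image: {path}")
    return img

def bgr_to_rgb_pixel(pixel: tuple[int, int, int]) -> tuple[int, int, int]:
    """Reverse the channel order of a single (B, G, R) pixel."""
    b, g, r = pixel
    return (r, g, b)

# Stand-in reader that fails the same way cv2.imread does: it returns None.
def fake_reader(path: str):
    return None

try:
    load_or_raise(fake_reader, "missing.jpg")
except FileNotFoundError as exc:
    print(exc)

print(bgr_to_rgb_pixel((255, 0, 0)))  # pure blue in BGR -> (0, 0, 255) in RGB
```

Failing loudly at load time keeps the error next to its cause instead of surfacing later as an attribute error on `None`.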
Chapter 3 - MediaPipe 0.10 / MediaPipe Studio - The New Standard for Mobile Real-Time
Google's MediaPipe went through a major shift in 2023. The legacy Solutions API (`mp.solutions`) gave way to the **MediaPipe Tasks API**, and the web-based tool **MediaPipe Studio** appeared for evaluating and customizing models without writing code.
As of 2026, MediaPipe offers the following solutions through one-line APIs:
- **Hand Landmarker** - 21 hand keypoints
- **Pose Landmarker** - 33 body keypoints plus segmentation mask
- **Face Landmarker** - 478 facial mesh points plus blendshapes
- **Image Embedder** - MobileNet-V3 embeddings
- **Object Detector** - EfficientDet-Lite
- **Image Segmenter** - Selfie segmentation, hair segmentation
- **Gesture Recognizer** - 7 pre-trained gestures
- **Image Classifier** - EfficientNet-Lite
```python
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Pose Landmarker - one-line instance creation
options = vision.PoseLandmarkerOptions(
    base_options=python.BaseOptions(model_asset_path='pose_landmarker.task'),
    running_mode=vision.RunningMode.VIDEO,
    num_poses=2,
    min_pose_detection_confidence=0.5,
)
landmarker = vision.PoseLandmarker.create_from_options(options)

# Inference per frame: mp_image is an mp.Image, and timestamp_ms
# must increase monotonically from frame to frame
result = landmarker.detect_for_video(mp_image, timestamp_ms)
for pose in result.pose_landmarks:
    for lm in pose:
        print(lm.x, lm.y, lm.z, lm.visibility)
```
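The landmark `x` and `y` values are normalized by image width and height, so drawing them needs a conversion to pixels. The sketch below shows that step with a stand-in `Landmark` dataclass (hypothetical, used here so the example runs without MediaPipe); `to_pixels` and `visible_points` are illustrative helper names.

```python
from dataclasses import dataclass

@dataclass
class Landmark:
    """Stand-in for a MediaPipe landmark: x and y are normalized to the image size."""
    x: float
    y: float
    visibility: float

def to_pixels(lm: Landmark, width: int, height: int) -> tuple[int, int]:
    """Scale normalized coordinates to pixel coordinates, clamped to the image."""
    px = min(max(int(lm.x * width), 0), width - 1)
    py = min(max(int(lm.y * height), 0), height - 1)
    return px, py

def visible_points(landmarks, width, height, min_visibility=0.5):
    """Keep only landmarks whose visibility clears the threshold."""
    return [to_pixels(lm, width, height)
            for lm in landmarks if lm.visibility >= min_visibility]

pose = [Landmark(0.5, 0.25, 0.9), Landmark(1.2, 0.5, 0.8), Landmark(0.1, 0.1, 0.2)]
print(visible_points(pose, width=640, height=480))  # [(320, 120), (639, 240)]
```

Clamping matters because a landmark estimated outside the frame can have a normalized coordinate below 0 or above 1.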
MediaPipe's real value is **real-time speed on mobile, typically 30 to 60 FPS** on recent phones. The same task run through an unoptimized PyTorch model often struggles to reach 5 FPS on a phone. CPU/GPU/NPU auto-dispatch, TFLite optimization, and the XNNPACK backend are bundled together.
The limit is also clear - if the task is not predefined, you cannot use it, and training your own model requires the MediaPipe Model Maker detour. MediaPipe owns the "do a fixed job, fast" niche.
Chapter 4 - Detectron2 / Detectron3 - Meta's Orthodox Detection Toolkit
Meta AI Research's Detectron2 has been the de facto academic standard since its 2019 release. **Detectron3** entered beta in late 2025, and as of 2026 the two coexist.
The differences:
| Item | Detectron2 | Detectron3 |
| --- | --- | --- |
| Default backbones | ResNet, ViT | ConvNeXt v2, DINOv3, SAM2 encoder |
| Detection heads | Mask R-CNN, Cascade R-CNN | Mask R-CNN, Mask2Former, ViTDet |
| Training framework | PyTorch 1.x/2.x | PyTorch 2.5+, torch.compile by default |
| Config system | YACS (yaml), LazyConfig optional | LazyConfig (pythonic) |
| Distributed training | DDP | FSDP plus activation checkpoint |
A Detectron2 code snippet:
```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
)
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
predictor = DefaultPredictor(cfg)

image = cv2.imread('input.jpg')  # DefaultPredictor expects a BGR array by default
outputs = predictor(image)
instances = outputs["instances"]
# instances.pred_boxes, instances.pred_masks, instances.pred_classes
```
LazyConfig replaces yaml with Python objects. Detectron2 already ships it as an alternative to YACS, and Detectron3 makes it the default. IDE autocompletion and type checking work, and conditional logic stays clean.
**When to use Detectron?** When reproducing papers, comparing COCO/LVIS benchmarks, or when you need a "standard" Mask R-CNN baseline. In production many teams migrate to YOLO or MMDetection.
Chapter 5 - The YOLO Family - From v8 to v12
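The YOLO line shipped v8 in early 2023, v9 and v10 in the first half of 2024, YOLO11 in September 2024, and v12 in early 2025. Ultralytics develops v8 and YOLO11 and supports the community-developed v9, v10, and v12 in the same package. No other vision framework matches this pace of major releases.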
Summary:
| Version | Release | Key change | License |
| --- | --- | --- | --- |
| YOLOv8 | 2023 | Anchor-free, unified classification/segmentation | AGPL-3.0 |
| YOLOv9 | 2024 | PGI (Programmable Gradient Information), GELAN | AGPL-3.0 |
| YOLOv10 | 2024 | NMS-free head, end-to-end training | AGPL-3.0 |
| YOLOv11 | 2024 | C3k2 block, SPPF plus C2PSA, fewer parameters | AGPL-3.0 |
| YOLOv12 | 2025 | Attention-centric architecture, FlashAttention-based | AGPL-3.0 |
YOLO's appeal is summarized in a single block:
```python
from ultralytics import YOLO

# 1) Load - five tasks share one API
model = YOLO('yolo11n-seg.pt')   # segmentation
model = YOLO('yolo11n-pose.pt')  # pose
model = YOLO('yolo11n-obb.pt')   # oriented bounding box
model = YOLO('yolo11n-cls.pt')   # classification
model = YOLO('yolo11n.pt')       # detection (nano), used below

# 2) Inference
results = model('input.jpg')
for r in results:
    print(r.boxes.xyxy)  # coordinates
    print(r.boxes.conf)  # confidence
    print(r.boxes.cls)   # class

# 3) Training
model.train(data='coco.yaml', epochs=100, imgsz=640)

# 4) Export - ONNX, TensorRT, CoreML, TFLite all in one line
model.export(format='onnx')
model.export(format='engine')  # TensorRT
model.export(format='coreml')
```
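**Warning about AGPL.** AGPL-3.0 allows commercial use, but shipping Ultralytics YOLO code or weights inside a product, including a SaaS or web service reached over a network, triggers source disclosure obligations. Keeping your own code closed requires the Ultralytics Enterprise license. Some companies use RT-DETR (the original Baidu release), DAMO-YOLO, or D-FINE under Apache-2.0 to avoid this.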
Chapter 6 - MMDetection / MMCV / OpenMMLab - The Widest Catalog
OpenMMLab, run by the Shanghai AI Lab, owns the widest model catalog in vision. It has more than 10 sub-projects including **MMDetection** (detection), **MMSegmentation** (segmentation), **MMPose** (pose), **MMTracking** (tracking), **MMDetection3D** (3D detection), and **MMYOLO** (unified YOLO).
Two distinguishing traits:
First, **all models share one config system.** Swapping a Mask R-CNN backbone to ConvNeXt, FPN to BiFPN, or the head to DETR is a few lines in a Python config file.
Second, **benchmark reproducibility is strong.** Reaching paper results within plus or minus 0.3 mAP is normal. It is the framework closest to the academic standard.
```python
from mmdet.apis import init_detector, inference_detector

config = 'configs/yolox/yolox_s_8xb8-300e_coco.py'
checkpoint = 'yolox_s.pth'
model = init_detector(config, checkpoint, device='cuda:0')
result = inference_detector(model, 'demo.jpg')
# result.pred_instances.bboxes, .scores, .labels
```
The downside of MMDetection is **the steep learning curve.** You have to understand the config system, the Registry pattern, and the Hook system before plugging in your own model. Better suited for "going deep" than "starting fast".
Chapter 7 - Roboflow Universe Plus Supervision - From Annotation to Training
Roboflow plays the role of GitHub for vision data. As of 2026 **Roboflow Universe** hosts over 300,000 public datasets and over 50,000 pretrained models.
Two key tools:
**Roboflow Annotate** - Web-based annotation. Boxes, polygons, keypoints, and OBB are all supported. The auto-label feature calls SAM and Grounding DINO to draft proposals.
**Supervision** - The vision utility kit Roboflow open-sourced. Visualization, filtering, metrics, and trackers all live in one package.
```python
import cv2
import supervision as sv
from ultralytics import YOLO

model = YOLO('yolo11n.pt')
image = cv2.imread('input.jpg')
results = model(image)[0]

# Convert to Roboflow Supervision's Detections object
detections = sv.Detections.from_ultralytics(results)

# Visualize - box plus label, one line each
box_annotator = sv.BoxAnnotator()
label_annotator = sv.LabelAnnotator()
annotated = box_annotator.annotate(scene=image.copy(), detections=detections)
annotated = label_annotator.annotate(scene=annotated, detections=detections)

# Tracking - ByteTrack in one line
tracker = sv.ByteTrack()
detections = tracker.update_with_detections(detections)
```
The strength of Supervision is that it **separates model inference from visualization and metrics**. YOLO, Detectron, and MMDetection outputs all unify under the same `Detections` object.
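The idea of a shared detection container can be sketched without any framework. The code below is an illustration of the pattern, not Supervision's actual implementation: the `Detections` class here is a simplified stand-in, and `from_xywh_dicts` is an adapter for a hypothetical model that emits `{'xywh', 'score', 'label'}` dictionaries.

```python
from dataclasses import dataclass, field

@dataclass
class Detections:
    """Framework-neutral container: one entry per detected object."""
    xyxy: list[tuple[float, float, float, float]] = field(default_factory=list)
    confidence: list[float] = field(default_factory=list)
    class_id: list[int] = field(default_factory=list)

    def filter(self, min_confidence: float) -> "Detections":
        """Return a new container holding only confident detections."""
        keep = [i for i, c in enumerate(self.confidence) if c >= min_confidence]
        return Detections(
            [self.xyxy[i] for i in keep],
            [self.confidence[i] for i in keep],
            [self.class_id[i] for i in keep],
        )

def from_xywh_dicts(rows: list[dict]) -> Detections:
    """Adapter for a hypothetical model that emits {'xywh', 'score', 'label'} dicts."""
    det = Detections()
    for row in rows:
        x, y, w, h = row["xywh"]  # top-left corner plus size
        det.xyxy.append((x, y, x + w, y + h))
        det.confidence.append(row["score"])
        det.class_id.append(row["label"])
    return det

raw = [{"xywh": (10, 20, 30, 40), "score": 0.9, "label": 0},
       {"xywh": (0, 0, 5, 5), "score": 0.2, "label": 1}]
kept = from_xywh_dicts(raw).filter(min_confidence=0.5)
print(kept.xyxy)  # [(10, 20, 40, 60)]
```

Once every model has an adapter into the same container, annotators, trackers, and metrics only need to be written once.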
Chapter 8 - HuggingFace Transformers Vision - From ViT to DETR
HuggingFace Transformers is not just for NLP. As of 2026 more than 200 vision models are registered.
Representative catalog:
- **ViT (Vision Transformer)** - the classification standard
- **DETR / Deformable DETR / DINO** - transformer-based detection
- **Mask2Former / OneFormer** - unified segmentation
- **OWL-ViT / OWLv2** - open-vocabulary detection
- **CLIP / SigLIP / SigLIP 2** - image-text embeddings
- **DINOv2 / DINOv3** - self-supervised backbones
- **SAM / SAM 2** - segmentation
- **Depth Anything v2 / v3** - depth estimation
```python
from transformers import pipeline

# Classification - one line
classifier = pipeline('image-classification', model='google/vit-base-patch16-224')
print(classifier('input.jpg'))

# Detection
detector = pipeline('object-detection', model='facebook/detr-resnet-50')

# Open-vocabulary detection
detector = pipeline('zero-shot-object-detection', model='google/owlv2-base-patch16-ensemble')
print(detector('input.jpg', candidate_labels=['cat', 'dog', 'person']))

# Segmentation
segmenter = pipeline('image-segmentation', model='facebook/mask2former-swin-large-coco-panoptic')
```
HuggingFace's appeal is one-line inference via `pipeline()`. Swapping models is just swapping a string. Training is better done elsewhere though - HF Trainer feels awkward for vision compared to PyTorch Lightning or MMDetection.
Chapter 9 - Segment Anything 2 (SAM 2) - The New Standard for Video Masks
Meta's SAM (Segment Anything Model) first appeared in April 2023. **SAM 2**, released in July 2024, added **memory attention for video** on top of image segmentation. Catch a mask in one frame, and the rest are tracked automatically.
As of 2026, the SAM family:
| Model | Release | Parameters | Notes |
| --- | --- | --- | --- |
| SAM | 2023 | 91M-636M | Image segmentation |
| SAM 2 | 2024 | 39M-224M | Video plus image |
| SAM 2.1 | late 2024 | same | Better on small objects and occlusion |
| SAMURAI | 2024 | same | Kalman-filter-based tracking boost |
| FastSAM | 2023 | 68M | YOLOv8-seg backbone, 50x faster |
| MobileSAM | 2023 | 9.8M | Lightweight for mobile |
| EfficientSAM | 2023 | 26M | KD-compressed |
SAM 2 usage:
```python
import numpy as np
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    'configs/sam2_hiera_l.yaml',
    'checkpoints/sam2_hiera_large.pt'
)

# Init video
state = predictor.init_state(video_path='video.mp4')

# One click in first frame - mask auto-tracked
predictor.add_new_points(
    inference_state=state,
    frame_idx=0,
    obj_id=1,
    points=np.array([[210, 350]], dtype=np.float32),
    labels=np.array([1], dtype=np.int32),  # 1 = positive click
)

# Track masks across all frames
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    # mask_logits shape: (num_objects, 1, H, W); threshold at 0 for binary masks
    masks = mask_logits > 0.0
```
SAM 2's value is that **once you teach it, it follows forever**. Labeling cost no longer scales with video length. CVAT, Label Studio, and Roboflow all adopted SAM 2 integration as a built-in feature.
Chapter 10 - Grounding DINO 1.5 / 1.6 - Drawing Boxes from Text
IDEA Research's **Grounding DINO** is the model that made "open-vocabulary detection" the standard. The 1.0 came in 2023, 1.5 Pro/Edge in 2024, and 1.6 in late 2024.
Traditional YOLO and Detectron only detect classes (80 or 1203) seen during training. Grounding DINO is different - it draws boxes from natural language prompts like **"red car"**, **"door handle"**, or **"person holding an umbrella"**.
```python
from groundingdino.util.inference import load_model, load_image, predict

model = load_model('GroundingDINO_SwinT_OGC.cfg.py', 'groundingdino_swint_ogc.pth')
image_source, image = load_image('input.jpg')

# Natural language prompt - separate noun phrases with periods
TEXT_PROMPT = 'red car. person holding umbrella. door handle.'
BOX_THRESHOLD = 0.35
TEXT_THRESHOLD = 0.25

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_THRESHOLD,
    text_threshold=TEXT_THRESHOLD,
)
```
**Grounding SAM** - the pipeline of Grounding DINO catching boxes plus SAM (or SAM 2) generating masks inside those boxes. The de facto starting point for labeling in 2026.
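```python
import torch
from torchvision.ops import box_convert
# 1) Grounding DINO for boxes (normalized cxcywh)
boxes, _, _ = predict(model_gdino, image, 'cat. dog.', 0.35, 0.25)
# 2) Convert to pixel xyxy, the format SAM expects
h, w = image_source.shape[:2]
boxes_xyxy = box_convert(boxes * torch.tensor([w, h, w, h]), 'cxcywh', 'xyxy')
# 3) SAM for masks, one box prompt at a time
sam_predictor.set_image(image_source)
masks = [sam_predictor.predict(box=b.numpy(), multimask_output=False)[0] for b in boxes_xyxy]
```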
These few lines are the entirety of "object segmentation without a dataset". Drawing boxes on 10,000 images by hand collapses into a 30-minute script.
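The coordinate handoff between the two models is easy to get wrong: Grounding DINO's `predict` helper returns boxes as normalized (cx, cy, w, h), while SAM's box prompt expects pixel (x1, y1, x2, y2). A dependency-free sketch of that conversion follows; the function name `cxcywh_norm_to_xyxy_px` is hypothetical.

```python
def cxcywh_norm_to_xyxy_px(box, image_w, image_h):
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    x1 = (cx - w / 2) * image_w
    y1 = (cy - h / 2) * image_h
    x2 = (cx + w / 2) * image_w
    y2 = (cy + h / 2) * image_h
    return (x1, y1, x2, y2)

# A box centered in a 100x200 image, a quarter of the width and half the height.
print(cxcywh_norm_to_xyxy_px((0.5, 0.5, 0.25, 0.5), image_w=100, image_h=200))
# (37.5, 50.0, 62.5, 150.0)
```

Skipping this step usually does not raise an error; it silently produces masks for a tiny region near the top-left corner, which is why it is worth checking explicitly.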
Chapter 11 - Florence-2 and YOLO-World - Other Open-Vocabulary Contenders
There are two more strong open-vocabulary players besides Grounding DINO.
**Florence-2** (Microsoft, 2024) - handles classification, captioning, detection, segmentation, and OCR with one model. Very compact at 0.23B and 0.77B parameters, but the quality is high. Instead of natural language prompts it uses **task tokens** like `<OD>` (detection), `<DENSE_REGION_CAPTION>`, and `<REFERRING_EXPRESSION_SEGMENTATION>`.
```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained('microsoft/Florence-2-large', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained('microsoft/Florence-2-large', trust_remote_code=True)

image = Image.open('input.jpg').convert('RGB')
prompt = '<OD>'  # Object Detection task token
inputs = processor(text=prompt, images=image, return_tensors='pt')
generated_ids = model.generate(**inputs, max_new_tokens=1024, num_beams=3)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
parsed = processor.post_process_generation(
    generated_text, task='<OD>', image_size=(image.width, image.height)
)
```
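The parsed result is a dictionary keyed by the task token. The sketch below flattens it into one record per object, assuming the `{'<OD>': {'bboxes': [...], 'labels': [...]}}` layout shown on the Florence-2 model card; `od_to_records` is a hypothetical helper and the sample values are illustrative, not real model output.

```python
def od_to_records(parsed: dict, task: str = "<OD>") -> list[dict]:
    """Pair each label with its (x1, y1, x2, y2) box from a parsed '<OD>' result."""
    result = parsed[task]
    return [
        {"label": label, "xyxy": tuple(box)}
        for label, box in zip(result["labels"], result["bboxes"])
    ]

# Illustrative sample in the assumed '<OD>' result layout.
sample = {"<OD>": {"bboxes": [[34.0, 160.0, 597.0, 371.0]], "labels": ["car"]}}
print(od_to_records(sample))
```

Other task tokens return differently shaped results, so an adapter like this is needed per task.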
**YOLO-World** (Tencent, 2024) - YOLO speed with open-vocabulary on top. 10-20x faster than Grounding DINO, detects from natural language prompts without training.
```python
from ultralytics import YOLOWorld

model = YOLOWorld('yolov8x-worldv2.pt')
model.set_classes(['red car', 'door handle', 'person holding umbrella'])
results = model.predict('input.jpg')
```
When to pick what. **Quality first - Grounding DINO 1.6**, **speed first - YOLO-World**, **multiple tasks in one model - Florence-2**. These three form the 2026 open-vocabulary triangle.
Chapter 12 - VLM (Vision Language Model) - "Ask the Image"
After GPT-4V appeared in late 2023, VLMs became a new layer in computer vision. As of May 2026 the major VLMs are:
**Closed source (API)**
- **GPT-4o** / **GPT-4.5** (OpenAI)
- **Claude 3.5 Sonnet** / **Claude Opus 4** (Anthropic)
- **Gemini 2.0 Flash** / **Gemini 2.0 Pro** (Google)
**Open source**
- **Qwen2-VL** / **Qwen2.5-VL** (Alibaba) - 2B/7B/72B
- **InternVL 2.5** (OpenGVLab) - 1B/2B/4B/8B/26B/38B/78B
- **LLaVA-OneVision** (ByteDance/UW) - 0.5B/7B/72B
- **CogVLM2** (Zhipu) - 19B
- **Phi-3.5-vision** (Microsoft) - 4.2B
- **Pixtral 12B** (Mistral) - 12B
VLM usage splits into two patterns.
**(a) Vision QA** - "How many people are in this photo?"
```python
import base64
from anthropic import Anthropic

client = Anthropic()
with open('input.jpg', 'rb') as f:
    img_b64 = base64.standard_b64encode(f.read()).decode('utf-8')

message = client.messages.create(
    model='claude-3-5-sonnet-20241022',
    max_tokens=1024,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'image', 'source': {'type': 'base64', 'media_type': 'image/jpeg', 'data': img_b64}},
            {'type': 'text', 'text': 'How many people are in this image?'}
        ]
    }]
)
```
**(b) Structured Output** - "Extract items and prices from this receipt as JSON"
```python
# Forcing a JSON schema finishes OCR plus parsing in one call
schema = {
    'type': 'object',
    'properties': {
        'items': {'type': 'array', 'items': {
            'type': 'object',
            'properties': {
                'name': {'type': 'string'},
                'price': {'type': 'number'},
                'quantity': {'type': 'integer'}
            }
        }},
        'total': {'type': 'number'}
    }
}
```
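A schema constrains the shape of the reply, not its correctness, so it is worth checking the extracted numbers before trusting them. The following is a minimal sketch, assuming the reply arrives as a JSON string matching the schema above and that `price` is a per-unit price. The helper `check_receipt` and the sample reply are hypothetical.

```python
import json

def check_receipt(raw_json: str, tolerance: float = 0.01) -> dict:
    """Parse a receipt reply and flag whether the line items add up to the total."""
    data = json.loads(raw_json)
    for key in ("items", "total"):
        if key not in data:
            raise ValueError(f"missing field: {key}")
    # Assumption: 'price' is the unit price, so a line costs price * quantity.
    computed = sum(item["price"] * item["quantity"] for item in data["items"])
    data["total_matches"] = abs(computed - data["total"]) <= tolerance
    return data

# Illustrative reply, not real model output.
reply = json.dumps({
    "items": [
        {"name": "latte", "price": 4.5, "quantity": 2},
        {"name": "bagel", "price": 3.0, "quantity": 1},
    ],
    "total": 12.0,
})
print(check_receipt(reply)["total_matches"])  # True
```

A failed check is a cheap signal to retry the call or route the image to a human reviewer.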
VLM limits are also clear. **(1) Weak coordinates** - "third from the top-left" is fine, but exact pixel coordinates are inaccurate. **(2) Expensive and slow** - 10,000 images on GPT-4o costs 10-50 dollars, while YOLO is free and finishes in a minute. **(3) Not deterministic** - the same question can produce different answers.
Hence the pattern: **ask "what to look for" with a VLM, find "where it is" with a traditional model.**
Chapter 13 - 3D Vision - DUSt3R, MASt3R, VGGT
The biggest change in 3D vision in 2024-2025 was that **3D reconstruction from just two photos** became possible.
**DUSt3R** (Naver Labs Europe, 2024) - takes two images and directly regresses pixel-wise 3D pointmaps. Works without knowing camera intrinsics. Compressed the complex SfM and MVS pipelines into one model.
**MASt3R** (Naver Labs, 2024) - DUSt3R plus matching. Outputs pixel correspondences between two images. Directly usable for SLAM and localization.
**VGGT** (Meta, 2025) - Visual Geometry Grounded Transformer. Takes multiple images at once and estimates camera poses, depth maps, and pointmaps simultaneously. Overcomes the pairwise limit of DUSt3R.
**Spann3R** (2024) - sequential 3D reconstruction via memory tokens. Closer to video SLAM results.
```python
from dust3r.inference import inference
from dust3r.model import AsymmetricCroCo3DStereo
from dust3r.utils.image import load_images
from dust3r.image_pairs import make_pairs

model = AsymmetricCroCo3DStereo.from_pretrained(
    'naver/DUSt3R_ViTLarge_BaseDecoder_512_dpt'
).to('cuda')
images = load_images(['img1.jpg', 'img2.jpg'], size=512)
pairs = make_pairs(images, scene_graph='complete', prefilter=None, symmetrize=True)
output = inference(pairs, model, 'cuda', batch_size=1)
# extract pointmaps and confidences from output['pred1'] and output['pred2']
```
This is a field where Korean contributions are large. The Grenoble team at Naver Labs Europe made DUSt3R, MASt3R, and CroCo.
Chapter 14 - Depth Anything v2/v3, Marigold, DepthPro
Monocular depth estimation exploded in 2024-2025.
**Depth Anything v2** (HKU, 2024) - a strong depth model trained on 62 million unlabeled images. Four sizes: Small (24M), Base (97M), Large (335M), Giant (1.3B). **Depth Anything v3** (2025) strengthened video consistency and metric depth.
**Marigold** (ETH Zürich, 2024) - Stable Diffusion fine-tuned for depth. Diffusion-based means good detail, but slow.
**DepthPro** (Apple, 2024) - estimates metric depth from one image in 0.3 seconds. The basis for iPhone depth without LiDAR.
```python
from transformers import pipeline

pipe = pipeline(task='depth-estimation', model='depth-anything/Depth-Anything-V2-Large-hf')
depth = pipe('input.jpg')['depth']  # PIL Image
```
When to pick what. **Relative depth is enough** - Depth Anything, **metric depth needed** - DepthPro, **detail first** - Marigold.
Chapter 15 - Pose Estimation - MMPose, OpenPose, AlphaPose, DWPose
Pose estimation finds keypoints. The 2026 standard catalog:
| Tool | Keypoints | Notes |
| --- | --- | --- |
| MediaPipe Pose | 33 | Mobile real-time |
| OpenPose | 25 (BODY_25) | Multi-person, older standard |
| AlphaPose | 17 (COCO) | Top-down accuracy |
| MMPose | 17-133 | Widest model catalog |
| DWPose | 133 (full body, face, hands) | The standard for ControlNet pose conditioning |
| RTMPose | 17 | Mobile real-time, under MMPose |
**DWPose** became the de facto pose standard in 2024-2026. The reason is simple - **ControlNet, AnimateDiff, and Stable Video Diffusion all accept DWPose keypoints as conditions.** Pose conditioning in generative AI is mostly DWPose.
```python
from mmpose.apis import inference_topdown, init_model

config = 'configs/wholebody_2d_keypoint/rtmpose/cocktail14/rtmw-x_8xb320-270e_cocktail14-384x288.py'
checkpoint = 'rtmw-x_simcc-cocktail14_pt-ucoco_270e-384x288.pth'
model = init_model(config, checkpoint, device='cuda')
# person_boxes comes from an upstream person detector (xyxy format)
results = inference_topdown(model, 'input.jpg', bboxes=person_boxes)
# results[i].pred_instances.keypoints, .keypoint_scores
```
Chapter 16 - Object Tracking - ByteTrack, BoT-SORT, OC-SORT, DEVA
Tracking assigns the same ID to the same object across video frames. The 2026 standard narrows to four:
| Tracker | Input | Notes |
| --- | --- | --- |
| ByteTrack (2022) | Box plus confidence | Two-stage matching for low-confidence boxes. Most widely used |
| BoT-SORT (2022) | Box plus ReID embedding | Camera motion compensation |
| OC-SORT (2023) | Box | Observation-centric, robust to occlusion |
| DEVA (2023) | Box plus mask | Pairs with SAM, video segmentation tracking |
ByteTrack is integrated into Roboflow Supervision, so it is a one-liner:
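```python
import supervision as sv

# Recent Supervision releases name these track_activation_threshold and lost_track_buffer
tracker = sv.ByteTrack(track_activation_threshold=0.5, lost_track_buffer=30)
for frame, detections in stream():  # stream() is a placeholder for your frame source
    detections = tracker.update_with_detections(detections)
    # detections.tracker_id holds IDs
```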
Remember also that **SAM 2 itself acts as a tracker**. One click in one frame, and the mask propagates across the whole video. Tracking and segmentation are now merged into one model.
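Under the hood, box-based trackers associate detections with existing tracks by overlap. The sketch below shows the core idea with a greedy IoU matcher. It is a deliberate simplification: real ByteTrack adds Kalman-filter motion prediction, Hungarian assignment, and a second matching pass for low-confidence boxes. The names `iou` and `greedy_match` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def greedy_match(tracks, detections, min_iou=0.3):
    """Assign each detection to the best unmatched track, highest IoU first.

    tracks: dict of track_id -> last known box
    detections: list of boxes in the current frame
    Returns a list with one track_id (or None) per detection.
    """
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    assigned = [None] * len(detections)
    used_tracks = set()
    for score, tid, j in pairs:
        if score < min_iou:
            break  # remaining pairs overlap even less
        if tid in used_tracks or assigned[j] is not None:
            continue
        assigned[j] = tid
        used_tracks.add(tid)
    return assigned

tracks = {1: (0, 0, 10, 10), 2: (50, 50, 60, 60)}
detections = [(51, 50, 61, 60), (1, 0, 11, 10), (200, 200, 210, 210)]
print(greedy_match(tracks, detections))  # [2, 1, None]
```

A detection that comes back as `None` would start a new track in a full tracker, and a track with no match would be kept alive for a few frames before being dropped.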
Chapter 17 - Diffusion-Based Vision - ControlNet, IP-Adapter
Generative vision is no longer "another field". ControlNet and IP-Adapter take detection or segmentation outputs as input conditions and generate new images.
- **ControlNet** - takes Canny edge, depth, pose (DWPose), segmentation, normal as conditions
- **IP-Adapter** - takes an image itself as style or content condition
- **T2I-Adapter** - a lighter alternative to ControlNet
```python
import torch
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained(
    'thibaud/controlnet-openpose-sdxl-1.0', torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    controlnet=controlnet,
    torch_dtype=torch.float16
).to('cuda')

# compute_dwpose is a placeholder for your own DWPose preprocessing step
pose_image = compute_dwpose(input_image)  # DWPose keypoint image
image = pipe('a person dancing in the rain', image=pose_image).images[0]
```
This snippet is the standard **"vision recognition -> vision generation"** pipeline. Detection and generation now live in the same toolchain.
Chapter 18 - Embedding Models - CLIP, SigLIP, DINOv2, DINOv3
Models that turn images into vectors are the most-called computer vision models in 2026. Image search, dedup, clustering, and zero-shot classification all run on embeddings.
| Model | Training signal | Dim | Strength |
| --- | --- | --- | --- |
| CLIP (2021) | Image-text pairs | 512/768 | Aligned with text |
| OpenCLIP | Same, larger data | Same | Stronger baseline |
| SigLIP (2023) | Sigmoid loss | Same | More efficient than CLIP |
| SigLIP 2 (2025) | Multi-task | Same | Strong on OCR, documents |
| DINOv2 (2023) | Self-supervised (SSL) | 768/1024/1536 | Text-independent, strong features |
| DINOv3 (2025) | Self-supervised, larger | Same | Successor to DINOv2 |
When to pick which embedding:
- **Text-to-image search** -> CLIP or SigLIP 2
- **Image-to-image search** -> DINOv3
- **Downstream classification head training** -> DINOv3
- **OCR or document work** -> SigLIP 2
```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

model_id = 'facebook/dinov3-vitl16-pretrain-lvd1689m'
model = AutoModel.from_pretrained(model_id)
processor = AutoImageProcessor.from_pretrained(model_id)

image = Image.open('input.jpg').convert('RGB')
inputs = processor(images=image, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
embedding = outputs.last_hidden_state.mean(dim=1)  # (1, 1024)
```
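Once images are vectors, search and dedup reduce to comparing vectors. The sketch below shows cosine-similarity retrieval in plain Python, using tiny made-up 3-dimensional vectors in place of real 1024-dimensional embeddings; `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, index, k=2):
    """Return the k most similar (score, name) pairs from a name -> vector index."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in index.items()]
    return sorted(scored, reverse=True)[:k]

# Toy index; real entries would be model embeddings.
index = {
    "cat_1.jpg": [1.0, 0.0, 0.0],
    "cat_2.jpg": [0.9, 0.1, 0.0],
    "car_1.jpg": [0.0, 0.0, 1.0],
}
for score, name in top_k([1.0, 0.0, 0.0], index):
    print(f"{name}: {score:.3f}")
```

A linear scan like this is fine for a few thousand images; larger collections usually move to an approximate nearest-neighbor index.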
Chapter 19 - Annotation Tools - CVAT, Label Studio, Roboflow, VIA
Labeling tools in 2026 share one common trait - **AI-assist is the default**. SAM, Grounding DINO, and YOLO are called directly inside the tool.
| Tool | License | Strength |
| --- | --- | --- |
| CVAT | MIT, open source | Widest format support, strong on video |
| Label Studio | Apache 2.0 | Unified NLP/audio/image with ML backends |
| Roboflow Annotate | Commercial SaaS | SAM and Grounding DINO integration, collaboration |
| VIA | BSD, open source | Lightweight single HTML file |
**Team-scale work** - Roboflow or CVAT, **model integration** important - Label Studio, **start within 5 minutes** - VIA.
Chapter 20 - Inference Runtimes - ONNX Runtime, TensorRT, OpenVINO
Running trained models fast is what an inference runtime does. The 2026 options:
| Runtime | Target | Strength |
| --- | --- | --- |
| ONNX Runtime | Cross-platform | CPU/GPU/NPU, most standard |
| TensorRT | NVIDIA GPU | Top speed, INT8 and FP8 quantization |
| OpenVINO | Intel CPU/iGPU/NPU | Best on x86 PC |
| CoreML | Apple Silicon | iOS/macOS, leverages ANE |
| TFLite | Mobile (Android) | XNNPACK, Hexagon |
| NCNN | Mobile/embedded | Tencent, ARM-optimized |
| MNN | Mobile/embedded | Alibaba, strong OpenCL |
YOLO can export all of them in one line:
```python
from ultralytics import YOLO

model = YOLO('yolo11n.pt')
model.export(format='onnx', dynamic=True, simplify=True)
model.export(format='engine', half=True)    # TensorRT FP16
model.export(format='openvino', int8=True)  # OpenVINO INT8
model.export(format='coreml', nms=True)     # CoreML
model.export(format='tflite', int8=True)    # TFLite INT8
model.export(format='ncnn')                 # NCNN
```
**Selection rules.** NVIDIA GPU server -> TensorRT. Intel PC -> OpenVINO. iOS/macOS -> CoreML. Android -> TFLite. Not sure -> ONNX Runtime.
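The selection rules fit in a lookup table, which is a convenient way to keep the choice in one place inside a deployment script. This is a minimal sketch; the target keys such as `nvidia_gpu` are hypothetical labels chosen for this example.

```python
# Deployment target -> runtime, following the selection rules above.
RUNTIME_BY_TARGET = {
    "nvidia_gpu": "TensorRT",
    "intel_pc": "OpenVINO",
    "apple": "CoreML",
    "android": "TFLite",
}

def pick_runtime(target: str) -> str:
    """Return the preferred runtime, falling back to ONNX Runtime when unsure."""
    return RUNTIME_BY_TARGET.get(target.lower(), "ONNX Runtime")

print(pick_runtime("nvidia_gpu"))  # TensorRT
print(pick_runtime("raspberry_pi"))  # ONNX Runtime
```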
Chapter 21 - Mobile Vision - ML Kit, Vision Framework, MNN
When you ship vision in a mobile app, native frameworks are the safest bet.
**Google ML Kit** (Android/iOS) - faces, barcodes, text, landmarks, translation, pose. On-device or server options.
**Apple Vision Framework** (iOS/macOS) - over 100 vision requests like VNDetectFaceRectangles and VNDetectHumanBodyPoseRequest. VNGenerateForegroundInstanceMaskRequest, added in 2023 with iOS 17, plays the role of a mobile SAM.
**MNN** (Alibaba) - the de facto standard in the Chinese mobile ecosystem. Embedded in Alibaba, Pinduoduo, and ByteDance apps.
```swift
// Swift / Apple Vision
import Vision

let request = VNDetectHumanBodyPoseRequest { request, error in
    guard let obs = request.results as? [VNHumanBodyPoseObservation] else { return }
    for observation in obs {
        let points = try? observation.recognizedPoints(.all)
        // points holds keypoints
    }
}
// cgImage is the CGImage you want to analyze
let handler = VNImageRequestHandler(cgImage: cgImage)
try? handler.perform([request])
```
**Rule.** Standard ML features (faces, text, barcodes) - native, fast and free. Custom models - CoreML (iOS) and TFLite (Android), ship your own model.
Chapter 22 - The Korean Vision Ecosystem
Korea's computer vision contributions are heavy on both the academic and industrial sides.
**Naver Labs Europe** - the Grenoble team behind DUSt3R, MASt3R, and CroCo. Frontier in 3D vision worldwide.
**KAIST CVLab** - groups of Professors In So Kweon, Yong Man Ro, and Eunbyung Park. Regular publications at CVPR and ICCV.
**Lunit** - medical imaging AI. Holds many FDA and CE clearances for chest X-ray, mammography, and digital pathology. INSIGHT CXR, MMG, BCC are its flagship products.
**Seerslab** - AR/VR/vision. Naver Z subsidiary, runs face recognition and filter engines for ZEPETO and SNOW.
**VUNO** - medical imaging AI. DeepASR, DeepCT, DeepBrain.
**MakinaRocks** - industrial vision anomaly detection. Semiconductor and display inspection.
**Riiid / Trinity / Innospace** - education and defense.
Korean academia is strong in **faces, OCR, autonomous driving, and medical imaging**. Korea sits in the world's top 4-5 for CVPR publications.
Chapter 23 - The Japanese Vision Ecosystem
**Preferred Networks (PFN)** - Japan's largest AI company. PFN-Vision library, vision for chemistry, materials, and robotics.
**ABEJA** - Tokyo-based vision SaaS. Store analytics for retail and manufacturing.
**Recruit Holdings** - vision recruiting systems via subsidiaries like Indeed and JOBSRU.
**ALBERT (now Accenture Japan)** - industrial AI and vision consulting.
**Nikon AI** - medical imaging and industrial inspection. A camera company expanded into vision AI.
**LeapMind** - strong embedded vision via the Blueoil quantized inference engine.
**SoftBank Robotics** - vision systems on Pepper and NAO.
**Fast Retailing (UNIQLO)** - in-house vision library for store camera analytics.
Japanese academia is strong in **OCR, document vision, and robotics vision**. UTokyo IIS, Kyoto U, Nagoya U, and Tokyo Tech regularly publish at CVPR.
Chapter 24 - The Posture of a 2026 Computer Vision Engineer
Boiling down 24 chapters into a single page:
**(1) Avoid training whenever possible.** A combination of pretrained models, open-vocabulary models, and VLMs covers 80%. Refining a Grounding DINO prompt takes less time than collecting 10,000 images.
**(2) Design pipelines, do not train models.** The job in 2026 is "which model to call, in what order", not "which loss to train on".
**(3) Divide labor between VLMs and traditional models.** Ask the VLM "what to find", ask the traditional model "where it is". This is the standard shape of a 2026 vision system.
**(4) Do not underestimate the inference runtime.** A model that runs at 30 fps in PyTorch can hit 200 fps in TensorRT. A gap of more than 6x is a different product in user experience.
**(5) Check the license.** Ultralytics YOLO is AGPL-3.0. Dropping it straight into a closed-source company product triggers source disclosure. Know RT-DETR, D-FINE, and MMDetection as alternatives.
**(6) Leave "pixel work" to OpenCV.** Color spaces, resizing, decoding, encoding - OpenCV is still the fastest and most stable.
**(7) Labeling tools come before datasets.** Start with CVAT or Roboflow with SAM 2 integration. The era of drawing boxes by hand is over.
The future of a vision engineer is **not "the person who makes the model" but "the person who composes the models"**. There are many tools. Stand in the right line.
References
- OpenCV - https://opencv.org/
- OpenCV 5.x roadmap - https://github.com/opencv/opencv/wiki/OpenCV-5
- MediaPipe - https://developers.google.com/mediapipe
- MediaPipe Tasks API - https://developers.google.com/mediapipe/solutions/tasks
- Detectron2 - https://github.com/facebookresearch/detectron2
- Detectron2 Model Zoo - https://github.com/facebookresearch/detectron2/blob/main/MODEL_ZOO.md
- Ultralytics YOLO - https://github.com/ultralytics/ultralytics
- Ultralytics docs - https://docs.ultralytics.com/
- MMDetection - https://github.com/open-mmlab/mmdetection
- MMPose - https://github.com/open-mmlab/mmpose
- Roboflow - https://roboflow.com/
- Roboflow Supervision - https://github.com/roboflow/supervision
- HuggingFace Transformers Vision - https://huggingface.co/docs/transformers/main/en/tasks/object_detection
- Segment Anything 2 - https://github.com/facebookresearch/sam2
- SAM 2 paper - https://ai.meta.com/research/publications/sam-2-segment-anything-in-images-and-videos/
- Grounding DINO - https://github.com/IDEA-Research/GroundingDINO
- Grounding SAM - https://github.com/IDEA-Research/Grounded-Segment-Anything
- Florence-2 - https://huggingface.co/microsoft/Florence-2-large
- YOLO-World - https://github.com/AILab-CVC/YOLO-World
- DUSt3R - https://github.com/naver/dust3r
- MASt3R - https://github.com/naver/mast3r
- VGGT - https://github.com/facebookresearch/vggt
- Depth Anything v2 - https://github.com/DepthAnything/Depth-Anything-V2
- Marigold - https://github.com/prs-eth/Marigold
- Apple DepthPro - https://github.com/apple/ml-depth-pro
- DINOv3 - https://github.com/facebookresearch/dinov3
- CLIP - https://github.com/openai/CLIP
- SigLIP 2 - https://huggingface.co/collections/google/siglip2
- DWPose - https://github.com/IDEA-Research/DWPose
- ByteTrack - https://github.com/ifzhang/ByteTrack
- ONNX Runtime - https://onnxruntime.ai/
- TensorRT - https://developer.nvidia.com/tensorrt
- OpenVINO - https://docs.openvino.ai/
- Apple CoreML - https://developer.apple.com/documentation/coreml
- TFLite - https://www.tensorflow.org/lite
- NCNN - https://github.com/Tencent/ncnn
- MNN - https://github.com/alibaba/MNN
- CVAT - https://github.com/cvat-ai/cvat
- Label Studio - https://labelstud.io/
- Naver Labs Europe - https://europe.naverlabs.com/
- Lunit - https://www.lunit.io/
- Preferred Networks - https://www.preferred.jp/en/