The 2026 Vision Model Development & Fine-Tuning Guide — CNN, ViT, DETR, SAM 2, VLMs and a Real Decision Tree
Author: Youngju Kim (@fjvbn20031)
Prologue — The Real Question a 2026 Vision Engineer Faces
A one-line ticket on a Monday morning in May 2026:
"Mark defective parts in our factory line camera feed. 12 classes, 300k images/day, accuracy 95% or better, runs on a Jetson Orin Nano at 30 fps."
In 2018 the answer would have been obvious — ResNet50 backbone, RetinaNet head, COCO pretraining, fine-tune on your data. Done.
In 2026 there are roughly eight answers.
- Just fine-tune YOLOv11/12.
- Use RT-DETR for transformer-based detection.
- Run SAM 2 for masks and stack a classifier on top.
- Prompt a vision foundation model like Florence-2.
- Send the photo to Gemini 2.5 Vision or Claude Vision and parse the natural-language result.
- Extract CLIP embeddings and do kNN classification.
- Use OWLv2 / Grounding DINO for text-prompted zero-shot detection.
- Pipeline two or three of the above.
This guide is about when, why, and how to choose among those eight. Not "the latest SOTA wins" — a decision tree across data size, accuracy target, latency budget, and operating cost.
The core skill of a 2026 vision engineer is no longer "train a model." It is "decide which model to train, and whether to train at all."
1. Architecture Families — From CNN to VLM at a Glance
Every vision model is "image to something." The "something" decides the family.
| Family | Representative models | First appeared | Input handling | Strength | Weakness |
|---|---|---|---|---|---|
| CNN | ResNet, EfficientNet, ConvNeXt v2 | 2012~ | Convolution + pooling | Small data, fast | Weak global context |
| ViT | ViT, DeiT, Swin v2, EVA-02 | 2020~ | Patches + self-attention | Strongest when data is plentiful | Weak with small data |
| DETR family | DETR, Deformable DETR, RT-DETR | 2020~ | Encoder-decoder + queries | NMS-free detection | Slow convergence |
| SAM family | SAM, SAM 2, HQ-SAM | 2023~ | ViT backbone + mask decoder | Promptable segmentation | No semantic labels |
| VLM | LLaVA-1.6, Qwen2.5-VL, Gemini Vision, Claude Vision, GPT-4V | 2023~ | Image encoder + LLM | Natural-language reasoning, OCR, VQA | Expensive, slow, non-deterministic |
| Multimodal foundation | Florence-2, InternVL3, DINOv2 | 2023~ | Unified ViT, multi-task heads | Zero-shot, few-shot | Fine-tuning is non-trivial |
Memorize this table. The chapters that follow expand each row.
Core principle: every vision model ultimately "sees the image as a collection of tokens" — a CNN as a spatial grid of features, a ViT as a patch sequence, SAM as image embeddings combined with prompt tokens, a VLM as input tokens to an LLM. Representation defines the model.
2. CNNs Are Not Dead — Where They Win in 2026
Despite marketing claims that ViT ate everything, CNNs are very much alive in 2026. Especially in these situations.
Pick a CNN when
- Data is small — under 10k labeled images. ViT without pretraining barely learns.
- You deploy at the edge — Jetson, Coral, mobile. A ConvNeXt-Tiny beats a ViT-Tiny at the same FLOPs.
- Latency is brutal — sub-millisecond. A small CNN can hit 0.5 ms on a GPU.
- Resolution is huge — 4K medical imaging. ViT patch counts explode.
One-line model loading with timm
PyTorch Image Models (timm) remains the de-facto standard for vision backbones in 2026. Over 1,000 pretrained backbones, one line.
```python
import timm
import torch

# ConvNeXt v2 large, ImageNet-22k pretrain, 22k-to-1k fine-tune
model = timm.create_model(
    'convnextv2_large.fcmae_ft_in22k_in1k_384',
    pretrained=True,
    num_classes=12,  # our task's class count
)

# The model tells you the input transform it expects
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=True)

# Tensor shape including the batch dim is B x C x H x W
x = torch.randn(2, 3, 384, 384)
logits = model(x)  # shape: 2 x 12
```
Use timm.list_models('convnext*', pretrained=True) to see candidates. The pretrained_cfg contains mean, std, input_size, and crop_pct, so transforms stay consistent.
CNN training recipe — making small data work
- Pretrained backbone, then head-only training, then full fine-tuning — three stages (see the sketch after this list).
- Mixup, CutMix, RandAugment — almost mandatory below 10k labels.
- EMA (Exponential Moving Average) — 1–2pp of validation accuracy for free.
- Cosine schedule with a short warmup — OneCycleLR works too.
- AdamW with weight decay 0.05 — forget the old SGD.
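Putting the recipe together, here is a minimal training-loop sketch using timm's Mixup, SoftTargetCrossEntropy, and ModelEmaV2 utilities. The checkpoint name, learning rate, epoch count, and `train_loader` are placeholder assumptions, not a tuned recipe.

```python
import torch
import timm
from timm.data import Mixup
from timm.loss import SoftTargetCrossEntropy
from timm.utils import ModelEmaV2

model = timm.create_model('convnextv2_tiny.fcmae_ft_in22k_in1k', pretrained=True, num_classes=12).cuda()
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0, label_smoothing=0.1, num_classes=12)
criterion = SoftTargetCrossEntropy()   # Mixup/CutMix produce soft targets
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
epochs = 50                            # placeholder
# Warmup omitted for brevity; add a LinearLR warmup or swap in OneCycleLR.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
ema = ModelEmaV2(model, decay=0.999)

for epoch in range(epochs):
    model.train()
    for images, targets in train_loader:            # train_loader is assumed to exist
        images, targets = images.cuda(), targets.cuda()
        images, targets = mixup_fn(images, targets)  # soft targets from Mixup/CutMix
        loss = criterion(model(images), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        ema.update(model)
    scheduler.step()

# Evaluate ema.module rather than model; EMA weights usually score 1–2pp higher.
```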
3. ViT — The Champion When Data Is Plentiful
A ViT slices an image into patches like 16x16, treats them as a sequence, and stacks a Transformer on top. The key insight was "a model with less inductive bias beats a CNN if you give it enough data."
ViT variants worth knowing in 2026
- Swin Transformer v2 — window attention, efficient at high resolution.
- DeiT III — data-efficient training recipe.
- EVA-02 — masked image modeling (MIM) pretraining, very strong ImageNet transfer.
- DINOv2 — self-supervised, a powerful backbone trained without labels.
- SigLIP / SigLIP 2 — contrastive learning, strong image-text embeddings.
When to pick a ViT
| Condition | Recommendation |
|---|---|
| 100k+ labeled images | ViT or Swin |
| Few labels but 1M+ images for self-supervised pretraining | DINOv2 pretrain plus head-only fine-tuning |
| OCR, text-heavy imagery | SigLIP 2 or ViT-L patch 14 |
| Multilingual OCR, table understanding | InternVL3's ViT backbone |
One-line ViT classifier with Hugging Face transformers
```python
from transformers import AutoImageProcessor, AutoModelForImageClassification
import torch
from PIL import Image

processor = AutoImageProcessor.from_pretrained('facebook/dinov2-large')
model = AutoModelForImageClassification.from_pretrained(
    'facebook/dinov2-large',
    num_labels=12,
    ignore_mismatched_sizes=True,  # reinitialize the head
)

img = Image.open('part.jpg').convert('RGB')
inputs = processor(images=img, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
pred = outputs.logits.argmax(-1).item()
```
Swap facebook/dinov2-large for microsoft/swinv2-large-patch4-window12-192-22k and the API is identical.
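To follow the "DINOv2 pretrain plus head-only fine-tuning" row from the table above, the same model works with the backbone frozen. A minimal sketch, continuing the example; in this checkpoint the new head is the `classifier` module, which is worth confirming via `model.named_parameters()`.

```python
# Freeze everything except the freshly initialized classification head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith('classifier')

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-3,
    weight_decay=0.05,
)
# Train as usual; only the linear head changes, so a few hundred labels per class can suffice.
```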
4. Object Detection — YOLO vs DETR vs RT-DETR
Detection solves "what is where" simultaneously. As of 2026 the practical space splits in two.
The YOLO family — overwhelming production share
From Ultralytics YOLOv8 through v12, plus YOLO-NAS, YOLOv9, YOLOv10. For speed and deployment ergonomics, YOLO still wins.
```python
from ultralytics import YOLO

# Load a pretrained model
model = YOLO('yolo11x.pt')

# Fine-tune on your dataset
results = model.train(
    data='factory_parts.yaml',  # train/val paths plus class names
    epochs=100,
    imgsz=640,
    batch=32,
    device=0,
    optimizer='AdamW',
    lr0=0.001,
    cos_lr=True,
    patience=20,  # early stopping
    project='runs/factory',
)

# Export to ONNX (edge deployment)
model.export(format='onnx', dynamic=True, simplify=True)

# Export to TensorRT (Jetson)
model.export(format='engine', half=True)
```
factory_parts.yaml:
```yaml
path: /data/factory
train: images/train
val: images/val
names:
  0: scratch
  1: dent
  2: discoloration
  3: missing_screw
```
YOLO's weaknesses: NMS-based detection struggles on dense or tiny objects, and global context is weaker than DETR-class models.
The DETR family — NMS-free transformer detection
DETR removed NMS by using object queries and Hungarian matching. In 2026 the practical choice is RT-DETR.
```python
from transformers import RTDetrV2ForObjectDetection, RTDetrImageProcessor

processor = RTDetrImageProcessor.from_pretrained('PekingU/rtdetr_v2_r50vd')
model = RTDetrV2ForObjectDetection.from_pretrained(
    'PekingU/rtdetr_v2_r50vd',
    num_labels=12,
    ignore_mismatched_sizes=True,
)
```
YOLO vs DETR — the decision
| Situation | Recommendation |
|---|---|
| Edge inference at 30 fps or more | YOLO |
| COCO-ish object density | YOLO |
| Dense, tiny objects (satellite, medical) | RT-DETR or Co-DETR |
| Text-prompted detection of unseen classes | Grounding DINO / OWLv2 |
| Video tracking too | YOLO plus ByteTrack, or SAM 2 |
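For the video-tracking row, Ultralytics ships a ByteTrack configuration, so YOLO plus ByteTrack is a one-call sketch. The checkpoint path reuses the training run above and the video file name is a placeholder.

```python
from ultralytics import YOLO

# Detect and track across frames with the fine-tuned detector plus ByteTrack.
model = YOLO('runs/factory/best.pt')     # checkpoint path from the training run above
results = model.track(
    source='line_camera.mp4',            # placeholder video file
    tracker='bytetrack.yaml',            # built-in ByteTrack config shipped with Ultralytics
    conf=0.25,
    persist=True,                        # keep track IDs when calling frame by frame
)
```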
When OpenMMLab fits
The mmdetection, mmsegmentation, and broader OpenMMLab ecosystem shines in research and experimentation. You can compare 50-plus detection models in one codebase and swap backbones with a config line. But the learning curve is steep, deployment is a separate toolchain, and for a production-first model Ultralytics ships much faster.
5. Segmentation and SAM 2
Segmentation is "where, per pixel." In 2026 SAM 2 is the starting point for almost every case.
SAM 2 — promptable segmentation for both images and video
Released by Meta in July 2024, SAM 2 takes a click, box, or mask prompt and "segments any object, then auto-tracks it across video frames." The core ideas:
- Unified image + video — one model for both.
- Memory attention — past-frame representations are stored to track objects over time.
- Promptable — segmentation is driven by user input (clicks, boxes), not predefined masks.
```python
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor
import numpy as np
from PIL import Image

sam2 = build_sam2('configs/sam2.1/sam2.1_hiera_l.yaml', 'sam2.1_hiera_large.pt')
predictor = SAM2ImagePredictor(sam2)

img = np.array(Image.open('part.jpg'))
predictor.set_image(img)

# One click to a mask
point_coords = np.array([[450, 320]])
point_labels = np.array([1])  # 1 = foreground
masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,
)
```
How to actually use SAM 2 — "SAM segments, something else labels"
SAM 2 does not tell you what something is. It says "this mask is object 1, that mask is object 2." For semantic labels you need a two-stage pipeline.
- SAM 2 produces masks.
- Crop each mask and classify with CLIP or SigLIP.
This is the 2026 standard recipe for "zero-shot segmentation." Works without any labeled data.
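A minimal sketch of the second stage, cropping each SAM 2 mask from the predictor above and scoring it against text prompts with CLIP. The checkpoint choice, prompt wording, and class names (reused from the factory example) are assumptions to tune.

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained('openai/clip-vit-large-patch14')
clip_proc = CLIPProcessor.from_pretrained('openai/clip-vit-large-patch14')
class_names = ['scratch', 'dent', 'discoloration', 'missing_screw']
prompts = [f'a photo of a {c} on a metal part' for c in class_names]

def label_mask(image: np.ndarray, mask: np.ndarray) -> str:
    """Crop the mask's bounding box, zero out background pixels, classify with CLIP."""
    ys, xs = np.where(mask)
    crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1].copy()
    crop[~mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]] = 0
    inputs = clip_proc(text=prompts, images=Image.fromarray(crop),
                       return_tensors='pt', padding=True)
    with torch.no_grad():
        logits = clip(**inputs).logits_per_image  # (1, num_prompts)
    return class_names[logits.argmax(-1).item()]

# img and masks come from the SAM 2 example above; masks are boolean arrays.
labels = [label_mask(img, m.astype(bool)) for m in masks]
```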
Where HQ-SAM, MobileSAM, and EfficientSAM fit
- HQ-SAM — sharper boundaries for medical and satellite imagery.
- MobileSAM / EfficientSAM — lightweight for edge devices.
- SAM 2.1 — improved video tracking precision.
6. VLMs — The Era of Talking to Images in Natural Language
A Vision-Language Model "turns images into input tokens for an LLM." Users ask in natural language; the model answers in natural language.
The major VLMs in 2026
| Model | Strengths | Weaknesses |
|---|---|---|
| Claude Sonnet/Opus Vision | Charts, diagrams, document OCR, reasoning | API-only, price |
| Gemini 2.5 Pro Vision | Long video, multi-image, multilingual OCR | API-only |
| GPT-5 Vision | General reasoning, code integration | API-only |
| Qwen2.5-VL 72B / 7B | Open weights, GUI understanding, video | You host it |
| LLaVA-OneVision / LLaVA-1.6 | Public recipe, research-friendly | Behind the frontier |
| InternVL3 78B / 8B | Multi-image, documents, open-weight leader | Heavy VRAM needs |
| Molmo | Strong pointing capability, transparent data | Average overall accuracy |
| Pixtral 12B | Mistral's open VLM | Weak OCR |
Using a VLM "just by prompting"
When you barely have labels, when classes change weekly, or when "explain why" is required — sending a photo and a system prompt to a VLM gives the highest ROI.
```python
import anthropic
import base64

client = anthropic.Anthropic()

with open('part.jpg', 'rb') as f:
    img_b64 = base64.standard_b64encode(f.read()).decode()

resp = client.messages.create(
    model='claude-opus-4-7',
    max_tokens=512,
    system=(
        'You are an automotive parts inspector. Look at the photo and answer ONLY '
        'with this JSON. schema: {"defect": one of [scratch, dent, discoloration, '
        'missing_screw, none], "severity": one of [low, medium, high], '
        '"reasoning": short string}'
    ),
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'image', 'source': {'type': 'base64', 'media_type': 'image/jpeg', 'data': img_b64}},
            {'type': 'text', 'text': 'Inspect this part.'},
        ],
    }],
)
print(resp.content[0].text)
```
VLM limits — they are not always the right answer
- Cost — 1 to 5 cents per image. 300k images/day means 90k–450k USD/month. Compare that to training a custom model.
- Latency — 200 ms to 2 s. Useless at 30 fps on edge.
- Non-determinism — slightly different answers on the same photo. Threshold-based decisions need calibration.
- Ignoring JSON — must be enforced via response_format or tool use (see the sketch after this list).
- Data governance — can the photo legally go to an external API? Loop in legal before you ship.
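One way to enforce the schema with the Anthropic API is to define a single tool and force the model to call it. A minimal sketch, reusing `client` and `img_b64` from the example above; the tool name and schema are illustrative.

```python
inspection_tool = {
    'name': 'record_inspection',
    'description': 'Record the defect inspection result for one part photo.',
    'input_schema': {
        'type': 'object',
        'properties': {
            'defect': {'enum': ['scratch', 'dent', 'discoloration', 'missing_screw', 'none']},
            'severity': {'enum': ['low', 'medium', 'high']},
            'reasoning': {'type': 'string'},
        },
        'required': ['defect', 'severity', 'reasoning'],
    },
}

resp = client.messages.create(
    model='claude-opus-4-7',
    max_tokens=512,
    tools=[inspection_tool],
    tool_choice={'type': 'tool', 'name': 'record_inspection'},  # force a structured tool call
    messages=[{'role': 'user', 'content': [
        {'type': 'image', 'source': {'type': 'base64', 'media_type': 'image/jpeg', 'data': img_b64}},
        {'type': 'text', 'text': 'Inspect this part.'},
    ]}],
)

# The tool call's input is a plain dict that already matches the schema.
result = next(block.input for block in resp.content if block.type == 'tool_use')
```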
7. Which Task to Which Architecture — The Matrix
The preceding chapters compressed into one table. When picking the first model, this table alone gets you 80% of the way there.
| Task | Under 1k data | 10k–100k data | 100k+ data | Almost no labels |
|---|---|---|---|---|
| Image classification | CLIP zero-shot or ConvNeXt-T fine-tune | ConvNeXt-Base, ViT-B | EVA-02, DINOv2-L pretrain plus head | CLIP/SigLIP zero-shot |
| Object detection | YOLO11n plus heavy augmentation | YOLO11x or RT-DETR | DETR variants plus custom backbone pretrain | Grounding DINO, OWLv2 |
| Segmentation | SAM 2 plus CLIP labeling | SAM 2 fine-tune or Mask2Former | Mask2Former plus Swin v2 | SAM 2 zero-shot |
| OCR | TrOCR fine-tune | TrOCR or PaddleOCR | Custom training plus synthetic data | Claude/Gemini Vision |
| Captioning | BLIP-2 prompt | BLIP-2 or LLaVA fine-tune | InternVL3 fine-tune | VLM direct call |
| Visual QA | Direct VLM API call | LLaVA-OneVision LoRA | Qwen2.5-VL 72B fine-tune | Direct VLM API call |
| Anomaly detection | PatchCore, PaDiM | EfficientAD | Custom training plus synthetic defects | DINOv2 embeddings plus kNN |
One-line rule: Small data, large pretrained backbone plus a small head. Large data, custom training. No labels, VLM or embeddings.
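For the "almost no labels" column, the embeddings-plus-kNN route is short enough to sketch in full. A minimal version with a frozen DINOv2 backbone and scikit-learn; the checkpoint, neighbor count, and the `train_images`/`train_labels` reference set are assumptions.

```python
import torch
from transformers import AutoImageProcessor, AutoModel
from sklearn.neighbors import KNeighborsClassifier

proc = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
backbone = AutoModel.from_pretrained('facebook/dinov2-base').eval()

@torch.no_grad()
def embed(pil_images):
    """Return one CLS-token embedding per image, shape (N, hidden_dim)."""
    inputs = proc(images=pil_images, return_tensors='pt')
    out = backbone(**inputs)
    return out.last_hidden_state[:, 0].numpy()

# train_images / train_labels: a small labeled reference set (e.g. 50 images per class)
knn = KNeighborsClassifier(n_neighbors=5, metric='cosine')
knn.fit(embed(train_images), train_labels)
pred = knn.predict(embed(test_images))
```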
8. Data — How Much, How, From Where
The cliche "data matters more than the model" is still true in 2026.
How much data you actually need (rough numbers)
| Task | Minimum per class | Recommended | Comfortable |
|---|---|---|---|
| Classification (strong pretraining) | 50 | 500 | 5,000 |
| Object detection | 200 boxes | 2,000 boxes | 20,000 boxes |
| Segmentation (less with SAM 2) | 50 masks | 500 masks | 5,000 masks |
| OCR (line-level) | 1,000 lines | 10,000 lines | 100,000 lines |
| VLM fine-tuning (LoRA) | 200 examples | 2,000 examples | 20,000 examples |
Public datasets still alive in 2026
- ImageNet-22k / 1k — the unchanging benchmark for classification pretraining.
- COCO 2017 — detection, keypoints, captions; still the standard.
- Open Images V7 — 9M+ images with weak labels.
- LAION-5B / DataComp — large image-text pairs for CLIP-style training (check copyright).
- LVIS — 1,200+ classes, long-tail detection.
- ADE20K, Cityscapes, Mapillary — segmentation.
- DocVQA, ChartQA, InfographicVQA — document and chart VQA.
- OpenX-Embodiment, Ego4D — robotics and first-person video.
- SA-1B — SAM training data, 1.1B masks.
Labeling tools — a practical 2026 comparison
| Tool | Strengths | Weaknesses | Price |
|---|---|---|---|
| Label Studio | Open source, every task | UI feels heavy | Free / Enterprise paid |
| CVAT | Best for video detection and segmentation, open source | Self-hosting burden | Free / Cloud paid |
| Roboflow | Fast start, auto-labeling, SAM integration | Cloud-dependent | Free/Team/Enterprise tiers |
| V7 Darwin | Medical, complex workflows | Pricing | Paid |
| Encord | Video, LLM/VLM evaluation | Pricing | Paid |
| Scale AI / Surge AI | Outsourced human annotation | Per-hour or per-label cost | Service |
The 2026 labeling secret: auto-label with SAM 2 plus Grounding DINO first, then have humans correct in Roboflow or CVAT. Labeling time drops 5–10x.
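A minimal auto-labeling sketch for the detection half of that pipeline, using the Grounding DINO checkpoint exposed through transformers. The phrase list mirrors the factory classes, argument names follow recent transformers releases, and the thresholds are assumptions to tune before trusting the draft boxes.

```python
import torch
from PIL import Image
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

ckpt = 'IDEA-Research/grounding-dino-base'
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModelForZeroShotObjectDetection.from_pretrained(ckpt).eval()

img = Image.open('part.jpg').convert('RGB')
# Grounding DINO expects lowercase phrases, each ending with a period.
text = 'scratch. dent. discoloration. missing screw.'
inputs = processor(images=img, text=text, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    text_threshold=0.25,               # score/box thresholds can also be passed and tuned
    target_sizes=[img.size[::-1]],     # (H, W)
)[0]
# results['boxes'], results['scores'], results['labels'] become draft annotations for human review.
```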
9. Train vs Fine-Tune vs Prompt — The Cost Model
| Strategy | Data needed | Training cost | Inference cost | Switching cost | Accuracy ceiling |
|---|---|---|---|---|---|
| Train from scratch | 100M+ images | $100k+ | Lowest | Very high | Very high |
| Full fine-tune | 10k–1M | $10k | Low | Low | High |
| LoRA/QLoRA fine-tune | 200–50k | $1k | Low | Very low | Medium–high |
| VLM prompting | 0 to dozens of examples | $0 | Highest | $0 | Whatever the model can do |
| Embeddings plus kNN | 50–5k | $100 | Low | Low | Medium |
Rule of thumb: as data grows, training becomes economical; as inference traffic grows, owning the model becomes economical. You have to balance both axes.
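A back-of-the-envelope sketch of that trade-off; every dollar figure and throughput number below is an illustrative assumption to replace with your own quotes.

```python
# Illustrative break-even estimate: hosted VLM API vs. a self-hosted fine-tuned model.
images_per_day = 300_000
api_cost_per_image = 0.02          # USD, assumed mid-range VLM pricing
gpu_hour_cost = 1.50               # USD, assumed on-demand price for one inference GPU
images_per_gpu_hour = 200_000      # assumed throughput of a small fine-tuned detector
one_time_training_cost = 5_000     # USD, assumed labeling + fine-tuning budget

api_monthly = images_per_day * 30 * api_cost_per_image
gpu_hours_monthly = images_per_day * 30 / images_per_gpu_hour
self_hosted_monthly = gpu_hours_monthly * gpu_hour_cost

breakeven_months = one_time_training_cost / (api_monthly - self_hosted_monthly)
print(f'API: ${api_monthly:,.0f}/mo, self-hosted: ${self_hosted_monthly:,.0f}/mo, '
      f'break-even after {breakeven_months:.2f} months')
```

At this traffic the API bill dominates almost immediately; at a few hundred images per day the conclusion flips, which is exactly why both axes matter.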
LoRA / QLoRA — the practical standard for VLM fine-tuning
Full fine-tuning a VLM is painful even with 4x 80 GB VRAM. LoRA is the answer.
```python
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
import torch

model_id = 'Qwen/Qwen2.5-VL-7B-Instruct'
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map='auto',
)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # usually 0.1–1% trainable

cfg = SFTConfig(
    output_dir='runs/qwen-vl-defect',
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    lr_scheduler_type='cosine',
    warmup_ratio=0.03,
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=20,
    save_strategy='epoch',
)

# train_ds is a sequence of {"image": PIL.Image, "messages": [...]}
trainer = SFTTrainer(model=model, args=cfg, train_dataset=train_ds, processing_class=processor)
trainer.train()
```
QLoRA (4-bit quantization plus LoRA) lets you fine-tune a 7B VLM on a single 24 GB GPU. A 70B-class model fits on a single 80 GB card.
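A minimal QLoRA variant of the loading step above, using bitsandbytes 4-bit quantization through transformers; the settings follow common QLoRA defaults rather than a tuned recipe.

```python
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
import torch

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',              # NormalFloat4, the QLoRA default
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,         # second quantization of the quantization constants
)

model = AutoModelForVision2Seq.from_pretrained(
    'Qwen/Qwen2.5-VL-7B-Instruct',
    quantization_config=bnb_cfg,
    device_map='auto',
)
# Then attach the same LoraConfig as above with get_peft_model(model, lora_cfg).
```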
10. Deployment — ONNX, TensorRT, Core ML, TFLite
Training is the end of the beginning. A bad deployment choice multiplies your inference cost tenfold.
Recommended formats by target
| Target | Recommended format | Notes |
|---|---|---|
| NVIDIA GPU servers | TensorRT, or ONNX with TensorRT EP | FP16/INT8 quantization, dynamic batch |
| CPU servers | ONNX Runtime, OpenVINO | INT8 essential, lean on AVX-512 |
| Jetson (edge GPU) | TensorRT | Per-model engine build, JetPack version must match |
| iOS | Core ML | Convert via coremltools, target ANE |
| Android | TFLite, LiteRT, ONNX Runtime Mobile | NNAPI or GPU delegate |
| Web browser | ONNX Runtime Web, WebGPU | 1–50 MB model size |
| Local desktop LLM/VLM | llama.cpp (GGUF), Ollama, MLX | Apple Silicon excels |
PyTorch to ONNX to TensorRT flow
```python
from ultralytics import YOLO

model = YOLO('runs/factory/best.pt')

# 1) ONNX
model.export(format='onnx', dynamic=True, simplify=True, opset=17)

# 2) TensorRT (NVIDIA GPU) — directly supported by Ultralytics
model.export(format='engine', half=True, workspace=4)
```
For manual conversion, trtexec --onnx=model.onnx --saveEngine=model.engine --fp16 is the simplest path. INT8 quantization needs a calibration dataset.
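For serving the exported ONNX file on a GPU server, a minimal ONNX Runtime session with the TensorRT execution provider looks like this; the provider order, file name, and input size are assumptions matching the export above.

```python
import numpy as np
import onnxruntime as ort

# Prefer TensorRT, fall back to CUDA, then CPU.
session = ort.InferenceSession(
    'model.onnx',
    providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider'],
)

x = np.random.rand(1, 3, 640, 640).astype(np.float32)  # dummy input at the export resolution
outputs = session.run(None, {session.get_inputs()[0].name: x})
```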
Core ML and TFLite — mobile
```python
# Core ML
import coremltools as ct

mlmodel = ct.convert(traced_model, inputs=[ct.ImageType(shape=(1, 3, 224, 224))])
mlmodel.save('Model.mlpackage')

# TFLite (PyTorch to TF, ai-edge-torch recommended)
import ai_edge_torch

edge_model = ai_edge_torch.convert(model.eval(), sample_inputs)
edge_model.export('model.tflite')
```
11. Failure Modes — The Real Problems You Hit in Production
| Symptom | Root cause | Remedy |
|---|---|---|
| 99% val accuracy, 70% in production | Domain shift | Add production samples or apply domain adaptation |
| Only one class predicted well | Class imbalance | Focal loss, class-balanced sampler, oversampling |
| Inference memory explodes | Dynamic shapes, batch-1 only training | Export with dynamic axes, fix a max batch |
| Different result on the same image | Non-determinism, half precision | Fix seeds, validate in FP32, set torch.backends.cudnn.deterministic |
| Small objects missed | Input too small, anchor mismatch | Increase resolution, slice-and-merge (SAHI) |
| VLM ignores JSON | Weak prompt | Force tool use or response_format=json_schema |
| Label noise | Poor annotation quality | Measure inter-annotator agreement (IAA), use confident learning |
| Fails to generalize after training | Data leak, val set overlaps train | Hash-based or time-based splits |
| GPU utilization stuck at 30% | Data loader bottleneck | More workers, persistent workers, NVIDIA DALI |
| Training diverges | LR too high, gradient explosion | Longer warmup, gradient clipping, disable mixed precision for debugging |
The field aphorism: "If accuracy isn't moving, don't swap the model — go look at the data. Four times out of five, it's the data."
12. The Decision Tree — Decide in 30 Seconds
When a new vision problem lands on your desk, ask these in order.
- "Can a VLM solve this in one prompt?" — Hand-feed Claude/Gemini Vision ten photos. If you get 90% accuracy, that's your baseline.
- "Do you have labels?" — No, then VLM or zero-shot (CLIP, Grounding DINO, SAM 2).
- "Do you have 10k+ labels?" — Yes, train your own. No, pretrain plus fine-tune.
- "Does it run at the edge?" — Yes, CNN or a small YOLO. No, ViT is fair game.
- "30 fps or higher?" — Yes, YOLO plus TensorRT/Core ML. No, DETR-class is fine.
- "Do you need to explain the decision?" — Yes, VLM or attention visualization. No, a normal model.
- "Are errors expensive?" (medical, autonomous driving) — Yes, ensembles, calibration, human-in-the-loop.
Seven questions handle 80% of the first-model decision.
Epilogue — Checklist, Anti-Patterns, Coming Next
Pre-launch checklist for your first vision model
- Have you looked at label distribution? (Class imbalance must be known before training.)
- Are train and val splits free of time- or source-overlap?
- Do you have a baseline? (Simplest model, or human accuracy, or majority class.)
- Have you measured one-prompt VLM accuracy and cost?
- Have you compared training cost against inference cost?
- Do you know the memory and latency limits of your deployment target?
- Have you measured inter-annotator agreement (IAA)?
- Is there a hand-curated set of 100 corner cases as a separate eval?
- Monitoring — how do you detect production drift?
- Rollback — can you go back to the previous model when this one breaks?
Common anti-patterns
- "What's the latest SOTA?" as the first question — go look at your data first.
- Starting from a ViT with no pretraining — almost always fails below 10k labels.
- Tuning while peeking at the val set repeatedly — that's training data. Hold out a real test set.
- Shipping on mAP alone — also check per-class PR curves, small-object metrics, and inference latency.
- Trusting VLM output without post-processing — JSON-schema validation, fallbacks, caching.
- Forcing one model to handle every class — split frequent classes from the long tail.
- Treating augmentation as an afterthought — often augmentation matters more than model choice.
- Train-eval transform mismatch — the number-one debug time sink.
Coming next
- "Operating vision models — drift detection, A/B, canary, and active learning loops."
- "Auto-labeling with VLMs — a Claude/Gemini/Qwen-VL pipeline that automates 90% of annotation."
- "Edge vision inference — running the same model on Jetson, Coral, iPhone, and Android."
References
Architecture papers
- ViT — "An Image is Worth 16x16 Words" — https://arxiv.org/abs/2010.11929
- Swin Transformer v2 — https://arxiv.org/abs/2111.09883
- ConvNeXt v2 — https://arxiv.org/abs/2301.00808
- DINOv2 — https://arxiv.org/abs/2304.07193
- DETR — "End-to-End Object Detection with Transformers" — https://arxiv.org/abs/2005.12872
- RT-DETR — https://arxiv.org/abs/2304.08069
- SAM — https://arxiv.org/abs/2304.02643
- SAM 2 — https://arxiv.org/abs/2408.00714
- LLaVA — https://arxiv.org/abs/2304.08485
- Qwen2.5-VL — https://arxiv.org/abs/2502.13923
- InternVL — https://arxiv.org/abs/2312.14238
- Grounding DINO — https://arxiv.org/abs/2303.05499
- Florence-2 — https://arxiv.org/abs/2311.06242
- LoRA — https://arxiv.org/abs/2106.09685
- QLoRA — https://arxiv.org/abs/2305.14314
Tools and libraries
- timm (PyTorch Image Models) — https://github.com/huggingface/pytorch-image-models
- Hugging Face transformers — https://huggingface.co/docs/transformers
- Ultralytics YOLO — https://docs.ultralytics.com/
- OpenMMLab MMDetection — https://github.com/open-mmlab/mmdetection
- Segment Anything 2 — https://github.com/facebookresearch/sam2
- PEFT (LoRA/QLoRA) — https://github.com/huggingface/peft
- TRL (SFT) — https://github.com/huggingface/trl
Labeling tools
- Label Studio — https://labelstud.io/
- CVAT — https://www.cvat.ai/
- Roboflow — https://roboflow.com/
- V7 Darwin — https://www.v7labs.com/
- Encord — https://encord.com/
Datasets
- ImageNet — https://www.image-net.org/
- COCO — https://cocodataset.org/
- Open Images V7 — https://storage.googleapis.com/openimages/web/index.html
- LAION — https://laion.ai/
- LVIS — https://www.lvisdataset.org/
- ADE20K — https://groups.csail.mit.edu/vision/datasets/ADE20K/
- SA-1B — https://ai.meta.com/datasets/segment-anything/
Deployment / inference
- ONNX Runtime — https://onnxruntime.ai/
- NVIDIA TensorRT — https://developer.nvidia.com/tensorrt
- Apple Core ML Tools — https://github.com/apple/coremltools
- ai-edge-torch (PyTorch to TFLite) — https://github.com/google-ai-edge/ai-edge-torch
- OpenVINO — https://docs.openvino.ai/
VLM APIs
- Anthropic Claude Vision — https://docs.anthropic.com/en/docs/build-with-claude/vision
- Google Gemini Vision — https://ai.google.dev/gemini-api/docs/vision
- OpenAI Vision — https://platform.openai.com/docs/guides/vision