Vision-Language Models (VLMs) 2026 Deep Dive: CLIP, LLaVA, InternVL3, Qwen2.5-VL, GPT-4o, Gemini 2.5, Claude 4.7, DINOv2, SAM 2, and Florence-2


Intro — As of May 2026, VLMs catch up to text LLMs in pace

As recently as 2024 the VLM landscape was dominated by "GPT-4V wins, open models trail by miles." As of May 2026 that gap has effectively closed. **Qwen2.5-VL 72B, InternVL3 78B, LLaVA-NeXT-Interleave, Pixtral Large, Molmo 72B, and MiniCPM-V 3.0** sit within single-digit percentage points of GPT-4o, Claude 4.7 Vision, and Gemini 2.5 Pro Vision on MMMU, MathVista, and ChartQA. At the same time, **on-device VLMs** are actually shipping in Apple Intelligence Vision, Samsung Galaxy AI, and ASUS NPU laptops.

This is not a marketing matrix. It is a single document covering which VLMs go where in production, and how to train, evaluate, and serve them. We compare CLIP-family fundamentals, LLaVA's two-stage alignment, Qwen-VL's three-stage training, MMMU and MathVista evaluation, and vLLM/SGLang serving — with real APIs.

VLM landscape 2026 — five distinct branches

The big picture first. The May 2026 VLM market splits into five branches.

1. **CLIP family (contrastive)**: joint image-text embedding. CLIP, SigLIP, EVA-CLIP. The backbone for retrieval, ranking, filtering.

2. **Open generative VLMs**: LLaVA-NeXT, InternVL3, Qwen2.5-VL, Pixtral, Molmo, Idefics3, MiniCPM-V. The core of "look at an image, produce text."

3. **Closed frontier VLMs**: GPT-4o Vision, Claude 4.7 Vision, Gemini 2.5 Pro Vision. API only.

4. **Vision foundations (no text)**: DINOv2/v3, SAM 2, Florence-2. Self-supervised vision backbones + general-purpose segmentation/detection.

5. **Diffusion-based vision (generation)**: Stable Diffusion 3.5, FLUX.1, DALL-E 3. They generate images rather than understand them.

This post focuses on 1-4. Branch 5 (diffusion) deserves its own post. When people say "VLM" they usually mean 2-3 (generative), but in production pipelines branches 1 (CLIP) and 4 (DINO/SAM/Florence) still show up alongside as preprocessing, retrieval, and grounding components.

CLIP and its successors — contrastive learning as the starting line

The VLM story starts with OpenAI's CLIP (2021). It was trained on 400M (image, text) pairs with a **contrastive loss** so that a ViT image encoder and a text encoder embed into a shared space. The core idea: within a minibatch, push matching pairs to high cosine similarity and non-matching pairs to low similarity.

    import clip  # OpenAI's reference package: pip install git+https://github.com/openai/CLIP.git
    import torch
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-L/14", device=device)

    image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
    texts = clip.tokenize(["a photo of a cat", "a photo of a dog", "a photo of a car"]).to(device)

    with torch.no_grad():
        # Embeddings in the shared space, usable on their own for retrieval.
        image_features = model.encode_image(image)
        text_features = model.encode_text(texts)
        # Scaled cosine similarities between the image and each caption.
        logits_per_image, logits_per_text = model(image, texts)
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()

    print("Label probs:", probs)
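The contrastive objective described above can be sketched without any deep-learning framework. This is a minimal illustration, not OpenAI's training code: it assumes image `i` matches text `i` within the batch, uses a fixed temperature of 0.07 (CLIP learns this value during training), and the function names are made up for this example.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch where image i matches text i."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j] = cosine similarity of image i and text j, divided by the temperature.
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Image-to-text: each row should pick its own column.
    loss_i2t = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    # Text-to-image: each column should pick its own row.
    loss_t2i = sum(cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

aligned = clip_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = clip_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned < shuffled)  # True
```

The loss is near zero when matching pairs already point the same way and large when the pairing is swapped, which is the pressure that shapes the shared embedding space.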

As of May 2026 almost nobody uses original CLIP weights. The de facto standards are **SigLIP, SigLIP 2, EVA-CLIP, and MetaCLIP**.

- **SigLIP (Google, 2023)**: sigmoid loss instead of softmax. Trains well without huge batches and improves accuracy. arXiv:2303.15343.

- **SigLIP 2 (Google, 2025)**: better multilingual + local features. Major improvements for Korean and Japanese retrieval.

- **EVA-CLIP (BAAI)**: ViT-E/14, ViT-G/14 scale. Open SOTA embeddings.

- **MetaCLIP (Meta, 2024)**: published the data curation recipe. Consistent improvement over CLIP at matched model size.
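To make the softmax-versus-sigmoid distinction concrete, here is a small sketch of a SigLIP-style pairwise loss. It assumes the embeddings are already L2-normalized and uses `t=10` and `b=-10` as fixed values (in the paper these are learnable and only start near those values); the helper names are illustrative.

```python
import math

def log_sigmoid(x):
    # Stable log(1 / (1 + exp(-x))) for both signs of x.
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def siglip_loss(image_embs, text_embs, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: every (image, text) pair is its own binary problem."""
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(a * c for a, c in zip(image_embs[i], text_embs[j]))
            label = 1.0 if i == j else -1.0  # +1 for the matching pair, -1 otherwise
            total += -log_sigmoid(label * (t * dot + b))
    return total / n

aligned = siglip_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = siglip_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned < shuffled)  # True
```

Because no term normalizes over the whole batch, the loss does not depend on a batch-wide softmax, which is one reason it behaves well at smaller batch sizes.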

CLIP-family models remain the first pick for "image retrieval backbone for RAG," "dataset filtering," "zero-shot classification," and "video clip ranking" in 2026. Even with generative VLMs everywhere, this slot is not going away.

LLaVA — the de facto standard for visual instruction tuning

Open generative VLMs trace back to **LLaVA (Large Language and Vision Assistant)**. Since the first paper in April 2023 (arXiv:2304.08485), the line has evolved through LLaVA-1.5, LLaVA-NeXT, and LLaVA-OneVision, and LLaVA-NeXT-Interleave is the reference architecture as of May 2026.

LLaVA's core is two things.

1. **A simple projector (alignment) layer**: a small projector (a single linear layer in the original LLaVA, a two-layer MLP from LLaVA-1.5 onward) that maps CLIP/SigLIP vision-encoder tokens into the LLM's embedding space. The vision encoder stays frozen, and in stage 1 the LLM is frozen too, so only the projector is trained.

2. **Two-stage training**:

- **Stage 1 (feature alignment)**: train only the projector on image-caption pairs.

- **Stage 2 (visual instruction tuning)**: fine-tune projector + LLM on GPT-4-synthesized instruction data (LLaVA-Instruct).

This simplicity is why LLaVA became "the shortest path to add vision to my LLM." As of 2026, LLaVA-NeXT supports many LLM backbones: Vicuna, Mistral, Llama 3.1/3.3, Qwen 2.5, etc.
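The two-stage recipe boils down to a freeze schedule. The sketch below only encodes which parts receive gradients in each stage as described above; the function and key names are hypothetical, and later LLaVA variants may deviate from this schedule.

```python
def trainable_modules(stage: int) -> dict:
    """Which parts receive gradients in each LLaVA-style stage."""
    if stage == 1:  # feature alignment on image-caption pairs
        return {"vision_encoder": False, "projector": True, "llm": False}
    if stage == 2:  # visual instruction tuning
        return {"vision_encoder": False, "projector": True, "llm": True}
    raise ValueError(f"unknown stage: {stage}")

for stage in (1, 2):
    print(stage, trainable_modules(stage))
```

In PyTorch this kind of table typically maps onto calling `requires_grad_(flag)` on each submodule before building the optimizer.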

    import torch
    from PIL import Image
    from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

    processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
    model = LlavaNextForConditionalGeneration.from_pretrained(
        "llava-hf/llava-v1.6-mistral-7b-hf",
        torch_dtype=torch.float16,
        device_map="auto",
    )

    image = Image.open("chart.png")
    # Mistral-style instruction format; <image> marks where the image tokens go.
    prompt = "[INST] <image>\nWhat is the trend shown in this chart? [/INST]"

    # Keyword arguments avoid depending on the processor's positional order.
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256)
    print(processor.decode(output[0], skip_special_tokens=True))

InternVL3 — the current open VLM champion

Shanghai AI Lab's **InternVL3 (April 2025)** is the open VLM family with the highest MMMU scores at the time of writing. Sizes: 1B, 2B, 8B, 14B, 38B, 78B.

What sets InternVL3 apart:

- **InternViT-6B / InternViT-300M** are vision encoders trained in-house, not borrowed off-the-shelf. Dynamic resolution and tiling supported natively.

- **MLP projector + LLM (InternLM, Qwen)** combo. Similar to LLaVA but trained on far more data.

- **Multi-stage training**: pretraining → multimodal SFT → DPO (direct preference optimization) → optional RLHF.

- **Multilingual**: reasonable English, Chinese, Korean, and Japanese. Korean OCR is acceptable.

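InternVL3-78B comes within single-digit points of GPT-4o (2024-08), Claude 3.7 Vision, and Gemini 2.0 Pro on MMMU as of May 2026. Licensing: the language backbone (InternLM or Qwen, depending on the checkpoint) carries its own terms, so check both the InternVL license and the backbone license before commercial use.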

Qwen2.5-VL — Alibaba's three-stage training recipe

Alibaba's Qwen team released **Qwen2.5-VL (3B, 7B, 32B, 72B)** in January 2025, and it has become one of the two pillars of the open VLM market. The recipe is three stages.

1. **Stage 1 — vision encoder pretraining**: train an in-house ViT on massive image-text pairs.

2. **Stage 2 — multimodal pretraining**: train ViT + projector + LLM jointly on large interleaved image-text data.

3. **Stage 3 — instruction tuning**: high-quality SFT + DPO to lock in instruction following.

Qwen2.5-VL treats **video input** and **grounding** as first-class citizens. A request like "output the coordinates of the red car in this image as (x1,y1,x2,y2)" works natively. The 32B/72B variants are also tuned for **agent** workflows — looking at UI screenshots and emitting the next action, which slots straight into Anthropic Computer Use-style tasks.
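When you prompt for coordinates in the `(x1,y1,x2,y2)` form mentioned above, the reply is still free text and needs validating before use. The parser below is a sketch under that assumption: it expects integer pixel coordinates in parentheses and drops boxes that fall outside the image. `parse_boxes` is a hypothetical helper, not part of any Qwen tooling, and the model's default output format may differ.

```python
import re

BOX_RE = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\)")

def parse_boxes(text, width, height):
    """Extract (x1, y1, x2, y2) tuples and keep only boxes that fit the image."""
    boxes = []
    for match in BOX_RE.finditer(text):
        x1, y1, x2, y2 = map(int, match.groups())
        # Reject inverted, empty, or out-of-bounds boxes instead of trusting the model.
        if 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height:
            boxes.append((x1, y1, x2, y2))
    return boxes

print(parse_boxes("The red car is at (120, 340, 480, 610).", 1280, 720))
# [(120, 340, 480, 610)]
```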

Licensing varies by checkpoint: some sizes ship under Apache 2.0 and others under Qwen-specific licenses, and the split does not simply follow model size. Check the model card for each checkpoint before commercial deployment.

Pixtral · Molmo · Idefics3 · MiniCPM-V — the other key open VLMs

Beyond InternVL3 and Qwen2.5-VL, the following models hold their own as of May 2026.

- **Pixtral 12B / Pixtral Large (Mistral, 2024-2025)**: in-house vision encoder + Mistral Large backbone. Apache 2.0 (12B) / MRL (Large). EU data and language friendly.

- **Molmo (Allen AI, 2024)**: trained on PixMo. Specialized in **pointing** — precise coordinate output on images. 1B/7B/72B, Apache 2.0.

- **Idefics3 (Hugging Face, 2024)**: fully open data and training code. Reproducibility is the win.

- **MiniCPM-V 3.0 (OpenBMB, 2025)**: claims GPT-4V-level results below 8B. First pick for edge and on-device.

- **Phi-3.5-Vision / Phi-4-Multimodal (Microsoft)**: small VLMs around 4B that actually run on laptops.

- **CogVLM2 / GLM-4V (Zhipu AI)**: strong in the Chinese market and reasonable in Korean.

Selection guide: **clean data licensing** first → Idefics3 or Molmo. **OCR / documents** first → InternVL3 or Qwen2.5-VL. **Agent / UI** first → Qwen2.5-VL 32B+. **On-device** first → MiniCPM-V or Phi-3.5-Vision.

Closed frontier VLMs — GPT-4o · Claude 4.7 · Gemini 2.5

Closed models still have an edge in some areas (chart accuracy, document extraction, multi-image reasoning, safety).

- **GPT-4o Vision (OpenAI)**: pass an `image_url` or base64 image in `chat.completions.create`. `gpt-4o` and `gpt-4o-mini` cover the cost/latency trade-off.

- **Claude 4.7 Vision (Anthropic)**: an `image` content block inside `messages.create`. Its 1M context can chew through dozens of PDF pages at once. Strong at charts, tables, and diagrams.

- **Gemini 2.5 Pro / Flash Vision (Google)**: native video input, long context, accepts YouTube URLs directly.

OpenAI Vision API call:

    import base64

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("invoice.png", "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    resp = client.chat.completions.create(
        model="gpt-4o",  # pin a dated snapshot in production
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract invoice number, date, total amount as JSON."},
                # Images are passed inline as a base64 data URL.
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )

    print(resp.choices[0].message.content)
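For comparison, the Anthropic `image` content block mentioned above has a different shape from OpenAI's `image_url`. The helper below only builds the message dictionary and makes no network call; `anthropic_image_message` is an illustrative name, and the model ID and `max_tokens` you pass to `messages.create` are left out because they depend on your account and are not specified here.

```python
import base64

def anthropic_image_message(image_bytes: bytes, media_type: str, question: str) -> dict:
    """Build one user message with an image block followed by a text block."""
    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,  # e.g. "image/png" or "image/jpeg"
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                },
            },
            {"type": "text", "text": question},
        ],
    }

message = anthropic_image_message(b"abc", "image/png", "Extract the invoice total.")
print(message["content"][0]["source"]["media_type"])  # image/png
```

The resulting dictionary would go into the `messages=[...]` list of a `messages.create` call.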

The decisive closed-model advantages are stability, safety filters, and multi-image context consistency. Open models have basically caught up on single-image tasks, but for jobs like "extract consistent fields from a 30-page PDF" or "diff several images" the closed frontier still feels a little more reliable.

DINOv2 · DINOv3 — vision backbones trained without text

Where CLIP needs (image, text) pairs, **DINOv2 (Meta, 2023)** is a self-supervised ViT backbone trained without any text at all. arXiv:2304.07193. It produces representations that transfer beautifully to detection, segmentation, and depth estimation — even without fine-tuning.

**DINOv3 (Meta, August 2025)** scales the recipe further with a larger curated image dataset (about 1.7B images) and bigger models. As of May 2026 the DINOv2 line is still the most widely used in production.

- ViT-S/14, ViT-B/14, ViT-L/14, ViT-g/14 lineup.

- Frozen features are already strong for segmentation, detection, and classification.

- DINOv2 with a linear classifier on frozen features reaches 84%+ on ImageNet-1k, with no backbone fine-tuning.

In industry, "vision tasks that do not need text alignment" (anomaly detection, industrial inspection, medical imaging pretraining) often work better starting from DINOv2 than CLIP.

SAM 2 — general-purpose segmentation for images and video

**Segment Anything Model 2 (Meta, 2024)** segments objects in both images and video, with **tracking across frames**. arXiv:2408.00714. Prompt one frame with points, boxes, or masks, and the mask propagates through the rest of the video.

As of May 2026 SAM 2 has become the standard for:

- **Video annotation automation**: labeling companies use SAM 2 in-the-loop and report 70%+ cost reduction.

- **Robotics / autonomous driving perception assist**: specify the object to track once, get automatic segmentation across the sequence.

- **Grounding backend for VLMs**: when a VLM says "the red car in this photo," SAM 2 produces the precise mask.

SAM 2 itself does not accept text input. Text-to-object matching is bolted on with open-vocabulary detectors like GroundingDINO or OWL-ViT.

Florence-2 — Microsoft's multitask vision foundation

**Florence-2 (Microsoft, 2024)** handles captioning, detection, segmentation, OCR, and VQA in a single seq2seq vision foundation. arXiv:2311.06242. Only two sizes — 0.23B (base) and 0.77B (large) — yet it competes with single-task SOTA models of similar size.

The trick is **task prompts**: special tokens like `<CAPTION>`, `<DETAILED_CAPTION>`, `<OD>`, `<DENSE_REGION_CAPTION>`, `<OCR>` switch the task. A solid pick when you need a "vision Swiss Army knife" at the edge or on device.

VLM training datasets — from LAION to ShareGPT4V

Dataset quality determines VLM quality. Core datasets as of May 2026:

- **LAION-5B / LAION-COCO / LAION-Aesthetics**: 5B-pair scale. Parts have been pulled for copyright and safety reasons, yet it remains the largest open corpus. The base for OpenCLIP and many other open CLIP-style models (the original CLIP and SigLIP were trained on non-public data).

- **DataComp / DataComp-1B**: a benchmark that pits curation strategies against each other, plus a curated 1B-pair release.

- **COYO-700M (Kakao Brain)**: open release from Kakao Brain. Korean-friendly.

- **ShareGPT4V**: high-quality captions and instructions generated with GPT-4V. Decisive for LLaVA-1.5/NeXT.

- **LLaVA-Instruct-150K / 665K**: the de facto standard for visual instruction tuning data.

- **The Cauldron (Hugging Face)**: a bundle of 50 datasets used to train Idefics2/3.

- **OBELICS**: large-scale interleaved image-text documents extracted from the web.

- **AI2D, ScienceQA, ChartQA, DocVQA, TextVQA**: domain-specific sets used for both evaluation and training.

License hygiene matters. With EU AI Act obligations phasing in through 2026, disclosing training data is increasingly becoming a requirement. That makes "fully open" models like Idefics3, Molmo, and OpenFlamingo more valuable in practice.

VLM evaluation — MMMU · MathVista · MMVet · ChartQA · DocVQA · RealWorldQA

VLM evaluation is more fragmented than LLM evaluation. Core benchmarks:

- **MMMU (Massive Multi-discipline Multimodal Understanding)**: college-level exam questions spanning six disciplines and 30 subjects. As of May 2026 it serves as a VLM "general IQ." eval.ai/web/challenges/challenge-page/2179.

- **MMMU-Pro**: harder variant that filters out questions answerable from text alone, adds more answer options, and includes a vision-only input setting. Requires real visual reasoning.

- **MathVista**: mathematical visual reasoning. Charts, geometry, plots.

- **MMVet / MMBench / SEED-Bench**: general evaluation with per-category strengths.

- **ChartQA / DocVQA / InfographicVQA**: chart, document, and infographic understanding.

- **TextVQA / ST-VQA**: read text in images.

- **RealWorldQA (xAI)**: spatial reasoning on real-world photos.

- **Video-MME / MVBench**: video VLM evaluation.

- **CV-Bench**: classic vision tasks (classification, detection, depth) re-framed as VLM tasks.

As of May 2026 the top of the MMMU leaderboard reads: GPT-4o (2024-11+), Gemini 2.5 Pro, Claude 4.7 Vision, InternVL3-78B, Qwen2.5-VL-72B, Molmo-72B, Pixtral Large. The open-vs-closed gap is now 5-8 percentage points.

OCR-centric VLMs — GOT-OCR 2.0 · Nougat · Donut

Document, table, and formula OCR is still a weak spot for general VLMs. As of May 2026 OCR-specific VLMs hold this niche.

- **GOT-OCR 2.0 (StepFun, 2024)**: arXiv:2409.01704. A 580M-parameter model that claims GPT-4V-level OCR. General text, formulas, sheet music, chemistry, and charts in one model.

- **Nougat (Meta, 2023)**: arXiv:2308.13418. Converts academic PDFs to markdown. Strong on equations.

- **Donut (Naver Clova, 2022)**: arXiv:2111.15664. OCR-free document understanding. Strong on Korean receipts and card statements.

- **Surya (Vik Paruchuri, OSS)**: 90-language OCR. Practitioner-friendly tooling, but check the code and model-weight licenses before commercial use.

- **Mistral OCR (2025)**: Mistral launched a dedicated OCR API. Top-tier extraction accuracy.

General VLMs (InternVL3, Qwen2.5-VL) have gotten much better at OCR, but for **forms, tables, multi-column layouts, and equations** the dedicated models still win on both accuracy and cost.

Video VLMs — Video-LLaVA · VideoLLaMA · InternVideo · Qwen2-VL-Video

Moving from images to video shrinks the model pool considerably. Core video VLMs as of May 2026:

- **Video-LLaVA (PKU, 2023)**: arXiv:2311.10122. Unified image and video encoder + LLM.

- **VideoLLaMA 2/3 (DAMO)**: extends to audio for full multimodal.

- **InternVideo 2 (Shanghai AI Lab)**: video foundation. Strong for action recognition and retrieval.

- **Qwen2.5-VL Video**: a single model unifies image and video. Samples frames at a dynamic rate and encodes absolute time in its position embeddings.

- **LongVU (Meta)**: specialized for long-video compression.

- **MovieChat / VideoChat / Video-ChatGPT**: conversational video assistants.

The fundamental video VLM problem is **token explosion**. 30 fps times 60 seconds = 1800 frames, and if each frame costs 256-1024 tokens, the LLM context blows up immediately. Every video VLM is really about how it does **frame sampling, token compression, and temporal pooling**.
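The arithmetic is worth running for your own clips before choosing a sampling strategy. The sketch below assumes uniform frame sampling and a fixed per-frame token cost, which is a simplification since real models also compress or pool tokens; both function names are made up for this example.

```python
def video_tokens(duration_s, sample_fps, tokens_per_frame):
    """Total visual tokens for a clip sampled uniformly at sample_fps."""
    frames = int(duration_s * sample_fps)
    return frames * tokens_per_frame

def max_sample_fps(duration_s, tokens_per_frame, token_budget):
    """Highest uniform sampling rate that keeps the clip within token_budget."""
    return token_budget / (duration_s * tokens_per_frame)

print(video_tokens(60, 30, 256))  # 460800: every frame of a 60 s clip, cheapest frames
print(video_tokens(60, 1, 256))   # 15360: the same clip at 1 frame per second
print(round(max_sample_fps(60, 256, 32768), 2))  # 2.13 fps fits a 32768-token window
```

Even at the low end of 256 tokens per frame, keeping every frame of a one-minute clip costs far more than a typical context window, while sampling around 1-2 fps fits comfortably.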

Efficient inference — how vLLM · SGLang · TensorRT-LLM handle VLMs

As of May 2026 the VLM serving stack is clear.

- **vLLM 0.7+**: PagedAttention plus image-token caching. First-class support for LLaVA, Qwen2.5-VL, InternVL2/3, Pixtral, Idefics3, MiniCPM-V, and friends.

- **SGLang**: RadixAttention plus structured decoding. Strong on multi-image and interleaved inputs.

- **TensorRT-LLM (NVIDIA)**: lowest latency on H100/H200/B200. VLMs require an ONNX export + TRT engine build.

- **MLC-LLM / llama.cpp**: on-device. Phi-3.5-Vision and MiniCPM-V run on iPhone, Android, Mac mini.

Typical pattern: serve Qwen2.5-VL behind an OpenAI-compatible API with vLLM:

    pip install "vllm>=0.7.0"

    vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
      --max-model-len 32768 \
      --gpu-memory-utilization 0.92 \
      --limit-mm-per-prompt image=4 \
      --tensor-parallel-size 1 \
      --host 0.0.0.0 --port 8000

Clients use the OpenAI SDK as-is — just pass the image as a base64 `image_url`.
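As a rough illustration of what such a client sends, the sketch below builds the request body with only the standard library. It assumes the server from the command above is listening on `localhost:8000` and exposes the usual `/v1/chat/completions` route; `build_chat_payload` is a hypothetical helper and the image bytes are a placeholder, not a real PNG.

```python
import base64
import json

def build_chat_payload(image_bytes: bytes, question: str,
                       model: str = "Qwen/Qwen2.5-VL-7B-Instruct") -> dict:
    """Build an OpenAI-style chat request with one inline base64 image."""
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
        "max_tokens": 256,
    }

# Placeholder bytes for illustration; read a real image file in practice.
payload = build_chat_payload(b"placeholder-bytes", "Describe this image.")
body = json.dumps(payload)  # POST this to http://localhost:8000/v1/chat/completions
print(len(payload["messages"][0]["content"]))  # 2
```

The `messages` list has the same shape you would pass to the OpenAI SDK after pointing its `base_url` at the vLLM server.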

Production deployment — token budgets · batch preprocessing · caching

When you put VLMs in real services the critical levers differ from text LLMs.

1. **Image token cost**: an image consumes 256-3000 tokens. Control via resolution and tiling — Qwen2.5-VL's `min_pixels`/`max_pixels`, InternVL3's `max_num_tiles`, OpenAI's `detail: low/high/auto`.

2. **Batch image preprocessing**: PIL is single-threaded and bottlenecks. Use `Pillow-SIMD` plus multiprocessing, or GPU decoding (NVIDIA DALI).

3. **Image caching**: cache embeddings and tokens by SHA256 if the same image repeats. Redis or an object store.

4. **Content safety**: NSFW classifiers and OCR-based PII filters in front. A CLIP-based safety classifier is nearly free.

5. **Pre-estimate the token budget**: before responding, compute input image tokens up front and expose cost to the user.

6. **PDFs and multi-image**: split per page and process in parallel. Claude 4.7 Vision handles PDF natively; others should convert pages to PNG via PyMuPDF.
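The image-caching lever from the list above can be sketched with a content-addressed key. This is a minimal in-memory stand-in, assuming the result depends on the image bytes, the model name, and the prompt; `ImageResultCache` is a hypothetical class, and a production version would swap the dictionary for Redis or an object store.

```python
import hashlib

class ImageResultCache:
    """In-memory stand-in for Redis or an object store, keyed by content hash."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    @staticmethod
    def key(image_bytes: bytes, model: str, prompt: str) -> str:
        # Hash the raw bytes so renamed or re-uploaded copies of an image still hit.
        h = hashlib.sha256()
        h.update(image_bytes)
        h.update(b"\x00" + model.encode("utf-8") + b"\x00" + prompt.encode("utf-8"))
        return h.hexdigest()

    def get_or_compute(self, image_bytes, model, prompt, compute):
        k = self.key(image_bytes, model, prompt)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        result = compute(image_bytes)  # the expensive VLM call goes here
        self._store[k] = result
        return result

cache = ImageResultCache()
describe = lambda data: f"{len(data)} bytes described"
cache.get_or_compute(b"img", "qwen2.5-vl", "caption", describe)
cache.get_or_compute(b"img", "qwen2.5-vl", "caption", describe)
print(cache.hits)  # 1
```

Including the model and prompt in the key avoids serving a caption when the caller asked for OCR on the same image.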

Rough token budget estimate (Qwen2.5-VL):

    def estimate_image_tokens(width: int, height: int,
                              min_pixels: int = 256 * 28 * 28,
                              max_pixels: int = 1280 * 28 * 28) -> int:
        # Clamp the pixel count to the processor's configured range.
        pixels = max(min_pixels, min(max_pixels, width * height))
        # Qwen2.5-VL uses 14x14 patches merged 2x2, so one token covers a 28x28 pixel block.
        # The real processor also rounds each side to a multiple of 28, so this is an upper bound.
        return pixels // (28 * 28)

    print(estimate_image_tokens(1920, 1080))  # 1280: a 1080p frame is clamped to max_pixels

VLM fine-tuning — LoRA · QLoRA · SwiftVLM

Two practical approaches to adapt an open VLM to your domain:

- **LoRA / QLoRA adapters**: LoRA on the LLM backbone's q_proj/k_proj/v_proj/o_proj, full training on the projector, vision encoder frozen by default.

- **Full fine-tuning**: only if you have lots of data and GPUs. Unfreezing the vision encoder improves caption quality steeply.

Recommended tools: **LLaMA-Factory, ms-swift (SwiftVLM), Unsloth Vision, axolotl**. As of May 2026 **ms-swift** has the broadest coverage of Qwen, InternVL, LLaVA, and Idefics.

The de facto training data format is ShareGPT / LLaVA-style JSON. A sample looks like `{"image": "path/to.jpg", "conversations": [...]}`, which is compatible with visual instruction tuning corpora.
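A concrete sample makes the format easier to picture. The sketch below follows the LLaVA convention of `from`/`value` turns with an `<image>` placeholder in the first human turn; the `id` field, file path, and question text are invented for illustration, and individual training tools may expect slightly different keys.

```python
import json

def make_sample(sample_id, image_path, question, answer):
    """Build one LLaVA-style training record with a single question-answer turn."""
    return {
        "id": sample_id,
        "image": image_path,
        "conversations": [
            # <image> marks where the image tokens are spliced into the prompt.
            {"from": "human", "value": "<image>\n" + question},
            {"from": "gpt", "value": answer},
        ],
    }

samples = [make_sample("0001", "images/receipt_0001.jpg",
                       "What is the total amount?", "42,000 KRW")]
print(json.dumps(samples, ensure_ascii=False, indent=2))
```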

Grounding and region-level understanding — VLMs that output coordinates

One of the biggest 2026 shifts is **grounding** going mainstream. Instead of just saying "this is a car," the model outputs "a car at (x1, y1, x2, y2)" precisely.

Key models:

- **Qwen2.5-VL**: bbox, points, polygons as tokens. Strong for UI automation.

- **Molmo**: pointing specialist. Precise coordinate output on screen.

- **CogVLM2-Grounding**: detection- and segmentation-friendly tokens.

- **Florence-2**: switch tasks via task prompts.

- **Kosmos-2 (Microsoft)**: early standardization of interleaved text + bounding box tokens.

This unlocks **agent workflows**. When a VLM directly emits coordinates for "click the Save button in this screenshot," you can click without an additional detection model. Anthropic's Claude Computer Use and OpenAI Operator both use this pattern.
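Turning an emitted box into a click usually needs one extra step: the model often sees a downscaled screenshot, so coordinates have to be mapped back to real screen pixels. The sketch below assumes the box is in pixels of the image the model received; `click_point` is a hypothetical helper, not part of any agent framework.

```python
def click_point(box, model_size, screen_size):
    """Center of a box, rescaled from the model's input image to real screen pixels."""
    x1, y1, x2, y2 = box
    sx = screen_size[0] / model_size[0]
    sy = screen_size[1] / model_size[1]
    return (round((x1 + x2) / 2 * sx), round((y1 + y2) / 2 * sy))

# Box predicted on a 1280x720 screenshot, clicked on a 2560x1440 display.
print(click_point((100, 50, 200, 90), (1280, 720), (2560, 1440)))  # (300, 140)
```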

Korea's VLM scene — HyperCLOVA X Vision · LG EXAONE Vision · NAVER Cloud

Korea has produced its own VLMs.

- **HyperCLOVA X Vision (NAVER)**: specialized for Korean documents. Strongest in the Korean domain for receipts, ID cards, and chart extraction. Available via NAVER Cloud API.

- **EXAONE Vision (LG AI Research)**: multimodal extension of the EXAONE 3.5/4.0 lineup. Strong in industrial and scientific domains.

- **HCX-DASH (NAVER)**: smaller multimodal. Strong on Korean OCR and VQA.

- **Kanana / Kanana-V (Kakao)**: Kakao's own LLM with a vision extension.

- **KoLLaVA, KORani, MAUM Vision**: Korean VLMs from academia and SMBs.

- **COYO-700M (Kakao Brain)**: dataset contribution.

- **Upstage Solar Vision**: vision extension of Solar Pro. Strong for document and table extraction in both English and Korean.

Korean OCR and document understanding still favor domestic models. For general multimodal reasoning, the Korean performance of InternVL3 and Qwen2.5-VL is reasonable enough that "open model + Korean SFT" is also common.

Japan's VLM scene — Stockmark · Sakana AI · ABEJA · Preferred Networks

Japan has a solid VLM ecosystem of its own.

- **Stockmark-VL / Stockmark-100B-VL**: specialized in Japanese business documents and news.

- **Sakana AI EvoVLM-JP**: built an efficient Japanese VLM via evolutionary model merging. arXiv:2403.13187.

- **ABEJA LUCAS Vision**: Japanese industrial domains.

- **Preferred Networks PLaMo-Vision**: vision extension of the PLaMo lineup. Strong in medical and robotics.

- **NEC cotomi Vision**: Japanese enterprise document processing.

- **CyberAgent CALM Vision**: advertising and media applications.

- **LINE / Yahoo LY Corporation Vision**: in-house search and content moderation.

Japanese OCR, documents, and tables favor domestic models. As in Korea, the standard playbook is "global open model + domestic language fine-tuning + domestic domain data."

Combination patterns — how real production stacks look

Seven combinations seen in real production stacks as of May 2026:

1. **E-commerce search**: SigLIP 2 + ChromaDB/Qdrant + GPT-4o re-ranking. The standard for image similarity search.

2. **Financial document extraction**: Claude 4.7 Vision (PDF native) + custom validation rules + Surya OCR fallback.

3. **E-commerce listing**: self-hosted InternVL3-38B + DINOv2 embeddings for duplicate-product detection.

4. **Content moderation**: SigLIP safety classifier + InternVL3 or Qwen2.5-VL for precise calls.

5. **Customer support image triage**: on-prem MiniCPM-V 3.0 + GPT-4o as fallback.

6. **Agent (computer use)**: Qwen2.5-VL-32B (or Claude 4.7) + SAM 2 + a custom action model.

7. **Medical / industrial inspection**: DINOv2 frozen backbone + a domain head. The standard for areas that do not need text alignment.

Routing multiple VLMs via LiteLLM, Portkey, or OpenRouter — and falling back to expensive closed models only for hard cases — has become a standard pattern.
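The fallback pattern can be sketched independently of any particular gateway. This example assumes each model is wrapped in a callable returning `(answer, confidence)`; how you derive a confidence score (log-probabilities, a validator, a schema check) is up to you, and `route` is an illustrative name rather than a LiteLLM, Portkey, or OpenRouter API.

```python
def route(image, question, cheap_model, strong_model, min_confidence=0.8):
    """Try the cheap model first; escalate only when its confidence is low."""
    answer, confidence = cheap_model(image, question)
    if confidence >= min_confidence:
        return answer, "cheap"
    answer, _ = strong_model(image, question)
    return answer, "strong"

# Stand-in callables for a self-hosted open model and a closed API.
open_vlm = lambda image, question: ("unclear stamp", 0.35)
closed_vlm = lambda image, question: ("customs stamp, dated 2026-03-02", 0.97)
print(route(b"...", "What does the stamp say?", open_vlm, closed_vlm))
# ('customs stamp, dated 2026-03-02', 'strong')
```

Logging which branch served each request gives you the escalation rate, which is the number that drives the cost of this setup.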

Safety · governance · EU AI Act impact

VLM risk is more fragmented than LLM risk. Key issues as of May 2026:

- **PII exposure**: image OCR automatically reads IDs, credit cards, passports. Mask PII at ingestion.

- **Face recognition**: the EU AI Act prohibits most real-time remote biometric identification in publicly accessible spaces, with narrow law-enforcement exceptions. Check local rules.

- **Copyright-tainted training data**: post-LAION, disclosing training-data provenance is becoming mandatory. Fully open models (Idefics3, Molmo, OpenFlamingo) gain value here.

- **NSFW / violence**: safety classifiers on both input and output.

- **Deepfake detection**: a separate classifier (WeVerify, Hive, Reality Defender) to detect generated images.

- **Healthcare use**: FDA, PMDA, and MFDS regulate medical AI specifically. VLMs often slot in as decision-support tools, but check case by case.

Adoption roadmap — from zero to production

A 6-week roadmap for teams adopting VLMs for the first time:

- **Week 1 — define the use case**: single-image classification? Document extraction? Agent action? RAG? Collect a 200-500 image eval set.

- **Week 2 — closed-model baseline**: evaluate with GPT-4o, Claude 4.7, Gemini 2.5. Measure cost, latency, accuracy.

- **Week 3 — open-model evaluation**: run InternVL3, Qwen2.5-VL, MiniCPM-V on the same eval set. Compare self-hosting cost via vLLM.

- **Week 4: domain adaptation**: SFT (LoRA) on 1k-10k samples from your domain. If performance closes the gap with the closed-model baseline, decide on self-hosting.

- **Week 5 — infra**: vLLM/SGLang + monitoring (W&B Weave, Langfuse, Arize Phoenix) + caching (Redis) + safety filters.

- **Week 6 — gradual rollout**: canary 5% → 25% → 100%. Monitor input-image distribution drift.
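For the gradual rollout step, a deterministic hash bucket is one simple way to keep each user on the same side of the canary as the percentage grows. This is a sketch, assuming a stable string user ID is available; `in_canary` is a hypothetical helper.

```python
import hashlib

def in_canary(user_id: str, percent: int) -> bool:
    """Place a user in a stable 0-99 bucket; buckets below `percent` get the new model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent

users = [f"user-{i}" for i in range(1000)]
for percent in (5, 25, 100):
    share = sum(in_canary(u, percent) for u in users) / len(users)
    print(percent, share)
```

Because the bucket depends only on the user ID, anyone included at 5% stays included at 25% and 100%, which keeps before/after comparisons clean.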

Common traps: picking a model solely on MMMU score; admitting traffic without safety filters; sending an entire PDF at once and blowing up tokens; calling on the same image repeatedly with no cache.

Closing — As of May 2026, VLMs are baseline infrastructure

In 2024 the answer was "use GPT-4V." In May 2026 the answer splits.

- **Single-image reasoning**: open models are enough. InternVL3 / Qwen2.5-VL as defaults.

- **PDF · multi-image · consistency**: Claude 4.7 Vision still leads.

- **OCR · document extraction**: dedicated models (GOT-OCR 2.0, Mistral OCR, Surya) are more accurate and cheaper.

- **Agent · UI automation**: Qwen2.5-VL 32B+ or Claude Computer Use.

- **On-device**: MiniCPM-V, Phi-3.5-Vision, Apple Intelligence Vision.

- **Vision backbone (no text)**: DINOv2/v3. CLIP only for retrieval.

VLMs have moved from "new tech that needs custom integration" to "baseline infrastructure, just call them like a text LLM." The differentiation over the next 12 months will not come from the models themselves but from **data curation · eval sets · domain SFT · safety · cost control**.

References

- CLIP — Learning Transferable Visual Models From Natural Language Supervision: arxiv.org/abs/2103.00020

- SigLIP — Sigmoid Loss for Language Image Pre-Training: arxiv.org/abs/2303.15343

- LLaVA — Visual Instruction Tuning: arxiv.org/abs/2304.08485

- LLaVA-1.5 — Improved Baselines with Visual Instruction Tuning: arxiv.org/abs/2310.03744

- Qwen-VL: arxiv.org/abs/2308.12966

- Qwen2-VL: arxiv.org/abs/2409.12191

- InternVL: arxiv.org/abs/2312.14238

- DINOv2: arxiv.org/abs/2304.07193

- Segment Anything: arxiv.org/abs/2304.02643

- SAM 2: arxiv.org/abs/2408.00714

- Florence-2: arxiv.org/abs/2311.06242

- GOT-OCR 2.0: arxiv.org/abs/2409.01704

- Nougat: arxiv.org/abs/2308.13418

- Donut: arxiv.org/abs/2111.15664

- Video-LLaVA: arxiv.org/abs/2311.10122

- Kosmos-2: arxiv.org/abs/2306.14824

- Sakana AI Evolutionary Optimization: arxiv.org/abs/2403.13187

- LLaVA GitHub: github.com/haotian-liu/LLaVA

- InternVL GitHub: github.com/OpenGVLab/InternVL

- Qwen2.5-VL HuggingFace: huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5

- vLLM Multimodal Docs: docs.vllm.ai/en/latest/models/supported_models.html

- SGLang: github.com/sgl-project/sglang

- MMMU Leaderboard: mmmu-benchmark.github.io

- MathVista: mathvista.github.io
