Training Vision LLMs — How to Teach Input and Output

Introduction: Training Is Defining Inputs and Outputs
- What You Have Before Training Starts
Core Principles: A Staged Training Pipeline
Deep Dive 1: Freezing and Unfreezing the Vision Encoder
- Intuition of the Freezing Strategy
- Learning Rate and Stability
Deep Dive 2: Data — What You Show
Deep Dive 3: Input Format — How You Show It
- Chat Template and Image Placeholder
- Diverse Output Shapes
Deep Dive 4: Loss — What You Learn Against
- Loss Masking in Pseudocode
- Full Training Step in Pseudocode
Deep Dive 5: Teaching Grounding and Coordinates as Text
Deep Dive 6: Data Mix and Curriculum
- Interference Between Abilities
Deep Dive 7: Efficient Fine-tuning and Parameter Strategy
Deep Dive 7.5: Chat Template and Special Tokens
Deep Dive 7.6: Training Stability and Common Failures
The Alignment Stage: RLHF and DPO Briefly
Comparison Table: Training Stages at a Glance
Deep Dive 8: Evaluation and Training Monitoring
Deep Dive 9: Multimodal Batching and Training Efficiency
Deep Dive 10: Distributed Training and Memory
Deep Dive 11: Synthetic Data and Data Augmentation
Data Curation and Pitfalls
Closing
References

Introduction: Training Is Defining Inputs and Outputs

Training a vision-language model (VLM) is, in the end, the act of defining what input you give the model and what output you expect. With the same architecture, completely different abilities emerge depending on what you show as input and what you treat as the target. Train only on image captions and it describes well but cannot follow instructions; train on coordinates and it points at objects; train on documents and it reads tables and forms.

This post unpacks how a VLM is trained from the angle of input and output. It covers the staged training pipeline, vision encoder freezing strategy, data types, input format and output shapes, loss computation, and the alignment stage. The architecture itself is covered in a separate post, so here we focus on how to teach that structure.

What You Have Before Training Starts

You do not learn everything from scratch. Usually you start from two well-trained parts. One is a vision encoder pretrained on large-scale images, the other an LLM pretrained on large-scale text. Each is strong in its own domain, but they do not know how to talk to each other.

starting point
  vision encoder (pretrained)  sees images well but not aligned with language
  LLM (pretrained)             handles language well but has never seen an image
  projector (random init)      knows nothing yet

goal
  make the three parts cooperate to see an image and answer in language

So the first task of VLM training is to lay a bridge (the projector) connecting the two strong parts and align so information flows over that bridge. Remember this starting condition and it becomes natural why training starts from the alignment stage and why the projector is trained first.

Core Principles: A Staged Training Pipeline

Most VLMs are not trained in one shot; they build ability gradually in stages. The typical flow has three stages.

Stage 1: vision-language alignment pretraining
   goal: align visual tokens into the LLM language space
   data: large-scale image-caption pairs
   trained: mostly the projector (vision encoder/LLM tend to stay frozen)

Stage 2: multitask pretraining
   goal: broadly acquire visual abilities (caption, VQA, OCR, grounding)
   data: mixed multitask corpus
   trained: projector + LLM (+ optionally unfreeze part of vision encoder)

Stage 3: instruction fine-tuning
   goal: instruction following, dialogue, format adherence
   data: high-quality instruction-response (with images)
   trained: mostly LLM (+ projector)

Stage 1 is the alignment stage that puts vision and language into the same semantic space. Here you only need to make the model able to roughly say what an image is. Stage 2 broadens the range of abilities, and stage 3 refines the model to act on human instructions. Later stages tend to use less data of higher quality.

Deep Dive 1: Freezing and Unfreezing the Vision Encoder

One of the most important decisions in training is which component to train and when. Whether to freeze or unfreeze the vision encoder is especially central.

Intuition of the Freezing Strategy

Freeze the vision encoder early: a pretrained vision encoder already emits good visual features. Shaking it during early alignment easily breaks it, so usually only the projector is trained.
Gradually unfreeze later: in the ability-boosting stage, unfreeze part or all of the vision encoder to fine-tune it to a domain (documents, charts). Use a low learning rate so existing representations do not collapse.
The LLM too, in stages: the LLM backbone is also handled by freezing or using a low learning rate during early alignment, then training in earnest during the instruction stage.

        Stage 1       Stage 2          Stage 3
vision-enc  freeze   partial/full unfreeze   freeze or low lr
projector   train    train               train
LLM         freeze   train (low lr)      train

This table is a common pattern, not an absolute rule. The combination changes with model and data scale. The core intuition is: do not casually shake parts that work well, and start by training the part that needs alignment (the projector).

Learning Rate and Stability

Different components often use different learning rates. A small learning rate goes to the well-pretrained vision encoder and LLM, while a relatively larger one goes to the projector that learns from scratch. This preserves good representations while quickly learning only the alignment.

Deep Dive 2: Data — What You Show

A VLM's ability is determined by its training data. The main data types are as follows.

Image-caption: an image and its description. The workhorse of alignment pretraining. Large-scale web-crawled data is common.
Document/OCR: images containing text and that text. Builds the ability to read small text, tables, receipts, and forms.
VQA (visual question answering): questions about an image and their answers. Trains reasoning and instruction following together.
Grounding/coordinates: object locations (box coordinates) in an image and their names. Gives the ability to point at and refer to objects.
Chart/table understanding: chart or table images and their structured interpretation. Builds numeric reasoning.

The ratio and quality of data govern the result. Increase OCR data and it reads documents well, but skew too far and general conversation ability drops. Adjusting the data mix ratio per stage is a key practical skill.

Deep Dive 3: Input Format — How You Show It

The training input is a sequence of interleaved image tokens and text tokens, usually wrapped in a chat template.

Chat Template and Image Placeholder

The input is structured with system/user/assistant roles, with a placeholder token where an image goes. In preprocessing that slot is replaced by the actual visual tokens.

system: You are a helpful vision assistant.
user: [IMAGE] What is the total on this receipt?
assistant: The total is 32,500 won.

The IMAGE slot is replaced by a visual token block just before training. With multiple images, the corresponding block goes into each image slot.

Diverse Output Shapes

A VLM's output is not text only. The answer format of the training data becomes the output format the model learns.

Natural-language text: captions, answers, explanations.
Coordinates: in grounding tasks, box coordinates expressed as text tokens (e.g., a sequence of normalized numbers). The model generates coordinates like characters.
Structured output: JSON, markdown tables, etc. In document parsing the model is taught to generate key-value or table structure directly.

To get structured output, the training answers must also have the same structure. Since the model imitates the answer format, producing data in a consistent format is important.

Deep Dive 4: Loss — What You Learn Against

A VLM's training objective is next-token prediction, the same as an LLM. That is, learn to raise the probability of the next token at each position in the sequence.

loss = - sum over target positions of log P(token_t | tokens_<t)

The key is where the loss is computed. Usually the image tokens and the user prompt portion are excluded (masked) from loss computation, and loss is computed only on the assistant's response tokens.

[system] [user text] [IMAGE tokens] [assistant response]
   X          X           X(masked)        O(loss computed)

The reason is clear. We do not want the model to generate images or memorize the user's question; we want it to generate good responses. So we give the learning signal only on the target assistant response. Image tokens are usually not a generation target, so they are excluded from loss.

Loss Masking in Pseudocode

def compute_loss(logits, labels, loss_mask):
    # logits: B x T x V, labels: B x T, loss_mask: B x T (1=learn, 0=ignore)
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    shift_mask = loss_mask[:, 1:]

    token_loss = cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    )
    token_loss = token_loss.reshape(shift_labels.shape)

    # masking: image/user token positions = 0, assistant response = 1
    masked = token_loss * shift_mask
    return masked.sum() / shift_mask.sum().clamp(min=1)

How you build the loss_mask governs training quality. Marking only the assistant response as 1 and blocking image tokens and the prompt as 0 is a common starting point.

Full Training Step in Pseudocode

def training_step(batch, model):
    # batch: image tensors + tokenized interleaved sequence + loss_mask
    visual_tokens = model.vision_encoder(batch["images"])      # B x N_v x D_vis
    visual_tokens = model.projector(visual_tokens)             # B x N_v x D_llm

    # insert visual tokens at placeholder positions
    inputs_embeds = model.embed_and_inject(
        batch["input_ids"], visual_tokens, batch["image_positions"]
    )

    logits = model.llm(inputs_embeds=inputs_embeds,
                       attention_mask=batch["attention_mask"])

    loss = compute_loss(logits, batch["labels"], batch["loss_mask"])
    return loss

Deep Dive 5: Teaching Grounding and Coordinates as Text

How does a VLM learn the grounding ability to point at objects? The key is expressing coordinates as text tokens without a special module. Normalize box coordinates into an integer range like 0 to 999, then teach the model to output those numbers like a string.

input:  [IMAGE] Mark the dog with a box.
target: The dog is located at (x1=120, y1=340, x2=410, y2=720).

To the model, coordinates are just another token sequence. Generate the numbers by next-token prediction and they become a location. This lets you learn grounding with the same language-modeling loss, with no separate detection head.

Coordinate normalization: normalizing into a fixed range regardless of image size makes learning consistent across resolutions.
Bidirectional tasks: training both the task of outputting coordinates (mark the object) and the task of taking coordinates as input (what is in this region) hardens spatial understanding.
Format consistency: unify the coordinate notation across the whole dataset so the model stably generates numbers.

Deep Dive 6: Data Mix and Curriculum

What data you mix, and how much, per stage governs the final ability. This is called the data mix or curriculum.

alignment-stage mix (example)
  image-caption      most
  simple VQA         small

multitask-stage mix (example)
  image-caption      some
  VQA                substantial
  OCR/document       substantial
  grounding          some
  chart/table        some

instruction-stage mix (example)
  conversational instruction  most
  format-adherence examples   some
  refusal/safety examples     small

The core intuition is to raise data diversity and quality as stages progress. Early on, fix alignment with quantity; later, refine behavior with quality. If a specific ability is weak, raise that data's share, but always keep in mind the balance problem where over-raising one side degrades another ability.

Interference Between Abilities

Raise OCR data a lot and it reads documents well but free conversation can stiffen. Conversely, with only conversation data, precise coordinates or table extraction weakens. This inter-ability interference is common in VLM training, and the key in practice is to adjust the mix while tracking evaluation sets split by ability.

Deep Dive 7: Efficient Fine-tuning and Parameter Strategy

Training the whole model every time is expensive. So efficient fine-tuning strategies that train only a part are widely used.

Train the projector only: the lightest adaptation. Useful for quickly fitting a new vision encoder or domain.
LoRA-style low-rank adapters: instead of directly changing the weights of the LLM or vision encoder, train only small low-rank matrices added on. Memory and storage cost drop greatly, letting you swap several adapters on the same base model.
Partial freezing: freeze the vision encoder and train only the LLM and projector, balancing stability and cost.

fine-tuning cost comparison (intuition)
  full training        expensive,   maximum flexibility
  partial training     medium,      balanced
  LoRA-style adapter   cheap,       fast domain adaptation
  projector only       very cheap,  limited adaptation

For narrow tasks like domain-specific document extraction, LoRA-style is often enough, while changing the model's fundamental ability requires unfreezing more parameters. Choosing the strategy to match the task's nature and budget is important.

Deep Dive 7.5: Chat Template and Special Tokens

When constructing input, the chat template is not just a format but part of the learning signal. The model distinguishes roles by the template's special tokens as boundaries.

typical chat template structure
  <role:system> system instruction <role-end>
  <role:user> [image slot] user question <role-end>
  <role:assistant> model response <role-end>

Role boundary tokens: special tokens that separate system/user/assistant. These boundaries are needed to paint the loss mask correctly.
Image placeholder: a token marking where an image goes. Replaced by visual tokens in preprocessing.
Response end token: a token that teaches the model when to stop. Without it generation never ends.

A common mistake here is the chat template subtly differing between training and inference. Even one space or one token off and the model perceives a different input than at training, hurting performance. Training and inference must share the same template function.

Deep Dive 7.6: Training Stability and Common Failures

VLM training is more prone to instability than text training, because two kinds of data and three kinds of parts are entangled. Here are commonly seen failure patterns.

Loss divergence: unfreezing the vision encoder with too large a learning rate makes loss spike. Handle the encoder carefully with a low learning rate.
Modality collapse: the model ignores the image and answers from text only. Mix in enough data where the image is essential to the answer.
Repetitive generation: responses repeat the same phrase infinitely. Check data quality and end-token training.
Format ignoring: it emits free text even when structured output is requested. Increase format-adherence examples and secure consistency.

failure -> items to check
  loss divergence      learning rate, especially vision encoder lr
  modality collapse    share of image-essential data
  repetitive generation end token, data quality
  format ignoring      number of format examples, schema consistency

Such failures mostly stem from the data mix, learning rate, or mask setting. Checking these three first, before changing the model structure, is efficient.

The Alignment Stage: RLHF and DPO Briefly

Once instruction fine-tuning is done the model follows instructions. Add an alignment stage that reflects human preferences and the response's usefulness and safety improve.

RLHF: train a reward model that gives higher reward to responses humans prefer, then optimize the policy (the model) to that reward via reinforcement learning. Powerful but the pipeline is complex.
DPO: optimize the policy directly from preference pairs (good response, bad response) without a separate reward model. Simple and stable to implement, so widely used.

The same principle applies to VLMs. The difference is that preference data includes images, and preferences are often collected in a direction that reduces hallucination (saying things not in the image).

Comparison Table: Training Stages at a Glance

Stage	Main goal	Data	Mainly trained	Data amount/quality
Alignment pretraining	vision-language alignment	image-caption	projector	large amount, moderate quality
Multitask	broaden ability	mixed multitask	projector+LLM	medium
Instruction tuning	instruction following	instruction-response	LLM	small amount, high quality
Alignment (RLHF/DPO)	reflect preference	preference pairs	LLM	small, very high

Loss location	Masked?	Reason
Image tokens	masked (excluded)	not a generation target
System/user text	masked (excluded)	do not memorize the question
Assistant response	learned (included)	generating good responses is the goal

Deep Dive 8: Evaluation and Training Monitoring

You cannot tell whether training is going well from the loss curve alone. A VLM has multiple strands of ability, so you must track from many angles with per-ability evaluation sets.

monitoring dashboard (example)
  caption quality      validation-set score trend
  VQA accuracy         question-answer correctness
  OCR/document accuracy field-extraction accuracy
  grounding accuracy   coordinate match
  conversation quality instruction-adherence eval
  hallucination rate   frequency of ungrounded responses

A common pitfall here is driving training by a single metric. Push only OCR accuracy and at some point conversation quality may have collapsed. It is important to watch several metrics together and monitor that when one improves another does not worsen.

Early signals: if a specific ability's score plunges early in training, there may be a problem in the data mix or loss mask.
Overfitting: when validation scores stall or start dropping, it is time to increase data diversity or stop training.
Distribution check: periodically check whether the evaluation set represents the actual usage distribution.

Deep Dive 9: Multimodal Batching and Training Efficiency

VLM training is trickier in batch construction than text-only training. The visual token count differs per image (arbitrary resolution) and text lengths vary, so sequence lengths are uneven.

batch construction difficulty
  sample1: short text + small image (256 tokens)   -> total 300 tokens
  sample2: long text + large image (6000 tokens)   -> total 6500 tokens
  sample3: text only                               -> total 200 tokens

  with naive padding: pad to the longest sample -> large waste

There are techniques to reduce this inefficiency.

Length-based bucketing: gather samples of similar length into one batch to reduce padding waste.
Sequence packing: concatenate several short samples into one sequence, but use an attention mask to block boundaries between samples so they do not mix.
Image token cap: cap tokens per image under arbitrary resolution so an extremely long sample does not dominate the batch.

Training efficiency is cost. The more effective tokens you learn per GPU hour, the better the cost-performance, so batch construction is not a mere engineering detail but the core of training economics.

Deep Dive 10: Distributed Training and Memory

A large VLM does not fit on a single GPU. So distributed training strategies are needed.

Data parallel: place the same model replica on several GPUs, train on different data, then sum the gradients. The simplest.
Sharding (parameter splitting): split model parameters, optimizer states, and gradients across GPUs to distribute the memory burden. The standard for large-model training.
Activation checkpointing: do not store intermediate forward activations; recompute them during backprop to save memory. A trade-off that uses more compute to save memory.

main consumers of memory
  model parameters     fixed
  optimizer states     several times the parameters (depends on optimizer)
  gradients            same size as parameters
  activations          proportional to batch/sequence length (includes visual tokens)

The thing to especially watch in a VLM is activations. When visual tokens lengthen the sequence, activation memory spikes. So image token caps and activation checkpointing become important in large-image training.

Deep Dive 11: Synthetic Data and Data Augmentation

High-quality labeled data is expensive. So techniques to augment training data with synthetic data are widely used.

Template-based generation: generate tables or forms programmatically and create the answers alongside. Effective for tasks with clear structure.
Recaptioning with an existing model: re-caption images more accurately with a strong model to reduce noise.
Rendering synthesis: render text with diverse fonts, backgrounds, and resolutions to make endless OCR data. Watch for the domain gap.

synthetic data usage flow
  real data (small, high quality)  +  synthetic data (large, controllable)
   |
   v
  mixed training  ->  validate on real distribution  ->  check domain gap

The pitfall of synthetic data is the domain gap. Train only on too-clean synthetic images and it collapses on real noisy documents. Mixing synthetic and real appropriately, and always evaluating on real data, is safe.

Data Curation and Pitfalls

Noisy captions: web-crawled image-captions are often inaccurate or irrelevant. Filtering and recaptioning to raise quality has large effects.
Data imbalance: skewing too far toward a specific task (e.g., OCR) degrades other abilities. Monitor the mix ratio per stage.
Hallucination-inducing data: training heavily on captions that assert things not in the image makes the model hallucinate. Curating toward grounded responses is important.
Format mismatch: if the output format of training answers is uneven, the model cannot produce consistent structure. Fix the schema for JSON/table outputs.
Preprocessing mismatch: if image preprocessing (patch size, normalization) differs between training and inference, performance drops. Share the same pipeline on both sides.
Loss mask bugs: if the loss_mask is wrong and loss leaks on the prompt or image tokens, training goes in the wrong direction. Always validate the mask.

Closing

The essence of VLM training is defining inputs and outputs in stages. Alignment pretraining puts vision and language into the same space, multitask broadens the range of abilities, instruction tuning makes it follow instructions, and alignment reflects human preferences. Between them you turn the knobs of vision encoder freezing/unfreezing, data mix ratio, and loss masking to grow the abilities you want.

There is one core thing to remember. The model imitates the input you show and the output you treat as the target. Teach coordinates and it produces coordinates; teach tables and it produces tables. So the starting point of training is to decide which ability you want, then design the data with the input format and output shape that exactly reflect that ability.

References

Qwen2-VL: Enhancing Vision-Language Model's Perception (arXiv: 2409.12191) — arxiv.org/abs/2409.12191
Attention Is All You Need (arXiv: 1706.03762) — arxiv.org/abs/1706.03762
FlashAttention: Fast and Memory-Efficient Exact Attention (arXiv: 2205.14135) — arxiv.org/abs/2205.14135
Qwen official repository — github.com/QwenLM
Hugging Face Transformers docs — huggingface.co/docs
PyTorch official docs — pytorch.org
vLLM docs — docs.vllm.ai
vLLM repository — github.com/vllm-project/vllm