Multimodal AI Training Methods — Many Senses in One Model

Introduction: Why Multimodal
What Multimodality Means: The Inputs
Core Principle: Aligning Modalities Into One Space
In Depth: Fusion Strategies
Training Pipeline: From Pretraining to Fine-Tuning
- Stage 1: Large-Scale Pretraining
- Stage 2: Multitask / Instruction Fine-Tuning
Data: Scale and Quality
Evaluation: What and How We Measure
Limitations and Hallucination
A Closer Look at Contrastive Learning
Worked Example: How Zero-Shot Classification Works
Connection to Generative Multimodal
Training Stability and Scaling
Considerations for Multilingual and OCR
Practical Application Scenarios
Comparison: Alignment and Fusion Styles
Conclusion
References

Introduction: Why Multimodal

People do not perceive the world through a single sense. We read a caption while looking at a photo, and we hear sound while watching a video. The same concept may arrive as an image or as words, yet our minds ultimately fuse it into one meaning. Multimodal AI tries to implement this fusion inside a model.

Traditional deep learning models handled one modality at a time. An image classifier saw only images; a language model saw only text. But real problems cross modality boundaries. A question like "what is happening in this photo?" requires understanding image and text together. Document understanding, video retrieval, voice assistants, and robotics all increasingly demand a multimodal approach.

This article covers how to bind several senses (modalities) into one model. The crux is aligning different modalities into a shared semantic space, and then fusing those representations effectively. We move from CLIP-style contrastive learning, through pretraining and fine-tuning, to data scale and quality, evaluation, and limitations such as hallucination.

What Multimodality Means: The Inputs

First, what is a modality? A modality is the sensory channel or format through which data arrives. The common ones are:

Image: a pixel grid with spatial structure.
Text: a sequence of discrete, ordered tokens.
Audio: a waveform or spectrogram over time; the time axis is central.
Video: a sequence of image frames, often plus audio; space and time combine.

A multimodal model handles two or more modalities as input or output. Vision-language models (VLMs) that read an image and text together, text-to-image generators, and speech-to-text models all fall under the multimodal umbrella.

Different modalities have very different statistics. Images are continuous, high-dimensional, and locally correlated; text is discrete with strong long-range dependencies. Handling this heterogeneity is the starting point of multimodal learning.

Representation form per modality (conceptual)

Image  : patch grid     -> sequence of patch embeddings
Text   : token sequence -> sequence of token embeddings
Audio  : spectrogram    -> sequence of frame embeddings
Video  : frame bundle   -> sequence of spatiotemporal embeddings

Key: everything becomes "a sequence of embedding vectors"
     that flows into a shared backbone (e.g., a Transformer).
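
As one illustration of "everything becomes a sequence of embedding vectors", the sketch below cuts a toy image into patches and projects each patch to a vector. It uses NumPy rather than a deep learning framework so it runs anywhere; the function name `patchify`, the patch size, the width `d_model`, and the random projection weights are all assumptions standing in for a learned patch-embedding layer.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) array into a sequence of flattened patches."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be multiples of patch")
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c) -> (gh*gw, patch*patch*c)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))        # toy 32x32 RGB image
patches = patchify(image, patch=8)     # 16 patches, each 8*8*3 = 192 values

# Hypothetical linear patch embedding: random weights stand in for learned ones.
d_model = 64
w_embed = rng.normal(scale=0.02, size=(patches.shape[1], d_model))
patch_embeddings = patches @ w_embed   # a sequence of 16 vectors of width 64
```

Text tokens, audio frames, and video clips would go through analogous (but modality-specific) steps and end up in the same sequence-of-vectors form.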

Core Principle: Aligning Modalities Into One Space

The most fundamental idea in multimodal learning is mapping different modalities into a shared embedding space. We train so that an image and a text with the same meaning land close together in that space.

What a Shared Embedding Space Is

Each modality has its own encoder. The image encoder turns an image into a vector; the text encoder turns a sentence into a vector. When both encoders' outputs live in a space of the same dimension and semantically corresponding pairs are pulled close, the model can compare across modalities.

Shared embedding space (concept)

  image encoder ──> z_image  ─┐
                              ├─> vector space of common dimension D
  text encoder  ──> z_text   ─┘

  Goal: matching (image, text) pairs are close;
        non-matching pairs are far.

Contrastive Learning: CLIP-Style Image-Text Contrast

The representative technique for learning this alignment is contrastive learning. CLIP-style models gather large numbers of (image, caption) pairs. Within a batch, the correct pair is a positive and all other combinations are negatives; the model raises the similarity of positives and lowers that of negatives.

Similarity is usually the dot product of normalized embeddings (cosine similarity). The loss is a symmetric InfoNCE form covering both the image-to-text and text-to-image directions.

Contrastive loss (plain-text notation)

Batch size N, embedding dim D
Image embeddings I in R^(N x D), text embeddings T in R^(N x D)
L2-normalize each row.

Logit matrix:  logits = (I T^T) * exp(tau)
               shape = N x N
Diagonal entries (i==j) are the correct pairs.

Image->text loss:  CE(logits,    labels=[0..N-1])
Text->image loss:  CE(logits^T,  labels=[0..N-1])
Final loss = (loss_i2t + loss_t2i) / 2

tau: learned log of the logit scale; exp(tau) is the inverse temperature
CE: cross entropy

A strength of contrastive learning is that it needs no manual labeling. Images and surrounding text from the web (alt text, captions) enable weakly supervised learning. The result is strong transfer, such as zero-shot classification: just compare the embedding of a text prompt ("a photo of a cat") with the image embedding.

Other Axes of Alignment

Contrastive learning is not the only path. Image-text matching (classifying whether a given pair belongs together) and generative objectives such as captioning (generating text from an image) also connect modalities. Large models in practice mix contrastive, generative, and masked-reconstruction objectives. In every case the central question is the same: how do we make representations from different modalities share the same meaning?

In Depth: Fusion Strategies

If alignment brings representations close, fusion actually combines information from several modalities into a single decision. By the timing of fusion, we distinguish three broad styles.

Early Fusion

Combine modalities into one sequence at the input stage and feed them to the same backbone. Concatenating image patch embeddings and text token embeddings, then passing them through a single Transformer, is the canonical example.

Early fusion

[img_patch_1 ... img_patch_M, txt_tok_1 ... txt_tok_K]
        |
   single Transformer (self-attention over everything)
        |
   joint representation -> output

The advantage is rich cross-modal interaction from the start. The drawbacks are a longer input sequence (self-attention cost grows quadratically with the combined length M + K) and the need to unify preprocessing across modalities.

Late Fusion

Process each modality with an independent encoder all the way through, then combine the representations at the end (concatenation, average, weighted sum). CLIP's inference stage is effectively late fusion. Independent encoders help modularity and caching, but fine-grained interaction is limited.

Late fusion

image -> image encoder -> z_image ─┐
text  -> text encoder  -> z_text  ─┤-> combine (concat/dot) -> output

Cross-Attention Fusion

Encode the modalities separately but insert cross-attention layers so one modality can attend to another. A common pattern is a text decoder attending to image features as key/value. Modern VLMs with an adapter/projector plus an LLM decoder are close relatives: they usually feed projected visual tokens into the LLM's own self-attention rather than adding dedicated cross-attention layers.

Cross-attention fusion

image -> vision encoder -> vision features (key/value source)
                                 |
text  -> LLM decoder -- cross-attn --> attend to vision features
                                 |
                             output tokens
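
To make the key/value pattern in the diagram concrete, here is a minimal single-head cross-attention sketch in NumPy. Queries come from text hidden states and keys/values come from vision features. The dimensions, the random weight matrices, and the function name `cross_attention` are illustrative assumptions; a real decoder layer would add multiple heads, residual connections, and normalization.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_h, vision_f, w_q, w_k, w_v):
    """Text positions (queries) attend to vision features (keys/values)."""
    q = text_h @ w_q                           # (K, d)
    k = vision_f @ w_k                         # (M, d)
    v = vision_f @ w_v                         # (M, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (K, M) scaled dot products
    weights = softmax(scores, axis=-1)         # each text position sums to 1 over patches
    return weights @ v, weights

rng = np.random.default_rng(0)
K, M, d_txt, d_vis, d = 5, 9, 32, 48, 16       # assumed toy sizes
text_h = rng.normal(size=(K, d_txt))
vision_f = rng.normal(size=(M, d_vis))
w_q = rng.normal(scale=0.1, size=(d_txt, d))
w_k = rng.normal(scale=0.1, size=(d_vis, d))
w_v = rng.normal(scale=0.1, size=(d_vis, d))

attended, weights = cross_attention(text_h, vision_f, w_q, w_k, w_v)
```

Note that the text sequence length stays K regardless of how many patches M the image produces, which is one reason this style is cheaper than concatenating everything into one sequence.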

The three styles are not mutually exclusive. A practical model embeds the image with a vision encoder (late flavor), projects it into the LLM input space, and mixes it with text inside the LLM (early/cross flavor).

Vision-Language Adapter/Projector

The typical modern VLM is built like this. A ViT-style vision encoder turns the image into patch features, and an adapter (projector) maps them into the LLM's token embedding space. The adapter may be a plain MLP, or a Q-Former-style module that compresses information with learnable queries. The image then enters the LLM sequence as "virtual tokens" processed alongside text.

Modern VLM pipeline (concept)

image -> ViT vision encoder -> patch features (M x D_v)
                              |
                         projector/adapter (MLP or Q-Former)
                              |
                         visual tokens (M' x D_llm)
                              |
[visual tokens ... text tokens] -> LLM decoder -> response
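
The projector step in the pipeline above can be sketched as a two-layer MLP that maps patch features of width D_v to the LLM width D_llm, after which the visual tokens are concatenated in front of the text embeddings. This is a sketch under assumptions: the sizes, the random weights, and the choice of a GELU nonlinearity are illustrative, and with a plain MLP the token count is unchanged (M' = M), whereas a Q-Former-style module would reduce it.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_projector(patch_feats, w1, b1, w2, b2):
    """Map (M, D_v) vision features into the LLM embedding width (M, D_llm)."""
    return gelu(patch_feats @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
M, D_v, D_llm, K = 16, 48, 64, 6               # assumed toy sizes
patch_feats = rng.normal(size=(M, D_v))        # stand-in for vision encoder output

w1 = rng.normal(scale=0.05, size=(D_v, D_llm))
b1 = np.zeros(D_llm)
w2 = rng.normal(scale=0.05, size=(D_llm, D_llm))
b2 = np.zeros(D_llm)

visual_tokens = mlp_projector(patch_feats, w1, b1, w2, b2)       # (M, D_llm)
text_embeds = rng.normal(size=(K, D_llm))                        # stand-in for token embeddings
llm_input = np.concatenate([visual_tokens, text_embeds], axis=0) # (M + K, D_llm)
```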

Training Pipeline: From Pretraining to Fine-Tuning

Training a multimodal model usually splits into two stages.

Stage 1: Large-Scale Pretraining

Learn alignment and basic representations from large weakly supervised data (web image-text pairs and the like). Here the vision encoder is often frozen or only partly trained, while the adapter and parts of the LLM are aligned first. The goal is to pull modalities into the same space.

Stage 2: Multitask / Instruction Fine-Tuning

Gather many tasks (VQA, captioning, document understanding, OCR-based reasoning) in instruction format and fine-tune. Here the vision encoder may be unfrozen for finer fitting. Carefully scheduling freeze/unfreeze is key to stable convergence.

Below is pseudocode for the training loop, meant to show the flow rather than a real implementation.

# Multimodal contrastive pretraining pseudocode (conceptual)
import torch
import torch.nn.functional as F

def clip_step(images, texts, image_encoder, text_encoder, logit_scale, optimizer):
    # 1) encode each modality
    img_feat = image_encoder(images)      # (N, D)
    txt_feat = text_encoder(texts)        # (N, D)

    # 2) L2-normalize -> cosine-similarity based comparison
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)

    # 3) similarity logits (with temperature)
    scale = logit_scale.exp()
    logits_per_image = scale * img_feat @ txt_feat.t()   # (N, N)
    logits_per_text = logits_per_image.t()

    # 4) symmetric cross entropy with the diagonal as the target
    n = images.size(0)
    labels = torch.arange(n, device=images.device)
    loss_i2t = F.cross_entropy(logits_per_image, labels)
    loss_t2i = F.cross_entropy(logits_per_text, labels)
    loss = (loss_i2t + loss_t2i) / 2

    # 5) backprop
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

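# Freeze/unfreeze control across training stages (conceptual)
def set_trainable(module, flag):
    # toggle gradient tracking for every parameter in a submodule
    for p in module.parameters():
        p.requires_grad = flag

def configure_trainable(model, stage):
    # assumes model exposes .vision_encoder, .adapter, and .llm submodules
    if stage == "align":
        # stage 1: freeze the vision encoder, train the adapter
        # (the LLM is left as the caller configured it)
        set_trainable(model.vision_encoder, False)
        set_trainable(model.adapter, True)
    elif stage == "instruct":
        # stage 2: unfreeze the vision encoder and fine-tune adapter + LLM
        # (this unfreezes the whole encoder; to unfreeze only part of it,
        #  call set_trainable on the chosen blocks instead)
        set_trainable(model.vision_encoder, True)
        set_trainable(model.adapter, True)
        set_trainable(model.llm, True)
    else:
        raise ValueError(f"unknown stage: {stage!r}")
    return model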

Data: Scale and Quality

Multimodal performance hinges on data. Watch two axes.

Scale: web-scale image-text pairs give diversity but are noisy. Wrong captions, irrelevant alt text, and duplicates are common.
Quality: filtering (similarity-based cleaning), deduplication, safety filters, and re-captioning raise quality. Augmenting the data with high-quality model-generated captions has also become common.

A data-side trap is distribution bias. Skewing toward certain languages or cultures degrades performance elsewhere. For OCR or document understanding, include enough text-rich images.

Data pipeline (concept)

raw collection -> dedup -> similarity filter (image-text match) ->
safety filter -> re-caption/augment -> shuffle/shard -> training
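
The first two stages of this pipeline can be sketched in plain Python. This is a simplified illustration: `dedup` only removes exact duplicates after caption normalization (real pipelines typically also use near-duplicate detection), and `toy_score` is a hypothetical stand-in for an image-text similarity model; the scores and the threshold are made up for the example.

```python
import hashlib

def dedup(pairs):
    """Drop exact duplicates of (image_id, normalized caption), keeping the first."""
    seen, kept = set(), []
    for image_id, caption in pairs:
        normalized = caption.strip().lower()
        key = hashlib.sha256(f"{image_id}\t{normalized}".encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append((image_id, caption))
    return kept

def similarity_filter(pairs, score_fn, threshold):
    """Keep only pairs whose image-text match score reaches the threshold."""
    return [pair for pair in pairs if score_fn(*pair) >= threshold]

raw = [
    ("img1", "a cat on a sofa"),
    ("img1", "A cat on a sofa "),      # duplicate after normalization
    ("img2", "click here to buy"),     # caption unrelated to the image
    ("img3", "a dog running"),
]

# Hypothetical precomputed match scores; a real pipeline would call a model here.
toy_scores = {"img1": 0.31, "img2": 0.08, "img3": 0.27}

def toy_score(image_id, caption):
    return toy_scores[image_id]

clean = similarity_filter(dedup(raw), toy_score, threshold=0.2)
```

After both stages, `clean` holds the first cat pair and the dog pair; the duplicate and the mismatched caption are gone.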

Evaluation: What and How We Measure

Evaluation is task-specific.

Zero-shot classification/retrieval: accuracy when classifying with labels not seen as training classes, and Recall@K for image-text retrieval.
VQA: accuracy on image-grounded question answering.
Captioning: caption quality (reference-based metrics and human evaluation).
Document/OCR understanding: reading and reasoning over tables, charts, and documents.
Hallucination measurement: dedicated benchmarks that measure how often the model invents objects not present in the image.
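
Of the metrics listed, Recall@K is simple enough to sketch directly. The example assumes a square similarity matrix in which the correct candidate for query i is candidate i (one caption per image); datasets with several captions per image need a slightly different bookkeeping. The matrix values are made up.

```python
import numpy as np

def recall_at_k(sim, k):
    """sim[i, j] is the similarity of query i and candidate j; the match for query i is candidate i."""
    true_scores = np.diag(sim)[:, None]
    # rank of the true candidate = number of candidates scored strictly higher
    ranks = (sim > true_scores).sum(axis=1)
    return float((ranks < k).mean())

sim = np.array([
    [0.9, 0.1, 0.3],   # query 0: true match ranked first
    [0.2, 0.4, 0.8],   # query 1: true match ranked second
    [0.5, 0.6, 0.7],   # query 2: true match ranked first
])

r1 = recall_at_k(sim, 1)   # 2 of 3 queries have the match at rank 1
r2 = recall_at_k(sim, 2)   # all queries have the match within the top 2
```

Transposing `sim` gives the metric for the opposite retrieval direction.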

A single metric misleads easily. Automatic metrics can look good while humans find the output unnatural, and vice versa. Pairing automatic metrics with human evaluation is standard.

Limitations and Hallucination

A signature limitation is hallucination. The model may mention objects not in the image, or misread the image under the pull of textual priors. Causes are varied.

Excessive language prior: a strong LLM decoder generates plausible answers without image grounding.
Insufficient alignment: when vision features are not aligned into the LLM space, details are missed.
Data noise: training on wrong captions teaches the model wrong associations.

Mitigations include stronger alignment training, objectives that enforce image grounding, hallucination-aware data cleaning, and prompt designs that force explicit reference to image regions at inference. Even so, full removal is hard; applications should assume uncertainty.

Other limitations include the cost of high-resolution or long-video processing, modality imbalance (less video data than text), and the difficulty of evaluation.

A Closer Look at Contrastive Learning

Contrastive learning is the engine of multimodal alignment. But training it well hinges on a few details. The loss looks simple on the surface, yet results vary greatly with the number of negatives, the temperature, and batch composition.

Negatives and Batch Size

In the InfoNCE loss, the number of negatives comes from batch size. With N pairs in a batch, each positive automatically gets N-1 negatives. So a larger batch poses a harder contrastive problem and sharpens the representation. This is why contrastive learning favors large batches.

Batch size and negative count (concept)

batch N=256  -> 255 negatives per positive
batch N=4096 -> 4095 negatives per positive

More negatives:
  - harder contrast -> tends to improve representation quality
  - higher memory demand
Response: gather embeddings across devices to expand the negative pool

Gathering embeddings across multiple GPUs to enlarge the negative pool is common. Instead of each device seeing only its own batch, sharing all embeddings as negatives gives an effectively larger batch.
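
The effect of gathering can be simulated on one machine. In the sketch below, a list of arrays stands in for per-device batches and `np.concatenate` stands in for the all-gather collective; the device count, batch size, and embedding width are assumptions. The point to notice is how the label indices shift: each device's positives sit at an offset inside the global pool.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
num_devices, local_n, dim = 4, 8, 16           # assumed toy setup

# One shard of image and text embeddings per simulated device.
img_shards = [l2_normalize(rng.normal(size=(local_n, dim))) for _ in range(num_devices)]
txt_shards = [l2_normalize(rng.normal(size=(local_n, dim))) for _ in range(num_devices)]

# Stand-in for all-gather: every device sees the concatenation of all text shards.
all_txt = np.concatenate(txt_shards, axis=0)   # (num_devices * local_n, dim)

rank = 1                                       # view from simulated device 1
local_logits = img_shards[rank] @ all_txt.T    # (local_n, num_devices * local_n)

# Positives for this device live at an offset in the global pool.
labels = rank * local_n + np.arange(local_n)
negatives_per_positive = all_txt.shape[0] - 1  # 31 here, versus 7 without gathering
```

How gradients flow back through the gather is a separate implementation concern that this simulation does not model.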

The Role of Temperature

Temperature controls the sharpness of the similarity distribution. Low temperature (large scale) pressures the model to separate positives and negatives more strongly; high temperature softens the distribution. Making temperature a learned parameter lets the model find an appropriate sharpness on its own.

Effect of temperature (concept)

logits = (normalized embedding dot product) * scale,  scale = exp(tau)

large scale (low temp): sharp distribution -> strong contrast, risk of instability
small scale (high temp): soft distribution -> weak contrast

In practice: learn tau, but cap the scale exp(tau) to prevent runaway
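
A small numeric sketch shows how the scale changes the softmax over one row of similarities. The cosine similarities are made up, with the positive in position 0, and the cap `MAX_SCALE = 100.0` is an assumed value chosen for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())     # subtract max for numerical stability
    return e / e.sum()

cos_sims = np.array([0.30, 0.25, 0.10, 0.05])   # positive first, then three negatives
MAX_SCALE = 100.0                               # assumed cap on exp(tau)

def probs_at(tau):
    scale = min(float(np.exp(tau)), MAX_SCALE)  # learned log-scale, clamped
    return softmax(cos_sims * scale)

p_soft = probs_at(np.log(1.0))     # scale 1: nearly uniform distribution
p_sharp = probs_at(np.log(50.0))   # scale 50: most mass on the positive
p_capped = probs_at(10.0)          # exp(10) is far above the cap, so scale is 100
```

With scale 1 the positive gets only slightly more than a quarter of the probability mass, so the loss gradient barely distinguishes it from the negatives; with scale 50 a similarity gap of 0.05 already dominates.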

Hard Negatives

Not all negatives are equally useful. Easily separated negatives (totally unrelated pairs) give a weak training signal, while confusable negatives (similar meaning but different pairs) give a strong one. Deliberately including such hard negatives improves fine discrimination. The catch: with label noise, the risk of mistaking a true positive for a negative grows.
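
One simple way to find confusable negatives, sketched here under the assumption of one caption per image, is to rank the non-matching texts for each image by similarity and keep the top k. The similarity values are invented, and `hardest_negatives` is a hypothetical helper name. Because of the noise caveat above, a real pipeline might additionally skip candidates whose similarity is suspiciously close to the positive's.

```python
import numpy as np

def hardest_negatives(sim, k):
    """For each image (row), return indices of the k most similar non-matching texts."""
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)    # exclude the positive on the diagonal
    order = np.argsort(-masked, axis=1)  # descending similarity
    return order[:, :k]

sim = np.array([
    [0.9, 0.7, 0.1, 0.2],
    [0.3, 0.8, 0.6, 0.1],
    [0.2, 0.5, 0.9, 0.4],
    [0.6, 0.1, 0.3, 0.7],
])

hard = hardest_negatives(sim, k=2)
```

For image 0, for example, the hardest negatives are texts 1 and 3 (similarities 0.7 and 0.2), while text 2 at 0.1 is an easy negative that contributes little signal.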

Worked Example: How Zero-Shot Classification Works

Let us trace concretely how a model aligned by contrastive learning classifies images without any task-specific labeled training examples. The key is reframing classification as an "image-text retrieval" problem.

Zero-shot classification procedure (concept)

1) build a prompt per candidate class
   "a photo of a cat", "a photo of a dog", ...
2) embed each prompt with the text encoder -> t_1, t_2, ...
3) embed the input image -> v
4) compute cosine similarity of v with each t_k
5) predict the most similar class

Key: even for a label never seen as a training class,
     classification works because image and text meaning are aligned in one shared space
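
The five steps translate almost line for line into code. In this sketch the hand-written 3-dimensional vectors are stand-ins for real encoder outputs (a real system would obtain them from the text and image encoders), and `zero_shot_classify` is a hypothetical helper name.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(image_emb, prompt_embs, class_names):
    """Pick the class whose prompt embedding has the highest cosine similarity to the image."""
    sims = l2_normalize(prompt_embs) @ l2_normalize(image_emb)   # step 4
    return class_names[int(np.argmax(sims))], sims               # step 5

class_names = ["cat", "dog", "car"]

# Steps 1-2: toy embeddings standing in for the encoded prompts
# "a photo of a cat", "a photo of a dog", "a photo of a car".
prompt_embs = np.array([
    [1.0, 0.1, 0.0],
    [0.1, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Step 3: toy embedding standing in for the encoded input image.
image_emb = np.array([0.2, 0.9, 0.1])

pred, sims = zero_shot_classify(image_emb, prompt_embs, class_names)
```

Adding a new class only requires encoding one more prompt; no retraining is involved.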

The same procedure applies to retrieval. Since image and text embeddings share a space, both text-to-image and image-to-text retrieval are possible. The power of alignment lies precisely in this cross-modal ability.

Connection to Generative Multimodal

We have focused on understanding, but the other major axis of multimodal AI is generation: creating an image from text, or writing a long description from an image. The same principles of alignment and fusion apply to generation.

Image -> text generation (captioning, VQA): inject vision features into the LLM, which autoregressively generates text conditioned on them. The adapter+LLM structure from earlier applies directly.
Text -> image generation: generate an image conditioned on a text embedding. Diffusion models or VQ-based autoregressive generation are representative. Here, text-image alignment quality governs output fidelity.

Understanding vs generative (concept)

understanding: [image+text] -> understand/judge -> text answer
generative (captioning): [image] -> condition -> autoregressive text
generative (T2I): [text] -> condition -> image generation (diffusion/VQ AR)

Common: modality alignment quality is the foundation for all tasks

The two axes are not separate; they stand on the same foundation of aligned representations. Good alignment lifts both understanding and generation at once.

Training Stability and Scaling

Large-scale multimodal training easily becomes unstable. Different modalities have different loss scales, the vision encoder and LLM learn at different rates, and data noise makes the loss fluctuate. Several practical devices help stabilize training.

Separate learning rates: give different learning rates to the vision encoder, adapter, and LLM. Usually the freshly initialized adapter gets a high rate and the pretrained vision encoder a low one.
Warmup and schedule: warm up the learning rate early to avoid large, destabilizing updates, then decay it (for example with a cosine schedule) to stabilize the later phase.
Gradient clipping: prevent gradient runaway caused by noisy data.
Staged unfreezing: with the freeze/unfreeze schedule seen earlier, release more parameters only after alignment stabilizes.
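
Of these devices, gradient clipping is the easiest to show in isolation. The sketch below clips by global norm: it measures the norm over all gradient arrays together and rescales them by a common factor when the norm exceeds the limit. The gradient values and `max_norm` are made up, and the small epsilon in the denominator is an assumed safeguard against division by zero.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients by one factor so their combined L2 norm is at most max_norm."""
    total = float(np.sqrt(sum(float((g ** 2).sum()) for g in grads)))
    scale = min(1.0, max_norm / (total + eps))   # never scales gradients up
    return [g * scale for g in grads], total

# Toy gradients for two parameter tensors; global norm = sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(float((g ** 2).sum()) for g in clipped)))
```

Using one shared factor preserves the direction of the overall update, which per-tensor clipping would not.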

Per-module learning rate example (concept, relative)

adapter/projector : high (learned fresh)
LLM               : medium (fine-tuned)
vision encoder    : low (preserve pretraining)

Principle: train freshly initialized parts fast,
          parts that already have good representations slowly
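
Per-module rates combine naturally with a warmup-plus-cosine schedule. The following is a minimal sketch using only the standard library; the base learning rates, step counts, and the floor `min_ratio` are placeholder assumptions chosen to show the relative ordering, not recommended values.

```python
import math

# Assumed base rates: only the ordering adapter > LLM > vision encoder matters here.
BASE_LR = {"adapter": 1e-3, "llm": 2e-5, "vision_encoder": 2e-6}

def lr_at(step, base_lr, warmup_steps, total_steps, min_ratio=0.1):
    """Linear warmup to base_lr, then cosine decay down to min_ratio * base_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return base_lr * (min_ratio + (1.0 - min_ratio) * cosine)

def lrs_at(step, warmup_steps=100, total_steps=1000):
    """Learning rate for every module at a given step, sharing one schedule shape."""
    return {name: lr_at(step, lr, warmup_steps, total_steps) for name, lr in BASE_LR.items()}

start = lrs_at(0)      # first warmup step: 1% of each base rate
peak = lrs_at(100)     # warmup finished: each module at its base rate
end = lrs_at(1000)     # end of training: each module at 10% of its base rate
```

In a framework with parameter groups, each entry of the returned dictionary would be assigned to the group holding that module's parameters.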

On scaling, it is important to grow data, model, and compute in balance. Growing only one yields quickly diminishing returns. Without data quality, scaling the model alone does not improve alignment; conversely, with abundant data but insufficient capacity, the model cannot absorb it.

Considerations for Multilingual and OCR

Using a multimodal model for multilingual documents or OCR requires special care in data composition. Typical web image-caption data may not contain enough text-rich images (documents, tables, charts, signs).

Data augmentation for OCR/document understanding (concept)

general data only : mostly natural images -> weak text reading
after augmenting   : add documents/tables/charts/screenshots/signs
                   + multilingual text images
                   -> improved OCR-free document understanding

Key: secure a share of samples where the text to read is inside the image

For multilingual use, if certain languages (especially those in non-Latin scripts) are underrepresented, OCR and understanding for them degrade. What matters is language balance, script diversity, and deliberately including images that contain text in each target language.

Practical Application Scenarios

Let us see how the principles combine in real applications through a few scenarios.

Image captioning / alt-text generation: describe images with a vision encoder + adapter + LLM setup. Fine-tuning that emphasizes image grounding is important to reduce hallucination.
Document question answering: process documents at high or dynamic resolution and read tables and charts directly from pixels (OCR-free) to answer. The share of document images in the training data is decisive.
Image-text retrieval: index embeddings aligned by contrastive learning for bidirectional retrieval. Embedding quality governs retrieval accuracy.
Multimodal assistant: a conversation interleaving several modalities. Context cost management and hallucination mitigation are the core challenges.

Each scenario is ultimately a different combination of the same two principles (alignment and fusion). Which modalities are involved, how tightly they are aligned, and at what point they are fused determine the character of the application.

Comparison: Alignment and Fusion Styles

Aspect          Technique              Pros                                          Cons
Alignment goal  contrastive (InfoNCE)  weakly supervised, strong zero-shot transfer  weak for fine-grained generation
Alignment goal  generative/captioning  fine-grained language generation              indirect alignment signal
Fusion          early                  rich interaction                              long sequence, high cost
Fusion          late                   easy modularity/caching                       weak interaction
Fusion          cross-attn             balances cost and interaction                 structurally complex

Conclusion

Multimodal learning reduces to two things. First, align different modalities into the same semantic space. Second, fuse those representations for the task at hand. Contrastive learning produced strong alignment from weakly supervised data alone, and modern VLMs that combine an adapter with an LLM extended this alignment into rich generation.

At the same time, hallucination and data bias remain open problems. Balancing scale and quality, scheduling freeze/unfreeze, and evaluating carefully are the practical levers for a good multimodal model. The next article covers how images and audio are actually turned into tokens and woven into one sequence — the specifics of tokenization and fusion.