Analyzing SOTA Multimodal LLMs — One Model to See, Hear, and Speak
- Author: Youngju Kim (@fjvbn20031)
- Introduction: Why Multimodal
- What Is a Modality
- The Core Idea: Everything Becomes Tokens
- Architecture Components
- A Closer Look at the any-to-any Flow
- Native Multimodal vs. Adapter Grafting
- Training Strategy: From Alignment to Instruction
- Tokenization, Resolution, Efficiency
- Representative Families, Conceptually
- Lineage: The Flow of Progress
- A Closer Look at Cross-Modal Attention
- Two Branches of Adapter Grafting: Prefix and Cross-Attention
- Audio and Speech: Two Branches
- Video: The Challenge of the Time Axis
- Benchmarks: What Is Measured and How
- Limitations and Open Problems
- Practical Implications
- Closing
- References
Introduction: Why Multimodal
A few years ago, the word "LLM" essentially meant text. A user typed a sentence and the model replied with a sentence: it was a pure language model and nothing more. But humans never reason with text alone. We see scenes with our eyes, hear sounds with our ears, speak with our mouths, and draw with our hands. We weave all of these sensory channels, or modalities, into a single understanding and expression of the world.
Multimodal Large Language Models aim squarely at this. They integrate multiple modalities into a single large language model, extending its text-handling ability to images, audio, and video. Over the past few years this field has advanced rapidly, and the goal of a single model that can "see, hear, and speak" is steadily becoming reality.
This post lays out multimodal LLMs from the basic concepts through architecture, training strategy, representative model families, benchmarks, and limitations. Because AI moves extremely fast, I will emphasize concepts and architectural principles over specific rankings or the latest numbers. The detailed specifications of particular commercial models are often undisclosed, so I will treat them carefully, within what is reliably known.
What Is a Modality
Let us first settle the terminology. A modality is a format or sensory channel that carries information. The representative ones include:
- Text: natural-language sentences, code, formulas, and so on
- Images: photos, diagrams, screenshots, document scans, and the like
- Audio: speech, music, ambient sound
- Video: a time-ordered sequence of images together with its accompanying audio
"Multimodal" means handling two or more of these at once. For example, a model that takes an image as input and describes its content in text handles two modalities, image and text. Going one step further, freely combining arbitrary modalities in both input and output is called any-to-any: asking in text and answering with an image, listening to audio and summarizing in text, or looking at an image and describing it aloud.
The Core Idea: Everything Becomes Tokens
The single most important key to understanding multimodal LLMs is the concept of the "token." Originally an LLM splits text into tokens. It turns words or subwords into integer IDs, converts each ID into a high-dimensional vector (an embedding), and flows this sequence of vectors through a Transformer. The Transformer learns the relationships between tokens through self-attention.
The core insight of multimodality is simple: whether the input is an image or audio, if we can turn it into a sequence of vectors, we can feed it to the Transformer exactly like text tokens. In other words, as long as we map different modalities into a common, unified token space, the LLM can process them the same way without caring whether they came from text or an image.
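As a minimal sketch of this idea, the snippet below builds one sequence of same-width vectors from text tokens and image features. The vocabulary, the dimensions, the random weights, and the helper names `embed_text` and `project_features` are all hypothetical, chosen only for illustration; a real model learns these weights and uses far larger dimensions.

```python
import random

EMBED_DIM = 4      # hypothetical LLM embedding width
FEATURE_DIM = 3    # hypothetical image-encoder feature width

rng = random.Random(0)

# Hypothetical text embedding table: one EMBED_DIM vector per token ID.
vocab = {"what": 0, "is": 1, "this": 2}
embedding_table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab]

# Hypothetical projection matrix (FEATURE_DIM x EMBED_DIM) into the LLM space.
projection = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(FEATURE_DIM)]


def embed_text(words):
    """Look up one embedding vector per text token."""
    return [embedding_table[vocab[w]] for w in words]


def project_features(features):
    """Map each FEATURE_DIM encoder vector to EMBED_DIM with a matrix product."""
    return [
        [sum(f[i] * projection[i][j] for i in range(FEATURE_DIM)) for j in range(EMBED_DIM)]
        for f in features
    ]


# Two made-up patch features standing in for an image encoder's output.
image_features = [[0.2, -0.1, 0.5], [0.9, 0.3, -0.4]]

# One sequence of same-width vectors: image tokens first, then the question.
unified_sequence = project_features(image_features) + embed_text(["what", "is", "this"])
```

After this step the backbone sees five vectors of equal width and has no built-in notion of which ones came from pixels.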
Thanks to this idea, the architecture of a multimodal LLM takes on the following shared skeleton.
   [image]          [audio]          [text]
      |                |                |
 image encoder    audio encoder     tokenizer
      |                |                |
  projector        projector        embedding
      |                |                |
      +----------------+----------------+
                       |
             unified token sequence
                       |
                 +-----------+
                 |    LLM    |  <- Transformer backbone
                 | (decoder) |
                 +-----------+
                       |
             output token sequence
                       |
               +-------+-------+
               |               |
          text decode   image/audio decoder
This diagram is the typical blueprint of a multimodal LLM. Let us take apart each component one by one.
Architecture Components
1. Modality Encoder
Each non-text modality first passes through a dedicated encoder. The encoder's role is to compress the raw input (pixels, waveforms) into meaningful feature vectors.
For images, CLIP-family (Contrastive Language-Image Pre-training) vision encoders were long used as the de facto standard. CLIP is a model that aligns images and text into the same embedding space through contrastive learning. Because it already produces image representations that are compatible with text, it is well suited for grafting onto a language model. The CLIP variants used for this purpose typically have a Vision Transformer (ViT) backbone, which cuts the image into patches and treats each patch like a token.
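The patch-cutting step can be sketched in a few lines. This is a simplified illustration on a single-channel toy image, and `patchify` is a hypothetical helper name; a real ViT additionally applies a learned linear projection and position embeddings to each flattened patch.

```python
def patchify(image, patch_size):
    """Split an H x W grid of pixel values into flattened square patches.

    Patches are returned in row-major order, one flat list per patch.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches


# A toy 4x4 single-channel "image" with pixel values 0..15.
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(toy_image, patch_size=2)  # 4 patches of 4 pixels each
```

Each entry of `patches` plays the role of one image token before encoding.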
For audio, a common approach is to convert the waveform into a mel spectrogram and pass it through a Transformer-based encoder. The encoder from the well-known Whisper family, used in speech recognition, is a representative example that has learned robust speech representations across many languages and noisy conditions.
For video, frames are encoded like images but the time axis must be taken into account as well. Frames are sampled at fixed intervals, each processed by an image encoder, and temporal position information is added, or 3D attention that looks at space and time jointly is applied.
2. Projector (Connector)
The feature vectors that the encoder produces still differ from the LLM's embedding space in both dimensionality and semantic distribution. The projector is the bridge that closes this gap. It takes the encoder output and transforms it into the LLM's token embedding space.
Projectors come in a few forms.
- Linear projection: the simplest, matching dimensions with a single matrix. Early LLaVA used this and produced surprisingly strong results.
- MLP: stacking several linear layers with nonlinear activations to increase expressiveness. Later widely adopted in improved LLaVA variants.
- Cross-attention-based resampler: for example, the Perceiver Resampler in Flamingo or the Q-Former in BLIP-2, where a small number of learnable query tokens extract information from the encoder features and compress it into a fixed number of tokens. This is useful for reducing the hundreds of patch tokens that a single image might otherwise produce, improving efficiency.
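The resampler idea in the last item can be sketched as a single cross-attention step. This is a deliberately reduced illustration, not the Perceiver Resampler or Q-Former as published: the queries are fixed stand-ins for learned parameters, there are no learned key or value projections, and `resample` is a hypothetical name.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def resample(queries, features):
    """Compress len(features) vectors into len(queries) vectors.

    Each query scores every feature by scaled dot product, and its output
    is the attention-weighted average of the features.
    """
    scale = math.sqrt(len(features[0]))
    outputs = []
    for q in queries:
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / scale for f in features]
        weights = softmax(scores)
        outputs.append(
            [sum(w * f[d] for w, f in zip(weights, features)) for d in range(len(q))]
        )
    return outputs


# Six made-up patch features (dimension 2) and two stand-in "learned" queries.
patch_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5], [0.4, 0.6]]
queries = [[4.0, 0.0], [0.0, 4.0]]
compressed = resample(queries, patch_features)  # 6 tokens in, 2 tokens out
```

The output length depends only on the number of queries, which is why this kind of connector keeps the token count fixed regardless of how many patches the encoder emits.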
3. LLM Backbone
The multimodal tokens, now aligned into the token space, are placed alongside text tokens and enter the LLM backbone. This backbone is usually an already well-trained decoder-only Transformer. Because the language model has already learned world knowledge and reasoning ability from vast amounts of text, layering visual and auditory information on top lets it reuse that knowledge for visual question answering or audio understanding.
The key point is that tokens of different modalities interact through attention. For example, when answering "what is the person on the left holding in this photo," the text tokens (the question) and the image tokens (the photo's patches) reference one another within the same attention layers to produce the answer.
4. The Output Side: The Path to Generation
Up to here the story has been mostly about "understanding." Seeing an image and answering in text is a case where the input is multimodal but the output is text. To become truly any-to-any, the model must be able to produce non-text modalities on the output side as well.
There are roughly two approaches for this.
First, calling an external generation model like a tool. The LLM writes an image-generation prompt, hands it to a separate diffusion model, and returns the resulting image. This is simple to implement and lets you use each generation model's latest capabilities directly, but because the two models are loosely coupled, maintaining consistency can be difficult.
Second, having the model itself generate non-text tokens. You set up a codebook that represents images as discrete tokens and let the LLM generate text tokens and image tokens together within one sequence. The image tokens generated this way pass through a decoder (for example a VQ-VAE decoder or a diffusion decoder) to be restored into actual pixels. Audio can apply the same principle after a neural codec turns the waveform into discrete tokens.
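The codebook idea in the second approach can be illustrated with nearest-neighbor vector quantization. The four-entry codebook and the latent vectors below are made up for the example, and the function names are hypothetical; a real system learns the codebook and follows the lookup with a neural decoder that produces pixels or a waveform.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to vector (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def quantize(vectors, codebook):
    """Continuous latents -> discrete token IDs the LLM can predict."""
    return [nearest_code(v, codebook) for v in vectors]


def dequantize(token_ids, codebook):
    """Discrete token IDs -> codebook vectors handed to a decoder."""
    return [codebook[i] for i in token_ids]


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patch_latents = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.95]]
image_tokens = quantize(patch_latents, codebook)      # [0, 3, 2]
reconstructed = dequantize(image_tokens, codebook)
```

Because the IDs are ordinary integers, they can share a vocabulary and a next-token objective with text tokens.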
A Closer Look at the any-to-any Flow
Let us sketch the ideal data flow of an any-to-any model a bit more concretely.
input (arbitrary modality mix)
text + image + audio
|
[per-modality encoding]
each modality to tokens
|
[interleaving]
"what instrument is this [audio]?"
mixing text and other-modality
tokens into one sequence
|
[LLM backbone processing]
cross-referencing via unified attention
|
[output routing]
if next token is text, to text;
if image token, to image decoder;
if audio token, to codec decoder
|
output (arbitrary modality mix)
text + image + audio
The concept of interleaving matters here. Early multimodal models handled simple pairs of one image with one caption, but real documents and conversations freely mix text and images. Think of a web page, with pictures inserted between paragraphs. Training on interleaved multimodal sequences lets a model handle this natural context.
Output routing is also an important design point, because the model must decide by itself at every step "which modality's token to emit next." A common approach is to place special boundary tokens (for instance, tokens marking the start and end of an image) to signal modality transitions.
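A minimal sketch of boundary-token routing, assuming the model emits a flat stream of string tokens. The token spellings `<image>`, `</image>`, `<audio>`, `</audio>` and the `route` function are hypothetical; actual models define their own special tokens.

```python
BOI, EOI = "<image>", "</image>"   # hypothetical image boundary tokens
BOA, EOA = "<audio>", "</audio>"   # hypothetical audio boundary tokens

OPENERS = {BOI: ("image", EOI), BOA: ("audio", EOA)}


def route(tokens):
    """Group a flat output token stream into (modality, tokens) segments."""
    segments = []
    modality, closer = "text", None
    current = []
    for tok in tokens:
        if closer is None and tok in OPENERS:
            # Entering a non-text span: flush any pending text first.
            if current:
                segments.append((modality, current))
            modality, closer = OPENERS[tok]
            current = []
        elif tok == closer:
            # Leaving the span: hand its tokens to that modality's decoder.
            segments.append((modality, current))
            modality, closer, current = "text", None, []
        else:
            current.append(tok)
    if current:
        segments.append((modality, current))
    return segments


stream = ["Here", "it", "is", BOI, "img_17", "img_402", EOI, "Enjoy"]
segments = route(stream)
```

Each `("image", ...)` segment would go to the image decoder and each `("audio", ...)` segment to the codec decoder, while text segments are detokenized as usual.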
Native Multimodal vs. Adapter Grafting
The philosophy of building a multimodal LLM splits broadly into two. This distinction is crucial for understanding the field.
Adapter Grafting (Late Fusion)
You take an already-completed, powerful text LLM and attach vision and audio encoders plus a projector to its front. Most of the LLM's weights are left as-is or only lightly adjusted, and mainly the projector and the encoder connection are trained.
The advantages are clear. It inherits the vast knowledge and language ability of the text LLM as-is, and training cost is relatively cheap. You can obtain usable visual understanding with comparatively little multimodal data. Open research families such as LLaVA, BLIP-2, and MiniGPT-4 achieved major results with this approach, and it was the driving force behind the multimodal boom in the open-source community.
The downside is that integration between modalities can be somewhat shallow. Because visual information enters the language model in a "translated" form, there may be limits to truly visually grounded reasoning.
Native Multimodal (Early Fusion)
This trains multiple modalities together from the start. From the pre-training stage, text, image, and audio data are mixed to train a single model. Because the boundaries between modalities blur from early in training, deeper and more natural integration is thought to be possible.
When several recent commercial frontier models are introduced as "designed to be multimodal from the ground up," that can be understood as aiming in this direction. That said, the exact internal structure of each model is often undisclosed, so the details are hard to state with certainty.
The advantages are deeper cross-modal reasoning, lower latency (especially in voice conversation), and smooth transitions between modalities. The disadvantages are enormous training cost and the difficulty of balancing the data.
Comparing the two in a table:
| Aspect | Adapter Grafting (Late Fusion) | Native Multimodal (Early Fusion) |
|---|---|---|
| Starting point | completed text LLM | multimodal pre-training from scratch |
| Training cost | relatively cheap | very large |
| Integration depth | can be shallow | deep |
| Data needed | relatively little | very large |
| Representative cases | open research families | recent frontier families |
| Strengths | fast to build, knowledge reuse | deep reasoning, low latency |
In practice, various middle grounds exist between these two extremes. A hybrid approach that starts from a text LLM but then goes through large-scale multimodal pre-training again is also common.
Training Strategy: From Alignment to Instruction
Training a multimodal LLM is generally divided into several stages. Let us look at a typical pipeline.
Stage 1: Alignment Pre-training
The goal of the first stage is to align the non-text encoder's output with the LLM's language space. Using large amounts of image-caption pairs (an image and a sentence describing it), the model learns to look at an image and generate a caption. At this stage the projector is mainly trained, while the encoder and the LLM body are frozen or adjusted only minimally.
Through this process the projector learns "how to move this image feature vector into an embedding the LLM can understand." In effect, you are training a translator between modalities.
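To make "projector trained, backbone frozen" concrete, here is a toy sketch in which both the backbone and the projector are single scalars. This is an assumption made purely for illustration and is nowhere near a real training loop; the point is only that gradients update the projector while the backbone weight never changes.

```python
def train_projector(pairs, backbone_w, projector_w, lr=0.05, steps=200):
    """Fit only projector_w by gradient descent; backbone_w is never updated."""
    for _ in range(steps):
        for x, y in pairs:
            pred = backbone_w * (projector_w * x)
            # d/d(projector_w) of the squared error (pred - y) ** 2
            grad_projector = 2.0 * (pred - y) * backbone_w * x
            projector_w -= lr * grad_projector
    return projector_w


backbone_w = 2.0                                # frozen "LLM" weight
pairs = [(1.0, 6.0), (2.0, 12.0), (0.5, 3.0)]   # toy targets with y = 6x
projector_w = train_projector(pairs, backbone_w, projector_w=0.0)
```

The projector converges to the value that makes the frozen backbone produce the right output (here 3.0, since 2.0 * 3.0 = 6.0), which mirrors the "translator" role described above.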
Stage 2: Instruction Tuning
Alignment alone may let the model caption well, but it will not follow the user's varied instructions. So the second stage tunes on multimodal instruction data. It learns diverse instruction-response pairs such as "what is the highest value in this graph," "express the mood of this photo as a poem," and "find the total amount in this document."
Only at this stage does the model take on the character of a conversational assistant. A well-known case is LLaVA, which carried out this stage with visual instruction data that a language-only GPT model synthesized from image captions and bounding-box annotations.
Stage 3: Preference Optimization (Optional)
As with text LLMs, an additional stage to align with human preferences is sometimes added. Techniques such as RLHF or DPO are extended to the multimodal setting so the model produces answers that are more helpful, safer, and less prone to hallucination. In multimodal settings in particular, visual hallucination, insisting that something absent from the image is present, is a problem, so optimization aimed at reducing it is important.
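For reference, the standard DPO loss for a single preference pair can be written as a small function. In a multimodal setting the "chosen" response might be a grounded answer and the "rejected" one a hallucinated answer to the same image and question; that pairing is an illustrative assumption, as are the function name and the default `beta`. The inputs are summed log-probabilities of each response under the policy being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Computes -log(sigmoid(beta * ((pc - rc) - (pr - rr)))) where pc/pr are the
    policy log-probs of the chosen/rejected responses and rc/rr the reference ones.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) = softplus(-z), written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Policy identical to the reference: the loss equals log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model.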
Summarizing the overall training flow:
[Stage 1] alignment pre-training
large-scale image-caption learning
projector-focused, backbone frozen
|
v
[Stage 2] instruction tuning
diverse instruction-response data
transforms into conversational assistant
|
v
[Stage 3] preference optimization (optional)
RLHF / DPO and the like
reduced hallucination, safety, helpfulness
Tokenization, Resolution, Efficiency
One of the most vexing practical issues in multimodal LLMs is that non-text inputs consume too many tokens.
Take images. Cutting an image into patches yields one token per patch, so doubling the side length of the image quadruples the token count. For example, a 336x336 image cut into 14x14 patches produces 24 x 24 = 576 tokens. Because the cost of Transformer self-attention scales with the square of the sequence length, more tokens quickly increase compute and memory pressure.
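The arithmetic is easy to check with a short sketch. The function names are hypothetical, and the cost model counts only query-key pairs in one attention head of one layer, ignoring everything else in the network.

```python
def image_token_count(height, width, patch_size):
    """Number of patch tokens for an image whose sides are multiples of patch_size."""
    return (height // patch_size) * (width // patch_size)


def attention_pairs(sequence_length):
    """Query-key pairs scored by full self-attention in one head of one layer."""
    return sequence_length ** 2


low_res = image_token_count(224, 224, 14)    # 16 * 16 = 256 tokens
high_res = image_token_count(448, 448, 14)   # 32 * 32 = 1024 tokens
cost_ratio = attention_pairs(high_res) / attention_pairs(low_res)  # 16.0
```

Doubling the resolution per side gives four times the tokens and sixteen times the attention pairs for the image portion of the sequence.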
Several techniques address this problem.
- Token compression with resamplers: summarizing an image into a fixed, small number of tokens, as with the Q-Former or Perceiver Resampler mentioned earlier.
- Dynamic resolution: tiling based on the image's aspect ratio and size, processing each tile, and combining them. Useful for high-resolution documents and tables.
- Token pooling/merging: merging adjacent similar tokens to reduce the count.
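As a minimal sketch of the pooling idea, the function below averages each 2x2 block of a token grid, cutting the token count to a quarter. Fixed-window averaging is used here as a simpler stand-in for similarity-based merging, and `pool_tokens_2x2` is a hypothetical name.

```python
def pool_tokens_2x2(grid):
    """Average each 2x2 block of token vectors in a rows x cols grid."""
    rows, cols = len(grid), len(grid[0])
    if rows % 2 or cols % 2:
        raise ValueError("grid sides must be even")
    dim = len(grid[0][0])
    pooled = []
    for r in range(0, rows, 2):
        pooled_row = []
        for c in range(0, cols, 2):
            block = [grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1]]
            pooled_row.append([sum(v[d] for v in block) / 4.0 for d in range(dim)])
        pooled.append(pooled_row)
    return pooled


# A 4x4 grid of 1-dimensional tokens with values 0..15.
token_grid = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
pooled_grid = pool_tokens_2x2(token_grid)  # 16 tokens -> 4 tokens
```

The trade-off is visible even in this toy: fine detail inside each 2x2 block is averaged away, which is why aggressive pooling can hurt tasks such as reading small text.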
Audio and video make this worse. Video multiplies image tokens by the number of frames, so even a clip of a few seconds can explode the token count. Compromises such as adjusting the frame sampling interval or merging tokens along the time axis are therefore essential.
Finding this balance between efficiency and performance is one of the core challenges in multimodal LLM design.
Representative Families, Conceptually
Now let us look at representative model families with a focus on concepts. Again, the detailed specs and rankings of commercial models vary greatly by time and version, so here I focus on architectural ideas and widely known characteristics.
Open Research Families: CLIP, Flamingo, BLIP-2, LLaVA
This lineage laid the conceptual foundation of multimodal LLMs.
- CLIP: a model that aligns images and text into the same space through contrastive learning. It was later reused as the vision encoder in countless multimodal models.
- Flamingo: an early landmark that inserted cross-attention layers between a pre-trained vision encoder and a language model, and demonstrated few-shot ability on interleaved image-text.
- BLIP-2: proposed connecting a frozen image encoder and an LLM efficiently through a lightweight bridge module called the Q-Former.
- LLaVA: set an open-source multimodal standard with a concise recipe of connecting a CLIP vision encoder and a language model via a simple projector and tuning on synthetic visual instruction data.
Commercial Frontier Families (Conceptual)
The following names are widely referenced as recent frontier families "aiming for native multimodality." Concrete performance rankings and internal structures are largely unconfirmed officially, so I cautiously summarize only conceptual characteristics.
- GPT-4o class: known to handle text, image, and audio within one model and to emphasize near-real-time voice conversation. The "o" in the name was introduced as meaning "omni," spanning multiple modalities.
- Gemini class: a family introduced as designed to be multimodal from the start, known for handling long context and multiple modality inputs together.
- Qwen-VL class: an open-weights family emphasizing vision-language ability, widely used for document understanding, OCR, and precise grounding. Being open-weights, it is highly accessible for both research and practice.
Beyond these, many open and commercial families exist, each with different strong areas. You should always keep in mind that which one is "best" depends on the task, the benchmark, and the point in time.
Lineage: The Flow of Progress
Roughly summarizing the flow of progress in multimodal LLMs gives the following story.
[contrastive alignment]
CLIP family: joint image-text embedding
|
v
[encoder + LLM grafting]
Flamingo, BLIP-2: connected via bridge modules
|
v
[concise instruction recipe]
LLaVA family: projector + visual instruction
|
v
[aiming for native multimodal]
frontier families: multimodal by design
|
v
[any-to-any expansion]
image/audio generation on output too
arbitrary modality I/O in a unified token space
The broad direction of this flow is from "loose grafting" to "deep integration," and from "understanding-centric" to "unified understanding and generation." Early on the focus was on assembling existing parts; over time it moved toward training one model on several modalities from the start, and further toward inputting and outputting arbitrary modalities.
A Closer Look at Cross-Modal Attention
When we say a multimodal LLM actually "thinks by looking at an image," what happens inside is, in the end, attention. Looking a bit closer here reveals why the unified token space idea is so powerful.
Each layer of a decoder-only Transformer has self-attention. Attention is an operation in which each token forms a "query," compares it against other tokens' "keys," and pulls in more of the "value" of tokens with high similarity. When handling text only, these queries, keys, and values all come from words.
In the multimodal case, because image patch tokens and text tokens sit in the same sequence, a text token's query can reference an image token's key. That is, the text tokens for "red umbrella" come to attend to the patch tokens corresponding to the red region in the image. As this cross-referencing repeats across many layers, the meanings of text and image become increasingly entangled.
   question tokens             image patch tokens
[what] [holding] [is]    [patch1] [patch2] ... [patchN]
   |       |      |          |        |           |
   +-------+------+----------+--------+-----------+
                         |
                self-attention layer
          each token references all tokens
          text attends to relevant patches
                         |
                   to next layer
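The attention step itself can be sketched with scaled dot-product scores and a causal mask, as used in decoder-only models. The vectors are made up, the same vectors serve as both queries and keys for brevity (a real layer applies separate learned projections), and the sketch assumes the image tokens precede the text token in the sequence.

```python
import math


def causal_attention_weights(queries, keys):
    """Row i holds the softmax attention weights of position i over positions 0..i."""
    scale = math.sqrt(len(keys[0]))
    all_weights = []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, keys[j])) / scale for j in range(i + 1)]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        # Later positions get weight 0, which is what a causal mask enforces.
        all_weights.append([e / total for e in exps] + [0.0] * (len(keys) - i - 1))
    return all_weights


# Positions 0 and 1 stand for two image patches; position 2 is a text token
# whose vector happens to point in the same direction as patch 0.
sequence = [[3.0, 0.0], [0.0, 3.0], [2.0, 0.1]]
weights = causal_attention_weights(sequence, sequence)
text_row = weights[2]  # how the text token distributes attention
```

In `text_row` the largest weight falls on patch 0, the patch whose key best matches the text token's query, which is the mechanism behind "text attends to relevant patches."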
Position information plays an important role here. Text has a one-dimensional order, but image patches sit on a two-dimensional grid. So image tokens need position information that reflects this grid, for example learned per-patch position embeddings or an explicit two-dimensional positional encoding, so the model can know "where in the image this patch is." Only when position is conveyed well can the model answer spatial questions like "top left" or "bottom center."
Two Branches of Adapter Grafting: Prefix and Cross-Attention
Within adapter grafting, the method of feeding encoder information into the LLM again splits into two.
The first is the prefix method. Image tokens are simply inserted before or between text tokens to form one long sequence. The LLaVA family uses this; it is simple to implement and lets you use the LLM body almost as-is. The downside is that as image tokens grow, the sequence lengthens and the compute burden grows.
The second is the cross-attention insertion method. Separate cross-attention layers are placed between LLM layers so that text tokens reference image features, without including the image tokens themselves in the main sequence. Flamingo is the representative of this method. It can inject visual information without lengthening the sequence, an advantage for long context, but it requires modifying the LLM structure, making implementation complex.
Comparing the two in a table:
| Aspect | Prefix method | Cross-attention insertion |
|---|---|---|
| Image token position | inserted in main sequence | referenced via separate attention |
| Sequence length impact | grows | almost none |
| Implementation difficulty | simple | complex |
| Representative family | LLaVA | Flamingo |
| LLM change | minimal | added layers required |
Audio and Speech: Two Branches
The audio modality is usefully split into two kinds: speech (spoken voice) and non-speech audio (music, ambient sound).
Speech understanding is deeply tied to automatic speech recognition (ASR). Turning the waveform into a spectrogram, passing it through an encoder, and connecting that representation to the LLM gives the ability to hear and understand speech. The Whisper-family encoder is widely reused for this. Conversely, to produce speech on the output side, you need neural TTS that turns text into speech, or a codec language model that handles audio as discrete tokens (the VALL-E concept).
Non-speech audio such as music or ambient sound is naturally handled by using a neural codec to compress the waveform into discrete tokens, then letting the LLM handle those tokens. Neural codecs like EnCodec or SoundStream form the basis of such discrete audio tokens. The MusicGen family showed the direction of generating such audio tokens autoregressively to make music.
In the recent trend that emphasizes real-time voice conversation, instead of going through the multiple stages of converting speech to text and text back to speech, there are ongoing attempts to handle speech directly as tokens to reduce latency. This is the reason native multimodality is said to have an advantage in voice conversation.
Video: The Challenge of the Time Axis
Video is one of the trickiest modalities in multimodal LLMs, because one more axis, time, is added to images.
The simplest approach is to view video as a sequence of frames, sample frames at fixed intervals, and encode each like an image. To this you add temporal position information telling the model which time step each frame is. The problem is the token explosion mentioned earlier. Feed several frames per second directly and even a clip of a few seconds swells to thousands of tokens.
So in practice several compromises are used. You widen the frame sampling interval to reduce frame count, merge tokens of adjacent frames, or apply spatiotemporal pooling that compresses information along the time axis. For long videos, an approach that divides by scene and stacks summaries hierarchically is also studied.
[original video]
many frames
|
frame sampling
select only representative frames
|
per-frame encoding
reusing the image encoder
|
temporal position + token merging
compress sequence length
|
pass to the LLM
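The sampling and merging steps in this flow can be sketched as follows. Both helpers are hypothetical: uniform midpoint sampling and pairwise averaging are simple choices made for illustration, and real systems may use scene detection or learned merging instead.

```python
def sample_frame_indices(total_frames, num_samples):
    """Pick num_samples frame indices spread evenly across the clip (segment midpoints)."""
    if not 0 < num_samples <= total_frames:
        raise ValueError("num_samples must be between 1 and total_frames")
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


def merge_adjacent_frames(frame_tokens):
    """Average the token vectors of consecutive frame pairs, halving the frame count.

    frame_tokens[f][t] is the vector of token t in frame f; an odd last frame is kept as is.
    """
    merged = []
    for f in range(0, len(frame_tokens) - 1, 2):
        first, second = frame_tokens[f], frame_tokens[f + 1]
        merged.append([
            [(a + b) / 2.0 for a, b in zip(tok_a, tok_b)]
            for tok_a, tok_b in zip(first, second)
        ])
    if len(frame_tokens) % 2:
        merged.append(frame_tokens[-1])
    return merged


indices = sample_frame_indices(total_frames=300, num_samples=4)  # [37, 112, 187, 262]

# Three toy frames with two 1-dimensional tokens each.
toy_frames = [[[0.0], [2.0]], [[2.0], [4.0]], [[10.0], [10.0]]]
merged_frames = merge_adjacent_frames(toy_frames)
```

Sampling 4 of 300 frames and then merging pairs reduces the frame count by a factor of 150 before any per-frame token compression is applied, at the price of missing anything that happens between sampled frames.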
Because of these difficulties, video understanding is still developing actively and remains less mature than image understanding.
Benchmarks: What Is Measured and How
Benchmarks for measuring multimodal LLM performance vary by task. Let us organize the representative axes.
- Visual question answering (VQA): the ability to answer questions about an image, ranging from general-knowledge to fine-grained perception.
- Document, chart, and table understanding: the ability to read information from screenshots, scanned documents, and graphs. It requires both OCR and structural understanding.
- Visual reasoning: the ability to reason logically over multiple images or complex scenes.
- Grounding: the ability to precisely pinpoint where in the image a text-referenced target is.
- Audio and video understanding: the ability to grasp the content of sound or video and answer questions.
A few cautions are needed when interpreting benchmark scores. First, because each benchmark measures a different ability, you cannot decide a model's superiority with a single number. Second, because of possible data contamination (the model having already seen the benchmark problems during training), scores can overestimate actual generalization. Third, rankings shift quickly whenever a new model appears. So rather than accepting a leaderboard ranking at a given moment as absolute truth, it is better to use it to understand trends and areas of strength.
Limitations and Open Problems
Multimodal LLMs have made impressive progress, but they still carry several limitations.
First, visual hallucination. The model may claim an object absent from the image is present, or misread details. Errors stand out especially with small text, complex tables, and subtle spatial relationships.
Second, limits of precise perception. Counting, exact position judgment, and distinguishing subtle color or texture, all easy for humans, are often still hard for the model.
Third, the efficiency problem. As covered above, high-resolution images or long videos cause token explosion and greatly raise compute cost. In real-time applications this latency becomes an obstacle.
Fourth, modality imbalance. Because most training data concentrates on image-text, audio and video understanding are often relatively less mature. Any-to-any generation quality also still has much room for improvement compared with text understanding.
Fifth, the difficulty of evaluation. Fairly and automatically evaluating the quality of generated images or audio and cross-modal consistency remains an unsolved problem.
Practical Implications
Finally, let us organize what this flow implies for practice.
When adopting a multimodal LLM, first clarify the nature of the task. If you only need to understand images and answer in text, a lightweight model of the adapter-grafting family is often sufficient. On the other hand, if you also need real-time voice conversation or image generation, you need a native-multimodal family or a combination of tools.
Efficiency matters too. If you must process large volumes of high-resolution documents, choosing a model that supports dynamic resolution or token compression directly affects cost. Open-weights families are advantageous when you need on-premises deployment and fine-grained customization.
Hallucination management cannot be left out either. Especially in accuracy-critical applications like document information extraction, you must put safeguards in place to verify model output, because the model can be confidently wrong.
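One inexpensive safeguard is a consistency check on the extracted fields. The sketch below assumes the model returns an invoice as a dict with the hypothetical keys `line_items` (a list of amount strings) and `total`; it illustrates the pattern and does not replace human review or cross-checking against the source document.

```python
from decimal import Decimal, InvalidOperation


def verify_invoice(extracted):
    """Return a list of problems found in a model-extracted invoice dict.

    An empty list means these cheap consistency checks passed, not that the
    extraction is guaranteed to be correct.
    """
    problems = []
    try:
        items = [Decimal(a) for a in extracted["line_items"]]
        total = Decimal(extracted["total"])
    except (KeyError, InvalidOperation, TypeError) as exc:
        return [f"unparseable field: {exc!r}"]
    if any(a < 0 for a in items):
        problems.append("negative line item")
    items_sum = sum(items, Decimal("0"))
    if items_sum != total:
        problems.append(f"line items sum to {items_sum}, total says {total}")
    return problems


consistent = verify_invoice({"line_items": ["10.50", "4.50"], "total": "15.00"})
mismatch = verify_invoice({"line_items": ["10.50", "4.50"], "total": "16.00"})
```

Outputs that fail such checks can be routed to a retry or to manual review instead of flowing silently into downstream systems.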
Closing
Multimodal LLMs sit at the center of a great shift from "a language model that handles only text" to "a model that integrates multiple senses." Their core principle is surprisingly simple: move every modality into a common token space and process them together with an already-powerful Transformer.
On top of this simple idea, encoders, projectors, the unified token space, any-to-any routing, and a training pipeline running from alignment to instruction to preference optimization have stacked up to produce today's results. The flow from adapter grafting to native multimodal, from understanding to generation, and from single modality to arbitrary modality will continue going forward.
Still, open problems remain, such as visual hallucination, precise perception, efficiency, and modality imbalance, and the field changes very fast. So understanding the underlying architectural principles, rather than specific rankings or numbers, is what builds lasting judgment in such a fast-moving field.
References
- Attention Is All You Need (Transformer): https://arxiv.org/abs/1706.03762
- Learning Transferable Visual Models From Natural Language Supervision (CLIP): https://arxiv.org/abs/2103.00020
- Flamingo: a Visual Language Model for Few-Shot Learning: https://arxiv.org/abs/2204.14198
- BLIP-2: https://arxiv.org/abs/2301.12597
- Visual Instruction Tuning (LLaVA): https://arxiv.org/abs/2304.08485
- Robust Speech Recognition via Large-Scale Weak Supervision (Whisper): https://arxiv.org/abs/2212.04356
- An Image is Worth 16x16 Words (ViT): https://arxiv.org/abs/2010.11929
- Qwen-VL: https://arxiv.org/abs/2308.12966
- Hugging Face Transformers documentation: https://huggingface.co/docs/transformers
- OpenAI official blog: https://openai.com/blog