Multimodal Tokenization and Fusion — Turning Images and Audio Into Tokens
Author: Youngju Kim (@fjvbn20031)
- Introduction: Tokens Are the Gateway to Everything
- Image Tokenization: Patches and VQ
- Audio Tokenization: Discrete Codecs
- Video Frame Sampling
- Building a Unified Sequence: Interleaving
- Token Explosion and Compression
- Context Cost
- Dynamic Resolution and Arbitrary Aspect Ratios
- Multidimensional Position Encoding in Detail
- Token Budget Planning
- Comparison of Token Density per Modality
- How a VQ Codebook Actually Works
- Trade-offs in Adapter Design
- Interleaving Pitfalls: Order and Alignment
- Caching and Reusing Tokenization Results
- Comparison: Tokenization and Compression
- Pitfalls and Troubleshooting
- Conclusion
- References
Introduction: Tokens Are the Gateway to Everything
LLMs eat tokens and emit tokens. With text only, the tokenizer just splits a string into subword tokens and you are done. But to build a model that understands images, hears speech, or watches video, one question arises: how do we turn an image, audio, or video into tokens and put them in the same sequence?
That question is the heart of multimodal LLM design. The backbone of a multimodal model is usually a Transformer, and a Transformer takes a sequence of tokens (embedding vectors) as input. So every modality must ultimately become a "sequence of tokens." How you convert determines the model's capability, cost, and accuracy.
This article covers how to turn images, audio, and video into tokens, how to interleave them with text into one sequence, and how to compress when the token count explodes. If the previous article covered the principles of alignment and fusion, this one is about how that fusion is implemented at the actual input level.
Image Tokenization: Patches and VQ
There are two broad routes to turn an image into tokens.
Patch Embedding
The Vision Transformer (ViT) is the representative example. Split the image into a grid of fixed-size patches (e.g., 14x14 pixels) and linearly project each patch into a vector. Each such vector becomes a "visual token." Patch tokens are continuous embeddings; there is no codebook.
Patch tokenization (concept)
image H x W x 3
-> split into patches (p x p)
-> patch count = (H/p) x (W/p)
-> each patch -> linear projection -> vector (dim D)
-> visual token sequence (N_patch x D)
e.g., 224x224 image, p=14 -> 16 x 16 = 256 tokens
448x448 image, p=14 -> 32 x 32 = 1024 tokens
As resolution rises, the token count grows quadratically with the side length: doubling both width and height quadruples the number of tokens. That is the primary driver of the token explosion we discuss later.
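The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function name `patch_token_count` is an illustrative choice, and it assumes the image sides are exact multiples of the patch size (real preprocessors usually resize or pad first).

```python
def patch_token_count(height: int, width: int, patch: int = 14) -> int:
    """Number of visual tokens when an image is cut into patch x patch tiles."""
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return (height // patch) * (width // patch)


for side in (224, 448, 896):
    # Doubling the side length quadruples the token count.
    print(side, patch_token_count(side, side))
# 224 256
# 448 1024
# 896 4096
```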
VQ (Vector Quantization)
The VQ-VAE/VQGAN family quantizes an image into discrete codes. An encoder turns the image into a latent grid, and each position vector is replaced by the nearest code in a codebook. The image then becomes a "grid of discrete token IDs," represented as integer indices like text tokens.
VQ image tokenization (concept)
image -> encoder -> latent grid (h x w x D)
quantize each position vector to nearest codebook code
-> code index grid (h x w), each value in {0..K-1}
-> flatten -> discrete token sequence
Pro: unified discrete vocabulary with text -> good for generation
Con: quantization loss, risk of codebook collapse
The patch route is common in understanding-oriented VLMs; the VQ route is common in unified generative models (including image generation). Designs that mix both also exist.
Audio Tokenization: Discrete Codecs
Audio is a continuous signal along time. To make it tokens, slice time into frames and quantize each frame into discrete units. A neural audio codec does this.
The codec encoder compresses the waveform into a latent representation at a fixed frame rate, and residual vector quantization (RVQ) discretizes it across multiple codebooks, with each codebook quantizing the residual left by the previous one. The result is a token stream of several codes per time frame.
Discrete audio codec (concept)
waveform (high sample rate) -> codec encoder -> frame representation (low frame rate)
multi-layer quantize each frame with RVQ
-> Q codes per frame (hierarchical)
-> [t=1: c1..cQ][t=2: c1..cQ]... discrete audio token stream
Note: frame rate and number of codebooks set the bitrate/quality
tokens per second can far exceed text
A key trap is the density of audio tokens. One second of speech easily becomes tens to hundreds of tokens, so long audio drains context quickly. Choose frame rate and quantization depth carefully.
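To get a feel for that density, the token count can be estimated from the frame rate and the number of codebooks. The sketch below assumes that all Q codes of every frame are flattened into the sequence; some models instead process codebooks in parallel, which changes the count. The 50 Hz and 8-codebook values are illustrative, not taken from any specific codec.

```python
def audio_tokens(duration_s: float, frame_rate_hz: float, num_codebooks: int) -> int:
    """Tokens produced when every RVQ code of every frame is flattened into the sequence."""
    frames = int(duration_s * frame_rate_hz)
    return frames * num_codebooks


# Illustrative configuration: 50 frames per second, 8 codebooks per frame.
print(audio_tokens(1, 50, 8))    # 400 tokens for one second
print(audio_tokens(600, 50, 8))  # 240000 tokens for ten minutes
```

Lowering either the frame rate or the number of codebooks that reach the LLM reduces the count proportionally, at the cost of audio fidelity.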
Video Frame Sampling
Video is a sequence of images, so its token count explodes most easily. A 30fps clip of 10 seconds is 300 frames, and at hundreds of tokens per frame, that is tens of thousands of tokens. Frame sampling is therefore essential.
Video tokenization strategy (concept)
source: 30fps x duration
-> frame sampling (e.g., downsample to 1~2 fps)
-> prefer keyframes/scene changes (optional)
-> patch-tokenize each frame
-> insert temporal separator tokens (frame boundaries)
-> spatiotemporal token sequence
Extra compression: merge/pool adjacent frame tokens to remove redundancy
Sampling is a trade-off between information loss and cost. To capture fast motion you must raise the frame rate, which increases tokens and cost.
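As a rough illustration of uniform frame sampling, the sketch below picks evenly spaced frame indices and compares token counts before and after. It assumes 256 tokens per frame (the 224x224, p=14 case from earlier) and ignores keyframe detection and separator tokens; `sample_frame_indices` is a hypothetical helper name.

```python
def sample_frame_indices(duration_s: float, source_fps: float, target_fps: float) -> list[int]:
    """Evenly spaced source-frame indices that approximate target_fps."""
    if not 0 < target_fps <= source_fps:
        raise ValueError("target_fps must be in (0, source_fps]")
    step = source_fps / target_fps
    n_samples = int(duration_s * target_fps)
    return [int(i * step) for i in range(n_samples)]


TOKENS_PER_FRAME = 256  # assumed: 224x224 frame with 14x14 patches

all_frames = int(10 * 30)                    # 10 s clip at 30 fps
sampled = sample_frame_indices(10, 30, 1)    # downsample to 1 fps
print(all_frames * TOKENS_PER_FRAME)         # 76800 tokens without sampling
print(len(sampled) * TOKENS_PER_FRAME)       # 2560 tokens after sampling
```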
Building a Unified Sequence: Interleaving
Once each modality is tokenized, you must weave them into one sequence. The common approach is interleaving: inserting each modality's tokens (visual, audio) at placeholder positions in the text.
Interleaved input layout (concept)
user input:
"Describe the following image [IMG] and also this sound [AUDIO]"
expanded into a token sequence:
[txt: Describe the following image]
[IMG_START][v1 v2 ... v256][IMG_END]
[txt: and also this sound]
[AUD_START][a1 a2 ... aM][AUD_END]
The LLM attends over this single sequence as a whole.
The crucial point is that one placeholder slot actually expands into hundreds of tokens. A placeholder is one or two tokens in the text prompt, but once expanded it takes a large share of the context.
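The expansion can be sketched as follows. Whitespace splitting stands in for a real text tokenizer, the string labels `v1`, `a1`, and so on stand in for embedding vectors, and the token counts (256 image, 50 audio) are illustrative assumptions.

```python
def expand_placeholders(prompt: str, n_image_tokens: int, n_audio_tokens: int) -> list[str]:
    """Replace [IMG] / [AUDIO] placeholders with delimited runs of modality tokens."""
    sequence: list[str] = []
    for piece in prompt.split():  # stand-in for a real subword tokenizer
        if piece == "[IMG]":
            sequence.append("[IMG_START]")
            sequence.extend(f"v{i}" for i in range(1, n_image_tokens + 1))
            sequence.append("[IMG_END]")
        elif piece == "[AUDIO]":
            sequence.append("[AUD_START]")
            sequence.extend(f"a{i}" for i in range(1, n_audio_tokens + 1))
            sequence.append("[AUD_END]")
        else:
            sequence.append(piece)
    return sequence


prompt = "Describe the following image [IMG] and also this sound [AUDIO]"
expanded = expand_placeholders(prompt, n_image_tokens=256, n_audio_tokens=50)
print(len(prompt.split()), "->", len(expanded))  # 10 -> 318
```

Two placeholders account for 310 of the 318 positions in the expanded sequence, which is why budget calculations must use the expanded length.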
Modality Embeddings and Separator Tokens
It helps if the model knows which modality a token came from. Two devices serve this.
- Separator (special) tokens: special tokens marking image start/end and audio start/end to make boundaries explicit.
- Modality embeddings: a learned embedding indicating the modality type is added to the token embedding, distinguishing the source even at the same position.
Position information matters too. Images need 2D spatial positions; video needs time-plus-space positions. To handle arbitrary resolutions, recent VLMs encode time, height, and width coordinates together with a multidimensional rotary position scheme (e.g., M-RoPE style).
Position encoding dimensions (concept)
text: 1D position (order)
image: 2D position (row, column)
video: 3D position (time, row, column)
Multidimensional RoPE: split each axis and encode as rotary positions
-> flexible across arbitrary resolution/length
Token Explosion and Compression
The most practical problem in multimodal models is token count. Transformer attention cost grows quadratically with sequence length, so many visual tokens spike compute and memory. Compression is therefore a core technique.
Token Pruning
Drop low-importance tokens. Estimate a token's informativeness from attention scores or a learned score, and remove less important patch tokens such as background. Reducing gradually across layers is common.
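A minimal sketch of score-based pruning is shown here. It assumes an importance score per token is already available (for example from attention weights or a learned scorer, neither of which is modeled here) and keeps the top fraction while preserving the original order.

```python
def prune_tokens(tokens: list, scores: list[float], keep_ratio: float) -> list:
    """Keep the highest-scoring tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    k = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])  # restore positional order
    return [tokens[i] for i in kept]


tokens = ["sky", "sky", "cat", "grass"]
scores = [0.10, 0.05, 0.90, 0.40]  # illustrative importance scores
print(prune_tokens(tokens, scores, keep_ratio=0.5))  # ['cat', 'grass']
```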
Token Merging
Combine similar tokens into one. Merging spatially adjacent or embedding-similar tokens by averaging or a weighted sum reduces the count while retaining more information than dropping tokens outright.
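The idea can be illustrated with a deliberately simple greedy pass that averages runs of adjacent tokens whose cosine similarity exceeds a threshold. This is a sketch only: it is not the bipartite matching used by published token-merging methods, and the 0.9 threshold is an arbitrary assumption.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_adjacent(tokens: list[list[float]], threshold: float = 0.9) -> list[list[float]]:
    """Greedily average each token into the previous group when they are similar enough."""
    groups: list[tuple[list[float], int]] = []  # (running sum, member count)
    for vec in tokens:
        if groups:
            total, count = groups[-1]
            mean = [x / count for x in total]
            if cosine(mean, vec) >= threshold:
                groups[-1] = ([x + y for x, y in zip(total, vec)], count + 1)
                continue
        groups.append((list(vec), 1))
    return [[x / count for x in total] for total, count in groups]


tokens = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0]]
merged = merge_adjacent(tokens)
print(len(tokens), "->", len(merged))  # 4 -> 2
```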
Q-Former-Style Compression
A small set of learnable query vectors cross-attends to the visual features, which serve as keys and values, so that the information is condensed into the queries. Regardless of the input patch count, only a fixed, small number of tokens is passed to the LLM.
Q-Former-style compression (concept)
visual patch features (N_patch x D) <- many (e.g., 1024)
|
learnable query (Q x D) <- few (e.g., 32~64)
|
cross-attention: queries attend to patches, condensing info
|
compressed visual tokens (Q x D) -> LLM input
Effect: visual tokens the LLM sees drop from 1024 -> 64
Caution: over-compression loses detail (small text, OCR)
Compression is not a cure-all. For detail-heavy tasks like OCR or reading fine charts, over-compression hurts performance. Tune compression strength to the task.
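The core mechanism, a fixed number of queries attending over a variable number of patches, can be sketched in plain Python. This is a single cross-attention step without learned projections, multiple heads, or stacked layers, and random vectors stand in for trained queries and real patch features, so it only demonstrates the shape change.

```python
import math
import random


def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: list[list[float]], patches: list[list[float]]) -> list[list[float]]:
    """Each query produces one output: an attention-weighted average of the patches."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, p)) / math.sqrt(dim) for p in patches]
        weights = softmax(scores)
        outputs.append([sum(w * p[j] for w, p in zip(weights, patches)) for j in range(dim)])
    return outputs


random.seed(0)
DIM = 8  # tiny dimension to keep the example fast
patches = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(1024)]
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(64)]

compressed = cross_attention(queries, patches)
print(len(patches), "->", len(compressed))  # 1024 -> 64
```

The output length depends only on the number of queries, which is what makes the token count seen by the LLM predictable.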
Context Cost
Token count is cost. Watch three aspects.
- Compute cost: attention is quadratic in sequence length, FFN linear. Many visual tokens greatly increase prefill-stage compute.
- Memory cost: KV cache scales with token count. Long multimodal inputs eat KV cache memory quickly.
- Latency/billing: API billing and latency track token count directly. Casually feeding high-resolution images spikes cost.
How token count affects cost (concept)
visual tokens N_v increase ->
- prefill FLOPs: attention O((N_v + N_t)^2) term grows
- KV cache memory: O(N_v + N_t) grows
- TTFT (time to first token): longer prefill -> grows
Mitigations: resolution caps, dynamic resolution, compression (pruning/merging/Q-Former)
In practice, design input resolution policy (caps, dynamic adjustment), compression strength, and caching together. At equal accuracy, fewer tokens is almost always better.
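A back-of-the-envelope estimate makes these costs concrete. The sketch below uses the standard KV cache size formula (keys plus values, per layer and KV head) and the quadratic attention term from above. The model shape (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) is a hypothetical configuration, not a specific model.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size: one key and one value vector per token, layer, and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


def relative_attention_cost(n_visual: int, n_text: int) -> float:
    """Ratio of the (N_v + N_t)^2 attention term to the text-only N_t^2 term."""
    return ((n_visual + n_text) / n_text) ** 2


cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)  # hypothetical model shape

print(kv_cache_bytes(1, **cfg))                   # 131072 bytes (128 KiB) per token
print(kv_cache_bytes(1024 + 200, **cfg) / 2**20)  # 153.0 MiB for 1024 visual + 200 text tokens
print(relative_attention_cost(1024, 200))         # roughly 37x the text-only attention term
```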
Dynamic Resolution and Arbitrary Aspect Ratios
Real images are not square. Wide panoramas, tall document scans, tiny icons — aspect ratios and sizes vary. Force-resizing to a fixed resolution (e.g., 224x224) breaks the aspect ratio and distorts information. Text-dense images such as documents or charts especially suffer, with characters smeared by force-resizing.
Recent VLMs support arbitrary input resolution (sometimes called "naive dynamic resolution"). They build the patch grid dynamically while preserving the original aspect ratio, so the visual token count varies per image. For this to work, the position encoding must also handle arbitrary grids.
Fixed vs dynamic resolution (concept)
[fixed] force-resize every image to 224x224
-> aspect distortion, small text smeared
-> token count always identical (easy to predict)
[dynamic] preserve original aspect ratio, variable patch grid
-> less distortion, good for documents/charts
-> token count varies per image (hard to predict)
Compromise: set min/max token caps and adjust dynamically within them
Dynamic resolution helps quality but, with larger token-count variation, complicates memory planning and batching. So one usually sets min/max token caps and only adjusts dynamically within that range.
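One possible way to implement such caps is to compute the natural patch grid and then rescale both sides by the same factor until the token count fits. The sketch below is an assumption-laden illustration: the cap values, the rounding choices, and the decision to upscale very small images are policy choices that differ between models.

```python
import math


def dynamic_grid(height: int, width: int, patch: int = 14,
                 min_tokens: int = 64, max_tokens: int = 1024) -> tuple[int, int]:
    """Patch grid (rows, cols) that roughly keeps the aspect ratio within token caps."""
    rows = max(1, round(height / patch))
    cols = max(1, round(width / patch))
    tokens = rows * cols
    if tokens > max_tokens:
        scale = math.sqrt(max_tokens / tokens)  # same factor on both axes
        rows = min(max(1, math.floor(rows * scale)), max_tokens)
        cols = max(1, min(math.floor(cols * scale), max_tokens // rows))
    elif tokens < min_tokens:
        scale = math.sqrt(min_tokens / tokens)
        rows = math.ceil(rows * scale)
        cols = math.ceil(cols * scale)
    return rows, cols


for h, w in [(224, 224), (1080, 1920), (28, 28)]:
    rows, cols = dynamic_grid(h, w)
    print((h, w), "->", (rows, cols), rows * cols, "tokens")
```

A 224x224 image keeps its natural 16x16 grid, a 1080x1920 frame is scaled down to fit under the cap while staying wide, and a 28x28 icon is scaled up to the minimum.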
Multidimensional Position Encoding in Detail
Position encoding needs different dimensionality per modality. Text needs 1D, images 2D, video 3D positions. Using a single 1D position cannot properly express an image's spatial structure or a video's temporal structure.
Extending rotary position encoding (RoPE) to multiple dimensions solves this. Split the embedding dimension into several groups and encode the position of a different axis (time/height/width) into each group as rotation. Then one token can carry temporal and spatial coordinates at once.
Multidimensional RoPE split (concept)
split embedding dim D into three sections:
[ time-axis rotation | height-axis rotation | width-axis rotation ]
D_t D_h D_w
encode each token's (t, h, w) coords as rotation in the matching section
-> handle text (in M-RoPE-style schemes, the same index on every axis, i.e. plain 1D), image (h,w), video (t,h,w) in one unified frame
-> flexible extrapolation to arbitrary resolution/length
The advantage is relatively good generalization to resolutions or lengths unseen during training. But if training and inference distributions differ too much, extrapolation can break, so exposing diverse resolutions during training is wise.
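To make the coordinate assignment concrete, the sketch below builds (t, h, w) position IDs for a mixed text and image sequence. It follows one convention used by M-RoPE-style schemes as an assumption: text tokens carry the same index on all three axes, image tokens share one temporal index while height and width vary, and the text after an image resumes from the largest index used so far. Details differ between models, and the rotation itself is not shown.

```python
def position_ids(segments: list[tuple]) -> list[tuple[int, int, int]]:
    """(t, h, w) IDs for segments like ("text", n_tokens) or ("image", rows, cols)."""
    ids: list[tuple[int, int, int]] = []
    next_pos = 0
    for segment in segments:
        if segment[0] == "text":
            for _ in range(segment[1]):
                ids.append((next_pos, next_pos, next_pos))  # behaves like 1D position
                next_pos += 1
        elif segment[0] == "image":
            _, rows, cols = segment
            for r in range(rows):
                for c in range(cols):
                    # constant temporal index; spatial indices offset from the start
                    ids.append((next_pos, next_pos + r, next_pos + c))
            next_pos += max(rows, cols)
        else:
            raise ValueError(f"unknown segment type: {segment[0]}")
    return ids


ids = position_ids([("text", 2), ("image", 2, 3), ("text", 1)])
for pid in ids:
    print(pid)
```

With this layout, the two text tokens get (0, 0, 0) and (1, 1, 1), the six image tokens share t = 2 with h in 2..3 and w in 2..4, and the trailing text token gets (5, 5, 5).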
Token Budget Planning
When handling multimodal input, plan the token budget explicitly. Unlike counting text only, visual tokens expand and drain the budget quickly.
Token budget calculation (concept)
total context limit = C (e.g., 32768)
input tokens = text tokens + visual tokens (after expansion) + special tokens
output headroom = C - input tokens
Checks:
- did you count visual tokens as the "expanded" value, not the placeholder?
- did you sum across multiple images?
- did you leave headroom for output length?
If over: lower resolution, strengthen compression, limit image count
Many production bugs come from mistaking the placeholder for one token and miscalculating the budget. Always compute with the expanded length.
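The budget formula above translates directly into code. In this sketch the per-image token counts and the assumption of two special tokens per image (start and end) are illustrative; substitute the values your model actually produces.

```python
def output_headroom(context_limit: int, text_tokens: int, image_token_counts: list[int],
                    special_tokens_per_image: int = 2) -> int:
    """Tokens left for output after counting every image at its expanded length."""
    visual = sum(image_token_counts)
    special = special_tokens_per_image * len(image_token_counts)
    return context_limit - (text_tokens + visual + special)


images = [1024, 1024, 256]  # expanded token counts for three images (illustrative)

correct = output_headroom(32768, text_tokens=500, image_token_counts=images)
naive = 32768 - (500 + len(images))  # the bug: one token per placeholder

print(correct)          # 29958
print(naive)            # 32265
print(naive - correct)  # 2307 tokens of headroom that do not exist
```

A negative result means the input alone exceeds the context limit, so resolution, compression, or image count must change before the request is sent.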
Comparison of Token Density per Modality
Even for the same one second of content or a single image, token density varies greatly by modality. A feel for these magnitudes lets you estimate cost and context usage quickly.
Rough token density (concept, model/config dependent)
text 1 paragraph (~100 words) -> ~130 tokens
image 1 medium-resolution image -> hundreds of tokens
image 1 high-resolution document -> thousands of tokens
audio 1 second of speech -> tens to hundreds of tokens
video 1 second (after sampling) -> hundreds to thousands of tokens
Implication: density order is video > high-res image > audio > text
long video drains context the fastest
These are not exact numbers but give a sense of relative scale. Remembering that video and high-resolution images are the main context consumers makes design easier.
How a VQ Codebook Actually Works
Let us look deeper at VQ tokenization. A codebook consists of K learned vectors (codes). Quantization replaces each position vector produced by the encoder with the nearest code in the codebook.
VQ quantization step (concept)
encoder output z (per-position vector)
codebook E = [e_0, e_1, ..., e_{K-1}]
for each z:
nearest code index = argmin_k distance(z, e_k)
quantized output = e_{nearest}
training:
- encoder moves toward the codes
- codebook moves used codes toward the input mean
- pass gradients via straight-through estimation
A chronic VQ problem is codebook collapse. Only a few codes keep getting selected while the rest die, reducing representational diversity. To prevent this, codebook restarts (reinitializing dead codes), exponential moving average (EMA) updates, and a commitment loss are used. Monitoring codebook utilization (how many codes are actually used) matters.
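The lookup step and the utilization metric can be sketched in a few lines. This covers inference-time quantization only, using squared Euclidean distance; the training-side pieces (straight-through gradients, EMA updates, commitment loss) are not shown, and the tiny codebook is made up for illustration.

```python
def quantize(vectors: list[list[float]], codebook: list[list[float]]) -> list[int]:
    """Index of the nearest codebook entry (squared Euclidean distance) for each vector."""
    indices = []
    for z in vectors:
        distances = [sum((a - b) ** 2 for a, b in zip(z, code)) for code in codebook]
        indices.append(distances.index(min(distances)))
    return indices


def utilization(indices: list[int], codebook_size: int) -> float:
    """Fraction of codes selected at least once; low values hint at collapse."""
    return len(set(indices)) / codebook_size


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]  # K = 4 toy codes
latents = [[0.1, 0.1], [0.9, 0.2], [0.2, 0.8], [0.0, 0.1]]

codes = quantize(latents, codebook)
print(codes)                              # [0, 1, 2, 0]
print(utilization(codes, len(codebook)))  # 0.75, code 3 is never used
```

In practice utilization would be tracked over many batches rather than a handful of vectors, so that rarely used codes are not mistaken for dead ones.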
Trade-offs in Adapter Design
The adapter (projector) that moves vision features into the LLM space has several design options, each balancing token count against information preservation differently.
Adapter type comparison (concept)
[linear/MLP projector]
map patch features per-position into LLM dimension
token count = patch count (no compression)
pro: simple, good information preservation
con: many patches means many tokens
[Q-Former style (learnable query)]
condense info into a few queries
token count = query count (fixed, compressed)
pro: easy control of token count
con: detail loss when over-compressed, trickier to train
[pooling/merging based]
group adjacent patch tokens to downsample
moderate compression
The practical choice depends on the task. When detail matters (OCR, documents), a low-compression linear/MLP path is favorable; when summarization suffices (general scene understanding), a Q-Former style can help. Many recent VLMs get good results with a simple MLP projector plus dynamic resolution.
Interleaving Pitfalls: Order and Alignment
In interleaving, token order and alignment are subtle but important. The same information arranged in a different order can be interpreted differently by the model.
Effect of interleaving order (concept)
[image first]
[IMG ...][question text]
-> model sees the image first, then answers the question
[question first]
[question text][IMG ...]
-> interpret the image on top of the question context
Implication: order affects attention flow and results
using an order consistent with the training distribution is safe
Also, when inserting multiple images, it helps to attach an index or label to each so you can clearly refer to "the first image" and "the second image." Along with separator tokens, such labeling improves multi-image reasoning accuracy.
Caching and Reusing Tokenization Results
Visual tokenization is expensive — it involves a vision-encoder forward pass and projector compute. But when the same image appears repeatedly (multi-turn conversation, repeated queries on one document, popular images), you can cache and reuse the tokenization result.
Visual token caching (concept)
cache key = hash(image bytes + preprocessing params)
preprocessing params: target resolution, normalization settings, etc.
request -> compute key
hit: use cached visual tokens immediately (skip encoder)
miss: encode, then store the result
Caution:
- omitting preprocessing params from the key reuses wrong tokens
- needs memory/disk limits and an expiration policy
Caching does not reduce the number of tokens an image produces, but it greatly cuts the cost of encoding the same image repeatedly. The effect is largest when repeatedly processing high-resolution images with many tokens. The cache key must include the preprocessing params to avoid reusing tokens computed under different settings.
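A minimal in-memory version of this cache might look as follows. The key hashes the raw image bytes together with a canonical JSON dump of the preprocessing params. The encoder here, `fake_encode`, is a hypothetical stand-in for a vision encoder plus projector, and the sketch has no size limit or expiration policy, which a production cache would need.

```python
import hashlib
import json


def cache_key(image_bytes: bytes, preprocess_params: dict) -> str:
    """Hash of image content plus preprocessing settings (sorted for stable ordering)."""
    digest = hashlib.sha256()
    digest.update(image_bytes)
    digest.update(b"\x00")  # separator between image bytes and params
    digest.update(json.dumps(preprocess_params, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


class VisualTokenCache:
    def __init__(self, encode_fn):
        self._encode = encode_fn
        self._store: dict[str, list] = {}
        self.hits = 0
        self.misses = 0

    def get(self, image_bytes: bytes, preprocess_params: dict) -> list:
        key = cache_key(image_bytes, preprocess_params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        tokens = self._encode(image_bytes, preprocess_params)
        self._store[key] = tokens
        return tokens


def fake_encode(image_bytes: bytes, params: dict) -> list:
    # Hypothetical stand-in: a real system would run the vision encoder here.
    side = params["resolution"] // 14
    return [0.0] * (side * side)


cache = VisualTokenCache(fake_encode)
cache.get(b"image-data", {"resolution": 224})
cache.get(b"image-data", {"resolution": 224})  # hit: same bytes, same params
cache.get(b"image-data", {"resolution": 448})  # miss: params differ
print(cache.hits, cache.misses)  # 1 2
```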
Comparison: Tokenization and Compression
| Item | Method | Pros | Cons |
|---|---|---|---|
| Image | patch (ViT) | continuous representation, strong for understanding | tokens grow with resolution |
| Image | VQ discrete | unified vocabulary with text, good for generation | quantization loss, codebook collapse |
| Audio | discrete codec (RVQ) | unified discrete tokens | high tokens-per-second density |
| Compression | pruning | simple, effective | risk of losing key tokens |
| Compression | merging | information preserving | implementation complexity |
| Compression | Q-Former | fixed small token count | detail loss when over-compressed |
Pitfalls and Troubleshooting
- Forgetting placeholder expansion: if you forget that one placeholder expands into hundreds of tokens, your context budget math is wrong. Always compute with the expanded length.
- Missing/conflicting separators: without modality boundary tokens, the model may conflate text and visuals.
- Resolution runaway: feeding high-resolution images as-is explodes tokens, spiking latency and cost. Use dynamic resolution/caps.
- Over-compression: detail vanishes in OCR, small objects, and charts. Vary compression strength per task.
- Audio/video length: tokens grow linearly with length. You need a sampling rate and chunking strategy.
- Position encoding mismatch: if training and inference resolutions/frame rates differ greatly, positional generalization can break.
Conclusion
Multimodal LLMs ultimately rest on the technique of "turning everything into tokens." Images are tokenized with patches or VQ, audio with discrete codecs, and video with sampling followed by patches. These tokens are marked with separator tokens and modality embeddings and interleaved with text into one sequence.
The biggest practical challenge is token explosion. Reduce tokens with pruning, merging, and Q-Former-style compression, but guard against over-compression in detail tasks like OCR. Since token count directly drives compute, memory, and billing, designing resolution policy together with compression is the key to cost efficiency. The next article covers the new challenges that arise when serving such multimodal inputs in production.