
Multimodal Tokenization and Fusion: Turning Images and Audio Into Tokens


Introduction: Tokens Are the Gateway to Everything

LLMs eat tokens and emit tokens. With text only, the tokenizer just splits a string into subword tokens and you are done. But to build a model that understands images, hears speech, or watches video, one question arises: how do we turn an image, audio, or video into tokens and put them in the same sequence?

That question is the heart of multimodal LLM design. The backbone of a multimodal model is usually a Transformer, and a Transformer takes a sequence of tokens (embedding vectors) as input. So every modality must ultimately become a "sequence of tokens." How you convert determines the model's capability, cost, and accuracy.

This article covers how to turn images, audio, and video into tokens, how to interleave them with text into one sequence, and how to compress when the token count explodes. If the previous article covered the principles of alignment and fusion, this one is about how that fusion is implemented at the actual input level.

Image Tokenization: Patches and VQ

There are two broad routes to turn an image into tokens.

Patch Embedding

The Vision Transformer (ViT) is the representative example. Split the image into a grid of fixed-size patches (e.g., 14x14 pixels) and linearly project each patch into a vector. Each such vector becomes a "visual token." Patch tokens are continuous embeddings; there is no codebook.

Patch tokenization (concept)

image H x W x 3

-> split into patches (p x p)

-> patch count = (H/p) x (W/p)

-> each patch -> linear projection -> vector (dim D)

-> visual token sequence (N_patch x D)

e.g., 224x224 image, p=14 -> 16 x 16 = 256 tokens

448x448 image, p=14 -> 32 x 32 = 1024 tokens

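As resolution rises, the token count grows quadratically with the image side length (that is, linearly with pixel count): doubling each side from 224 to 448 quadruples the tokens from 256 to 1024. This is the primary driver of the token explosion discussed later.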
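
As a quick check on the arithmetic above, here is a minimal sketch of the patch-count formula. The function name `patch_token_count` is illustrative, and the sketch assumes the image sides are exact multiples of the patch size (real preprocessors usually resize or pad first).

```python
def patch_token_count(height: int, width: int, patch: int = 14) -> int:
    """Number of visual tokens when an image is cut into patch x patch tiles."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return (height // patch) * (width // patch)


# Doubling the side length quadruples the token count.
for side in (224, 448, 896):
    print(side, patch_token_count(side, side))
```

With p=14 this gives 256, 1024, and 4096 tokens for the three sizes.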

VQ (Vector Quantization)

The VQ-VAE/VQGAN family quantizes an image into discrete codes. An encoder turns the image into a latent grid, and each position vector is replaced by the nearest code in a codebook. The image then becomes a "grid of discrete token IDs," represented as integer indices like text tokens.

VQ image tokenization (concept)

image -> encoder -> latent grid (h x w x D)

quantize each position vector to nearest codebook code

-> code index grid (h x w), each value in {0..K-1}

-> flatten -> discrete token sequence

Pro: unified discrete vocabulary with text -> good for generation

Con: quantization loss, risk of codebook collapse

The patch route is common in understanding-oriented VLMs; the VQ route is common in unified generative models (including image generation). Designs that mix both also exist.

Audio Tokenization: Discrete Codecs

Audio is a continuous signal along time. To turn it into tokens, slice time into frames and quantize each frame into discrete units. A neural audio codec does this.

The codec encoder compresses the waveform into a latent representation at a fixed frame rate, and residual vector quantization (RVQ) discretizes it across multiple codebooks. The result is a token stream of several codes per time frame.

Discrete audio codec (concept)

waveform (high sample rate) -> codec encoder -> frame representation (low frame rate)

multi-layer quantize each frame with RVQ

-> Q codes per frame (hierarchical)

-> [t=1: c1..cQ][t=2: c1..cQ]... discrete audio token stream

Note: frame rate and number of codebooks set the bitrate/quality

tokens per second can far exceed text

A key trap is the density of audio tokens. One second of speech easily becomes tens to hundreds of tokens, so long audio drains context quickly. Choose frame rate and quantization depth carefully.
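
To make the density concrete, the sketch below counts audio tokens from the frame rate and the number of RVQ codebooks. The 50 Hz frame rate and 8 codebooks are hypothetical values chosen for illustration, not the settings of any particular codec, and the function assumes every code becomes its own sequence position (designs that combine the codes of one frame into a single position would count differently).

```python
def audio_token_count(duration_s: float, frame_rate_hz: float, num_codebooks: int) -> int:
    """Tokens produced when each RVQ code of each frame is kept as a separate token."""
    frames = int(duration_s * frame_rate_hz)
    return frames * num_codebooks


# Hypothetical codec: 50 frames per second, 8 codebooks per frame.
print(audio_token_count(1, 50, 8))    # one second
print(audio_token_count(60, 50, 8))   # one minute
```

Under these assumptions one second costs 400 tokens and one minute 24,000, which is why frame rate and quantization depth deserve attention.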

Video Frame Sampling

Video is a sequence of images, so its token count explodes most easily. A 30fps clip of 10 seconds is 300 frames, and at hundreds of tokens per frame, that is tens of thousands of tokens. Frame sampling is therefore essential.

Video tokenization strategy (concept)

source: 30fps x duration

-> frame sampling (e.g., downsample to 1~2 fps)

-> prefer keyframes/scene changes (optional)

-> patch-tokenize each frame

-> insert temporal separator tokens (frame boundaries)

-> spatiotemporal token sequence

Extra compression: merge/pool adjacent frame tokens to remove redundancy

Sampling is a trade-off between information loss and cost. To capture fast motion you must raise the frame rate, which increases tokens and cost.
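
The trade-off can be estimated with a few lines of arithmetic. This is a rough sketch: `video_token_count` is an illustrative helper, 256 tokens per frame matches the 224x224, p=14 example earlier, and one separator token per frame is an assumption.

```python
def video_token_count(duration_s: float, sample_fps: float,
                      tokens_per_frame: int, separators_per_frame: int = 1) -> int:
    """Tokens for a clip after sampling frames at sample_fps."""
    frames = int(duration_s * sample_fps)
    return frames * (tokens_per_frame + separators_per_frame)


# A 10-second clip: every frame at 30 fps versus sampling down to 1 fps.
print(video_token_count(10, 30, 256, separators_per_frame=0))
print(video_token_count(10, 1, 256))
```

Keeping all 300 frames costs 76,800 tokens, while 1 fps sampling with separators costs 2,570.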

Building a Unified Sequence: Interleaving

Once each modality is tokenized, you must weave them into one sequence. The common approach is interleaving: inserting each modality's tokens at placeholder positions in the text.

Interleaved input layout (concept)

user input:

"Describe the following image [IMG] and also this sound [AUDIO]"

expanded into a token sequence:

[txt: Describe the following image]

[IMG_START][v1 v2 ... v256][IMG_END]

[txt: and also this sound]

[AUD_START][a1 a2 ... aM][AUD_END]

The LLM attends over this single sequence as a whole.

The crucial point is that one placeholder slot actually expands into hundreds of tokens. A placeholder is one or two tokens in the text prompt, but once expanded it takes a large share of the context.
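
The expansion can be sketched with a toy example. Whitespace splitting stands in for a real tokenizer, the strings `v1..v256` and `a1..a100` stand in for embedding vectors, and the helper name `expand_prompt` is an assumption made for illustration.

```python
from typing import List


def expand_prompt(prompt: str, image_tokens: List[str], audio_tokens: List[str]) -> List[str]:
    """Replace each placeholder with its delimited run of modality tokens."""
    sequence: List[str] = []
    for piece in prompt.split():  # toy tokenizer: split on whitespace
        if piece == "[IMG]":
            sequence += ["[IMG_START]", *image_tokens, "[IMG_END]"]
        elif piece == "[AUDIO]":
            sequence += ["[AUD_START]", *audio_tokens, "[AUD_END]"]
        else:
            sequence.append(piece)
    return sequence


prompt = "Describe the following image [IMG] and also this sound [AUDIO]"
image_tokens = [f"v{i}" for i in range(1, 257)]   # 256 visual tokens
audio_tokens = [f"a{i}" for i in range(1, 101)]   # 100 audio tokens (arbitrary)
seq = expand_prompt(prompt, image_tokens, audio_tokens)
print(len(prompt.split()), "->", len(seq))
```

The ten-piece prompt grows to 368 positions, almost all of them from the two placeholders.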

Modality Embeddings and Separator Tokens

It helps if the model knows which modality a token came from. Two devices serve this.

- **Separator (special) tokens**: special tokens marking image start/end and audio start/end to make boundaries explicit.

- **Modality embeddings**: a learned embedding indicating the modality type is added to the token embedding, distinguishing the source even at the same position.

Position information matters too. Images need 2D spatial positions; video needs time-plus-space positions. To handle arbitrary resolutions, recent VLMs encode time, height, and width coordinates together with a multidimensional rotary position scheme (e.g., M-RoPE style).

Position encoding dimensions (concept)

text: 1D position (order)

image: 2D position (row, column)

video: 3D position (time, row, column)

Multidimensional RoPE: split each axis and encode as rotary positions

-> flexible across arbitrary resolution/length

Token Explosion and Compression

The most practical problem in multimodal models is token count. Transformer attention cost grows quadratically with sequence length, so many visual tokens spike compute and memory. Compression is therefore a core technique.

Token Pruning

Drop low-importance tokens. Estimate a token's informativeness from attention scores or a learned score, and remove less important patch tokens such as background. Reducing gradually across layers is common.
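
A minimal sketch of score-based pruning follows. It assumes the importance scores are already available (for example from attention weights) and simply keeps the top `keep` tokens in their original order; the labels and scores are made up.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def prune_tokens(tokens: Sequence[T], scores: Sequence[float], keep: int) -> List[T]:
    """Keep the `keep` highest-scoring tokens, preserving sequence order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # restore original order
    return [tokens[i] for i in kept]


patches = ["sky", "cat", "grass", "eye"]
scores = [0.05, 0.60, 0.10, 0.25]
print(prune_tokens(patches, scores, keep=2))
```

Preserving order matters because the surviving tokens still carry positional information.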

Token Merging

Combine similar tokens into one. Merging adjacent or embedding-similar tokens with an average or weighted sum reduces the count while preserving most of the information.
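
One simple way to picture merging is a greedy pass that averages a token into the previous group whenever the two are similar enough. This is a simplified illustration, not a specific published algorithm; the cosine threshold of 0.9 is an arbitrary assumption.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_adjacent(tokens: List[Vector], threshold: float = 0.9) -> List[Vector]:
    """Greedily average each token into the previous group if they are similar."""
    groups = []  # each entry: (running sum vector, member count)
    for vec in tokens:
        if groups:
            total, count = groups[-1]
            mean = [x / count for x in total]
            if cosine(mean, vec) >= threshold:
                groups[-1] = ([t + v for t, v in zip(total, vec)], count + 1)
                continue
        groups.append((list(vec), 1))
    return [[x / count for x in total] for total, count in groups]


tokens = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
merged = merge_adjacent(tokens)
print(len(tokens), "->", len(merged))
```

The first two nearly identical vectors collapse into their mean, and the dissimilar third vector stays separate.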

Q-Former-Style Compression

Use learnable queries and cross-attention with visual features as key/value to condense information into a small number of query tokens. Regardless of input patch count, only a fixed small number of tokens passes to the LLM.

Q-Former-style compression (concept)

visual patch features (N_patch x D) <- many (e.g., 1024)

|

learnable query (Q x D) <- few (e.g., 32~64)

|

cross-attention: queries attend to patches, condensing info

|

compressed visual tokens (Q x D) -> LLM input

Effect: visual tokens the LLM sees drop from 1024 -> 64

Caution: over-compression loses detail (small text, OCR)

Compression is not a cure-all. For detail-heavy tasks like OCR or reading fine charts, over-compression hurts performance. Tune compression strength to the task.
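
The core mechanism, a fixed set of queries attending over a variable number of patches, can be sketched in plain Python. This is a deliberately stripped-down assumption: one head, no learned key/value projections, and random vectors in place of trained queries and real patch features. A real Q-Former-style module stacks several layers and learns all of these.

```python
import math
import random
from typing import List

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_pool(queries: List[Vector], patches: List[Vector]) -> List[Vector]:
    """Each query returns a softmax-weighted average of the patch features."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, p)) for p in patches]
        weights = softmax(scores)
        outputs.append([sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)])
    return outputs


random.seed(0)
dim = 8
patches = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(1024)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(64)]  # learned in practice
compressed = cross_attention_pool(queries, patches)
print(len(patches), "->", len(compressed))
```

The output length equals the number of queries regardless of how many patches come in, which is what makes the token count easy to control.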

Context Cost

Token count is cost. Watch three aspects.

- **Compute cost**: attention is quadratic in sequence length, FFN linear. Many visual tokens greatly increase prefill-stage compute.

- **Memory cost**: KV cache scales with token count. Long multimodal inputs eat KV cache memory quickly.

- **Latency/billing**: API billing and latency track token count directly. Casually feeding high-resolution images spikes cost.

How token count affects cost (concept)

visual tokens N_v increase ->

- prefill FLOPs: attention O((N_v + N_t)^2) term grows

- KV cache memory: O(N_v + N_t) grows

- TTFT (time to first token): longer prefill -> grows

Mitigations: resolution caps, dynamic resolution, compression (pruning/merging/Q-Former)

In practice, design input resolution policy (caps, dynamic adjustment), compression strength, and caching together. At equal accuracy, fewer tokens is almost always better.
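
A back-of-the-envelope estimate helps when weighing these costs. The sketch below uses the standard KV cache size formula (keys plus values, per layer, per KV head) and the quadratic attention term from above. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) is a hypothetical configuration, not a specific model.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size: keys and values for every layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def attention_cost_ratio(n_visual: int, n_text: int) -> float:
    """How much the quadratic attention term grows once visual tokens are added."""
    return ((n_visual + n_text) / n_text) ** 2


# Hypothetical model shape; one 1024-token image plus 200 text tokens.
size = kv_cache_bytes(tokens=1024 + 200, layers=32, kv_heads=8, head_dim=128)
print(f"KV cache: {size / 2**20:.1f} MiB")
print(f"attention term: x{attention_cost_ratio(1024, 256):.0f}")
```

In this configuration each token costs 128 KiB of cache, and adding 1024 visual tokens to a 256-token prompt multiplies the attention term by 25.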

Dynamic Resolution and Arbitrary Aspect Ratios

Real images are not square. Wide panoramas, tall document scans, tiny icons — aspect ratios and sizes vary. Force-resizing to a fixed resolution (e.g., 224x224) breaks the aspect ratio and distorts information. Text-dense images such as documents or charts especially suffer, with characters smeared by force-resizing.

Some recent VLMs, such as Qwen2-VL, support arbitrary input resolution (Qwen2-VL calls this naive dynamic resolution). They build the patch grid dynamically while preserving the original aspect ratio, so the visual token count varies per image. For this to work, position encoding must also handle arbitrary grids.

Fixed vs dynamic resolution (concept)

[fixed] force-resize every image to 224x224

-> aspect distortion, small text smeared

-> token count always identical (easy to predict)

[dynamic] preserve original aspect ratio, variable patch grid

-> less distortion, good for documents/charts

-> token count varies per image (hard to predict)

Compromise: set min/max token caps and adjust dynamically within them

Dynamic resolution helps quality but, with larger token-count variation, complicates memory planning and batching. So one usually sets min/max token caps and only adjusts dynamically within that range.
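
The capped-dynamic policy can be sketched as a function that picks a patch grid close to the original aspect ratio and rescales it only when the token count leaves the allowed range. The function name `dynamic_grid` and the caps of 64 and 1024 tokens are illustrative assumptions, and the sketch assumes the minimum cap is well below the maximum.

```python
import math
from typing import Tuple


def dynamic_grid(height: int, width: int, patch: int = 14,
                 min_tokens: int = 64, max_tokens: int = 1024) -> Tuple[int, int]:
    """Return (rows, cols) of a patch grid within the caps, roughly keeping aspect ratio."""
    rows = max(1, round(height / patch))
    cols = max(1, round(width / patch))
    tokens = rows * cols
    if tokens > max_tokens:
        scale = math.sqrt(max_tokens / tokens)
        rows = max(1, math.floor(rows * scale))
        cols = max(1, math.floor(cols * scale))
        # Extreme aspect ratios: clamping a side to 1 can overshoot, so trim the long side.
        if rows * cols > max_tokens:
            if rows >= cols:
                rows = max_tokens // cols
            else:
                cols = max_tokens // rows
    elif tokens < min_tokens:
        scale = math.sqrt(min_tokens / tokens)
        rows = math.ceil(rows * scale)
        cols = math.ceil(cols * scale)
    return rows, cols


for size in ((224, 224), (1400, 2800), (28, 28)):
    rows, cols = dynamic_grid(*size)
    print(size, "->", (rows, cols), rows * cols, "tokens")
```

A 224x224 image keeps its 16x16 grid, a 1400x2800 scan shrinks to 22x45 (990 tokens), and a 28x28 icon is scaled up to 8x8.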

Multidimensional Position Encoding in Detail

Position encoding needs different dimensionality per modality. Text needs 1D, images 2D, video 3D positions. Using a single 1D position cannot properly express an image's spatial structure or a video's temporal structure.

Extending rotary position encoding (RoPE) to multiple dimensions solves this. Split the embedding dimension into several groups and encode the position of a different axis (time/height/width) into each group as rotation. Then one token can carry temporal and spatial coordinates at once.

Multidimensional RoPE split (concept)

split embedding dim D into three sections:

[ time-axis rotation (D_t) | height-axis rotation (D_h) | width-axis rotation (D_w) ]

encode each token's (t, h, w) coords as rotation in the matching section

-> handle text (a single sequence index), image (h, w), video (t, h, w) in one unified frame

-> flexible extrapolation to arbitrary resolution/length

The advantage is relatively good generalization to resolutions or lengths unseen during training. But if training and inference distributions differ too much, extrapolation can break, so exposing diverse resolutions during training is wise.
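
To illustrate how one sequence can carry (t, h, w) coordinates, the sketch below assigns position triples to a short text, image, text sequence. The conventions are assumptions made for illustration: text tokens reuse one index on all three axes (which behaves like ordinary 1D positions), image tokens share a time index and vary by row and column, and each segment starts after the largest index used so far. Actual models may choose different offsets.

```python
from typing import List, Tuple

Position = Tuple[int, int, int]  # (t, h, w)


def text_positions(start: int, length: int) -> List[Position]:
    """Text: the same running index on every axis."""
    return [(start + i, start + i, start + i) for i in range(length)]


def image_positions(start: int, rows: int, cols: int) -> List[Position]:
    """Image: fixed time index, row index on h, column index on w."""
    return [(start, start + r, start + c) for r in range(rows) for c in range(cols)]


def next_start(positions: List[Position]) -> int:
    """First index not yet used on any axis."""
    return max(max(p) for p in positions) + 1


seq = text_positions(0, 3)
seq += image_positions(next_start(seq), rows=2, cols=3)
seq += text_positions(next_start(seq), 2)
for pos in seq:
    print(pos)
```

Each coordinate of a triple would then drive the rotation in its own section of the embedding dimension.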

Token Budget Planning

When handling multimodal input, plan the token budget explicitly. Unlike text, whose count comes straight from the tokenizer, each visual placeholder expands into many tokens and drains the budget quickly.

Token budget calculation (concept)

total context limit = C (e.g., 32768)

input tokens = text tokens + visual tokens (after expansion) + special tokens

output headroom = C - input tokens

Checks:

- did you count visual tokens as the "expanded" value, not the placeholder?

- did you sum across multiple images?

- did you leave headroom for output length?

If over: lower resolution, strengthen compression, limit image count

Many production bugs come from mistaking the placeholder for one token and miscalculating the budget. Always compute with the expanded length.
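
The budget check is simple enough to encode directly. In this sketch the helper name `output_headroom` and the example numbers (500 text tokens, three images, six special tokens) are illustrative assumptions; the context limit of 32768 is the example value used above.

```python
from typing import Sequence


def output_headroom(context_limit: int, text_tokens: int,
                    visual_tokens_per_image: Sequence[int], special_tokens: int = 0) -> int:
    """Tokens left for the output after counting the expanded multimodal input."""
    input_tokens = text_tokens + sum(visual_tokens_per_image) + special_tokens
    return context_limit - input_tokens


images = [1024, 1024, 4096]  # expanded token counts, one entry per image
correct = output_headroom(32768, 500, images, special_tokens=6)
naive = output_headroom(32768, 500, [1] * len(images))  # the bug: placeholder counted as 1 token
print("correct headroom:", correct)
print("naive headroom:  ", naive)
if correct < 0:
    print("over budget: lower resolution, compress harder, or drop images")
```

Here the naive count overestimates the headroom by more than 6,000 tokens.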

Comparison of Token Density per Modality

Even for the same one second or one image, token density varies greatly by modality. Having this sense lets you gauge cost and context intuitively.

Rough token density (concept, model/config dependent)

text 1 paragraph (~100 words) -> ~130 tokens

image 1 medium-resolution image -> hundreds of tokens

image 1 high-resolution document -> thousands of tokens

audio 1 second of speech -> tens to hundreds of tokens

video 1 second (after sampling) -> hundreds to thousands of tokens

Implication: rough density order is video > high-res image > audio > text

long video drains context the fastest

These are not exact numbers but give a sense of relative scale. Remembering that video and high-resolution images are the main context consumers makes design easier.

How a VQ Codebook Actually Works

Let us look deeper at VQ tokenization. A codebook consists of K learned vectors (codes). Quantization replaces each position vector produced by the encoder with the nearest code in the codebook.

VQ quantization step (concept)

encoder output z (per-position vector)

codebook E = [e_0, e_1, ..., e_{K-1}]

for each z:

nearest code index = argmin_k distance(z, e_k)

quantized output = e_{nearest}

training:

- encoder moves toward the codes

- codebook moves used codes toward the input mean

- pass gradients via straight-through estimation

A chronic VQ problem is codebook collapse: only a few codes keep getting selected while the rest die, which reduces representational diversity. Common mitigations are codebook restarts (reinitializing dead codes) and exponential moving average (EMA) codebook updates, usually alongside a commitment loss that keeps encoder outputs close to their assigned codes. Monitoring codebook utilization (the fraction of codes actually used) matters.
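
The lookup step and the utilization metric can be shown with a tiny hand-made codebook. This sketch covers inference-time quantization only (no training, no straight-through gradients), uses squared Euclidean distance, and the four 2D codes are made-up values.

```python
from typing import List

Vector = List[float]


def quantize(latents: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each latent vector to the index of its nearest code (squared L2 distance)."""
    indices = []
    for z in latents:
        dists = [sum((a - b) ** 2 for a, b in zip(z, e)) for e in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def codebook_utilization(indices: List[int], codebook_size: int) -> float:
    """Fraction of codes that were selected at least once."""
    return len(set(indices)) / codebook_size


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latents = [[0.1, -0.1], [0.9, 0.2], [0.8, 0.1], [0.2, 0.1]]
indices = quantize(latents, codebook)
print(indices, codebook_utilization(indices, len(codebook)))
```

Only two of the four codes are used here, so utilization is 0.5; a value that stays low over a large batch is the warning sign for collapse.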

Trade-offs in Adapter Design

The adapter (projector) that moves vision features into the LLM space has several design options, each balancing token count against information preservation differently.

Adapter type comparison (concept)

[linear/MLP projector]

map patch features per-position into LLM dimension

token count = patch count (no compression)

pro: simple, good information preservation

con: many patches means many tokens

[Q-Former style (learnable query)]

condense info into a few queries

token count = query count (fixed, compressed)

pro: easy control of token count

con: detail loss when over-compressed, trickier to train

[pooling/merging based]

group adjacent patch tokens to downsample

moderate compression

The practical choice depends on the task. When detail matters (OCR, documents), a low-compression linear/MLP path is favorable; when summarization suffices (general scene understanding), a Q-Former style can help. Many recent VLMs get good results with a simple MLP projector plus dynamic resolution.
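
For the pooling-based option, a 2x2 average over the patch grid is one simple instance. The sketch assumes the grid has even height and width and uses 1-dimensional features so the numbers are easy to follow.

```python
from typing import List

Grid = List[List[List[float]]]  # rows x cols x feature dim


def pool_2x2(grid: Grid) -> Grid:
    """Average each 2x2 block of patch tokens into one token (4x fewer tokens)."""
    rows, cols, dim = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for r in range(0, rows, 2):
        pooled_row = []
        for c in range(0, cols, 2):
            block = [grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1]]
            pooled_row.append([sum(v[d] for v in block) / 4 for d in range(dim)])
        pooled.append(pooled_row)
    return pooled


# 4x4 grid whose single feature is the patch index 0..15.
grid = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
pooled = pool_2x2(grid)
print(len(grid) * len(grid[0]), "->", len(pooled) * len(pooled[0]))
```

Sixteen tokens become four, each the mean of its 2x2 neighborhood.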

Interleaving Pitfalls: Order and Alignment

In interleaving, token order and alignment are subtle but important. The same information arranged in a different order can be interpreted differently by the model.

Effect of interleaving order (concept)

[image first]

[IMG ...][question text]

-> model sees the image first, then answers the question

[question first]

[question text][IMG ...]

-> interpret the image on top of the question context

Implication: order affects attention flow and results

using an order consistent with the training distribution is safe

Also, when inserting multiple images, it helps to attach an index or label to each so you can clearly refer to "the first image" and "the second image." Along with separator tokens, such labeling improves multi-image reasoning accuracy.

Caching and Reusing Tokenization Results

Visual tokenization is expensive — it involves a vision-encoder forward pass and projector compute. But when the same image appears repeatedly (multi-turn conversation, repeated queries on one document, popular images), you can cache and reuse the tokenization result.

Visual token caching (concept)

cache key = hash(image bytes + preprocessing params)

preprocessing params: target resolution, normalization settings, etc.

request -> compute key

hit: use cached visual tokens immediately (skip encoder)

miss: encode, then store the result

Caution:

- omitting preprocessing params from the key reuses wrong tokens

- needs memory/disk limits and an expiration policy

Caching does not reduce the number of tokens the LLM sees, but it greatly cuts the repeated encoding cost. The effect is largest when repeatedly processing high-resolution images with many tokens. The cache key must include the preprocessing params to avoid wrong reuse.
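
A minimal in-memory version of this cache might look as follows. The class name `VisualTokenCache` and the stand-in encoder are hypothetical; a production cache would also need the size limits and expiration policy mentioned above, which this sketch omits.

```python
import hashlib
import json
from typing import Callable, Dict, List


def cache_key(image_bytes: bytes, params: dict) -> str:
    """Hash the image bytes together with the preprocessing params."""
    digest = hashlib.sha256()
    digest.update(len(image_bytes).to_bytes(8, "big"))  # length prefix avoids ambiguity
    digest.update(image_bytes)
    digest.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


class VisualTokenCache:
    def __init__(self, encode: Callable[[bytes, dict], List[float]]):
        self._encode = encode
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, image_bytes: bytes, params: dict) -> List[float]:
        key = cache_key(image_bytes, params)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._encode(image_bytes, params)  # expensive step
        return self._store[key]


# Stand-in for the vision encoder plus projector.
cache = VisualTokenCache(lambda data, params: [float(len(data))] * 4)
image = b"fake image bytes"
cache.get(image, {"resolution": 448})
cache.get(image, {"resolution": 448})   # same image, same params: hit
cache.get(image, {"resolution": 224})   # different params: miss
print(cache.hits, cache.misses)
```

The third call misses even though the bytes are identical, because the target resolution is part of the key.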

Comparison: Tokenization and Compression

| Item | Method | Pros | Cons |
| --- | --- | --- | --- |
| Image | patch (ViT) | continuous representation, strong for understanding | tokens grow with resolution |
| Image | VQ discrete | unified vocabulary with text, good for generation | quantization loss, codebook collapse |
| Audio | discrete codec (RVQ) | unified discrete tokens | high tokens-per-second density |
| Compression | pruning | simple, effective | risk of losing key tokens |
| Compression | merging | information preserving | implementation complexity |
| Compression | Q-Former | fixed small token count | detail loss when over-compressed |

Pitfalls and Troubleshooting

- **Forgetting placeholder expansion**: if you forget that one placeholder expands into hundreds of tokens, your context budget math is wrong. Always compute with the expanded length.

- **Missing/conflicting separators**: without modality boundary tokens, the model may conflate text and visuals.

- **Resolution runaway**: feeding high-resolution images as-is explodes tokens, spiking latency and cost. Use dynamic resolution/caps.

- **Over-compression**: detail vanishes in OCR, small objects, and charts. Vary compression strength per task.

- **Audio/video length**: tokens grow linearly with length. You need a sampling rate and chunking strategy.

- **Position encoding mismatch**: if training and inference resolutions/frame rates differ greatly, positional generalization can break.

Conclusion

Multimodal LLMs ultimately rest on the technique of "turning everything into tokens." Images are tokenized with patches or VQ, audio with discrete codecs, and video with sampling followed by patches. These tokens are marked with separator tokens and modality embeddings and interleaved with text into one sequence.

The biggest practical challenge is token explosion. Reduce tokens with pruning, merging, and Q-Former-style compression, but guard against over-compression in detail tasks like OCR. Since token count drives compute, memory, and billing, designing resolution policy together with compression is the key to cost efficiency. The next article covers the new challenges that arise when serving such multimodal inputs in production.

References

- [Attention Is All You Need (arXiv 1706.03762)](https://arxiv.org/abs/1706.03762)

- [FlashAttention (arXiv 2205.14135)](https://arxiv.org/abs/2205.14135)

- [Qwen2-VL (arXiv 2409.12191)](https://arxiv.org/abs/2409.12191)

- [arXiv cs.CV listing](https://arxiv.org/list/cs.CV/recent)

- [arXiv cs.CL listing](https://arxiv.org/list/cs.CL/recent)

- [Hugging Face docs](https://huggingface.co/docs)

- [PyTorch docs](https://pytorch.org/docs/stable/index.html)

- [QwenLM GitHub](https://github.com/QwenLM)

- [vLLM docs](https://docs.vllm.ai/)
