Chaos and Order


Introduction: Why Vision-Language Models

Once text-only LLMs began understanding images, a whole set of domains opened up at once: document analysis, chart reading, UI automation, robotics. But an LLM is fundamentally a model that consumes a token sequence and predicts the next token. How can image data, a two-dimensional grid of pixels, be placed on the same plane as text tokens?

The core idea is simple. Cut the image into small pieces, turn each piece into a vector, then align those vectors into the embedding space the LLM operates in. Once you do that, each image patch behaves like a single word token. In this post we examine that transformation in depth through three components: the vision encoder, the projector, and the LLM decoder.

Understand this structure well and it becomes obvious why some models excel at high-resolution documents while others blow up their token budget, and why arbitrary-resolution handling is such a hard problem.

What Differs From a Text Model

A pure text LLM already has tokens as input. The tokenizer turns a string into a sequence of integer IDs, the embedding table maps each ID to a vector, and that is it. But an image has no tokenizer. There is no table to map pixels directly to integer IDs.

So a vision LLM places a neural network called the vision encoder instead of a tokenizer. This is the core difference. Text gets tokens by lookup table, while an image is passed through a learned encoder to get continuous feature vectors. These vectors play the role of visual tokens.

text path: string -> tokenizer (lookup) -> token IDs -> embedding -> vector

image path: pixels -> vision encoder (neural net) -> continuous features -> projector -> vector

The ends of the two paths are the same. Both converge to a sequence of vectors of the same dimension the LLM can handle. Only the starting point differs. Once you hold this view, you see that a vision LLM is in the end a text LLM with one more image input path attached.

Core Principles: A Vision LLM in Three Parts

Most modern vision-language models (VLMs) consist of three components.

1. **Vision Encoder**: usually a ViT family model. It takes an image and converts it into a sequence of visual feature vectors.

2. **Vision-Language Projector / Adapter**: it maps the vision encoder's output dimension to the LLM's embedding dimension and aligns them semantically. A linear layer, an MLP, or a Q-Former-style compression module is used.

3. **LLM Decoder**: it receives text tokens and projected visual tokens as one sequence and predicts the next token.

The overall data flow looks like this.

[image]

[patch split] --> flatten patches and embed

[ViT vision encoder] --> sequence of visual features

[projector / adapter] --> project to LLM dim (+ compression)

[visual tokens] + [text tokens] --> interleaved into one sequence

[LLM decoder] --> autoregressively generate text

This stack is sometimes trained end to end, and sometimes only the projector and LLM are trained while the vision encoder stays frozen. Training strategy is covered in a separate post; here we focus on how data flows at inference.

Deep Dive 1: From Patches to Tokens

Patch Splitting and Patch Embedding

A ViT begins by dividing the image into fixed-size patches. Split a 224 x 224 image into 14 x 14 pixel patches and you get 16 across and 16 down, 256 patches in total. Each patch is flattened and passed through a linear projection to become a single embedding vector.

Tracking it by tensor shape:

input image: B x 3 x H x W (batch, channels, height, width)

after patch split: B x N_patch x (P*P*3) (P is patch side length in pixels)

after patch embed: B x N_patch x D (D is vision encoder hidden dim)

Here N_patch is (H / P) x (W / P). With a 224 x 224 image and patch size 14, N_patch is 256. At this point the image is already a sequence, and from here it passes through transformer layers exactly like a text token sequence.
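
The shape bookkeeping above can be checked with a minimal pure-Python sketch. The function name `patchify` and the nested-list image layout are assumptions made for illustration; a real ViT does this with tensor reshapes or a strided convolution, followed by the learned linear projection that is omitted here.

```python
def patchify(image, patch):
    """Split a C x H x W image (nested lists) into flattened patches.

    Returns N_patch vectors of length patch * patch * C, in row-major
    order over the patch grid.
    """
    channels = len(image)
    height, width = len(image[0]), len(image[0][0])
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                image[c][y][x]
                for y in range(top, top + patch)
                for x in range(left, left + patch)
                for c in range(channels)
            ]
            patches.append(flat)
    return patches


# A dummy 3 x 224 x 224 image with patch size 14, as in the text.
image = [[[0.0] * 224 for _ in range(224)] for _ in range(3)]
patches = patchify(image, 14)
print(len(patches), len(patches[0]))  # 256 588
```

The 588 is P*P*3 = 14 * 14 * 3, the flattened patch length before patch embedding maps it to D.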

Position Information and Transformer Encoding

Flattening patches discards the 2D spatial layout, so the ViT adds position embeddings. The sequence then passes through several layers of self-attention and FFN, where each patch absorbs information from every other patch in the image. The output has the same length as the input, and each position is now a context-enriched visual feature vector.

The attention inside a ViT is the same scaled dot-product form as in a text transformer.

attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

Q, K, V are the query, key, and value obtained by linear projection from patch features, and d_k is the per-head dimension. Multi-head attention learns relationships in several subspaces at once.
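
As a rough illustration of the formula, here is a single-head version written with plain lists. It skips the learned Q/K/V projections and the multi-head split, so it is a sketch of the arithmetic only, not of a full ViT layer.

```python
import math


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(k[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        weights = softmax(scores)
        out.append([
            sum(w * vj[d] for w, vj in zip(weights, v))
            for d in range(len(v[0]))
        ])
    return out


# Three toy "patch features" attending to each other (self-attention).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = attention(x, x, x)
print(len(y), len(y[0]))  # 3 2
```

The output keeps the input length, which matches the point above that encoding does not change the number of patches.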

Token Count Is Cost

One intuition matters at this stage: the number of visual tokens scales with image resolution. Double the resolution along each side and the patch count roughly quadruples, and the sequence the LLM must process grows accordingly. Since LLM attention cost scales quadratically with sequence length, cutting the visual token count directly cuts cost. That is exactly why a projector sometimes compresses tokens rather than merely matching dimensions.
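
A quick back-of-the-envelope check of that scaling, assuming patch size 14 and counting only the visual tokens (text tokens and any downsampling are ignored; the helper names are illustrative):

```python
def visual_tokens(height, width, patch=14):
    """Patch count for an image whose sides are multiples of the patch size."""
    return (height // patch) * (width // patch)


def attention_pairs(seq_len):
    """Query-key pairs scored by one full self-attention pass."""
    return seq_len * seq_len


base = visual_tokens(448, 448)       # 32 x 32 = 1024
doubled = visual_tokens(896, 896)    # 64 x 64 = 4096
print(doubled // base)                                     # 4
print(attention_pairs(doubled) // attention_pairs(base))   # 16
```

Doubling each side gives four times the tokens and sixteen times the attention pairs.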

Deep Dive 2: The Projector — A Bridge Between Vision and Language

The vision encoder's output dimension may differ from the LLM's embedding dimension, and their semantic spaces are not aligned. The projector closes this gap. There are two main families.

Linear / MLP Projector

The simplest approach. Each visual feature the vision encoder emits is projected into the LLM dimension by a linear layer or a two-layer MLP. The token count stays the same.

vision encoder out: B x N_v x D_vis

after linear/MLP: B x N_v x D_llm (token count N_v preserved, only dim D_vis -> D_llm)

- Pros: simple structure; every patch feature reaches the LLM, so no visual information is dropped by compression.

- Cons: token count is not reduced, so sequences grow long and costly at high resolution.

The LLaVA family popularized this simple MLP approach. It is easy to implement and performs well, so it is widely used.
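
A minimal sketch of a two-layer MLP projector, assuming a GELU between the layers and toy dimensions. The weights are random here purely to show shapes; in a real model they are learned, and all names below are illustrative.

```python
import math
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one vector; weight is out_dim x in_dim."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def init(out_dim, in_dim, rng):
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def mlp_project(features, w1, b1, w2, b2):
    """Project each visual feature independently; token count is unchanged."""
    return [linear([gelu(h) for h in linear(f, w1, b1)], w2, b2) for f in features]


rng = random.Random(0)
d_vis, d_llm, n_v = 8, 16, 5          # toy sizes, not real model dims
w1, b1 = init(d_llm, d_vis, rng)
w2, b2 = init(d_llm, d_llm, rng)
features = [[rng.random() for _ in range(d_vis)] for _ in range(n_v)]
tokens = mlp_project(features, w1, b1, w2, b2)
print(len(tokens), len(tokens[0]))  # 5 16
```

Five features go in and five tokens come out; only the width changes from D_vis to D_llm.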

Q-Former-Style Learnable Query Compression

The other family places learnable query vectors and uses cross-attention to pull information out of the visual features into a fixed number of tokens. BLIP-2's Q-Former is the canonical example.

learnable queries: B x N_q x D (N_q fixed, e.g. 32, independent of patch count)

vision features (K/V): B x N_v x D_vis

cross-attention: queries absorb information from vision features

output: B x N_q x D_llm (compressed N_v -> N_q)

- Pros: compresses visual tokens to a fixed small number, shortening the LLM sequence and saving cost.

- Cons: fine visual detail can be lost in compression, which hurts precise OCR or document understanding.

The compressing family suits tasks where the big picture matters, such as natural-image captioning, while the non-compressing (MLP) family tends to be better for document work that requires reading small text or tables.
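
The fixed-output-length property of the query-based family can be sketched with a single cross-attention step. This is a simplification under stated assumptions: queries are random instead of learned, there are no K/V projections or stacked layers, and the final projection to D_llm is left out.

```python
import math
import random


def cross_attend(queries, features):
    """Each query takes a softmax-weighted average of the visual features."""
    dim = len(features[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in features]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([
            sum(e / total * f[d] for e, f in zip(exps, features))
            for d in range(dim)
        ])
    return out


rng = random.Random(0)
dim, n_q = 8, 4                        # toy sizes; the text's example uses N_q = 32
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_q)]

for n_v in (16, 256):
    features = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_v)]
    compressed = cross_attend(queries, features)
    print(n_v, "->", len(compressed))  # output length stays at n_q
```

Whether 16 or 256 patch features come in, the LLM receives N_q tokens, which is where both the cost saving and the risk of lost detail come from.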

Deep Dive 3: Injecting Visual Tokens Into the LLM

Visual tokens that have passed through the projector now live in the LLM's embedding space. Weaving them into one sequence with text is called interleaving.

The Basic Form of Interleaving

In the chat input, a special placeholder token marks where an image goes, and just before feeding the LLM that slot is replaced by the visual tokens.

original sequence: [text...] [IMG_PLACEHOLDER] [text...]

after substitution: [text...] [v1][v2]...[vN] [text...]

The resulting unified sequence is, from the LLM's view, just one long embedding sequence. Self-attention sees text tokens and visual tokens together without distinction, learning to let the question text point at a specific region of the image.

Multiple images and interleaved image-text documents follow the same principle: insert a visual token block in the order images appear.

multi-image: [text] [img1 tokens] [text] [img2 tokens] [text]
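
The substitution step can be sketched as follows. The `<image>` marker and the function name are hypothetical, and strings stand in for embedding vectors; a real implementation splices rows into an embedding tensor instead.

```python
IMG = "<image>"  # hypothetical placeholder marker


def interleave(tokens, image_blocks):
    """Replace each placeholder with the next image's visual tokens, in order."""
    blocks = iter(image_blocks)
    out = []
    for tok in tokens:
        if tok == IMG:
            try:
                out.extend(next(blocks))
            except StopIteration:
                raise ValueError("more placeholders than images") from None
        else:
            out.append(tok)
    if next(blocks, None) is not None:
        raise ValueError("more images than placeholders")
    return out


seq = interleave(
    ["Compare", IMG, "with", IMG, "."],
    [["v1a", "v1b"], ["v2a"]],
)
print(seq)  # ['Compare', 'v1a', 'v1b', 'with', 'v2a', '.']
```

Raising on a count mismatch is a deliberate choice in this sketch: a silent misalignment between placeholders and image blocks is much harder to debug than an early error.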

Causal Mask and Visual Tokens

The LLM decoder uses a causal mask so each position sees only tokens before it. Visual tokens follow this rule too. Text that comes after an image can reference the image tokens, but the image tokens themselves usually see only their preceding context. Some models allow bidirectional attention among image tokens so that within one image all patches can see each other. This is a per-implementation design choice.
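
The two masking options can be compared in a small sketch. The `kinds` encoding (None for text, an image id for visual tokens) is an assumption made for this example, not a convention from any particular model.

```python
def attention_mask(kinds, bidirectional_images=False):
    """Build mask[i][j] = True when position i may attend to position j.

    kinds[i] is None for a text token or an image id for a visual token.
    """
    n = len(kinds)
    mask = [[j <= i for j in range(n)] for i in range(n)]  # plain causal mask
    if bidirectional_images:
        for i in range(n):
            for j in range(i + 1, n):
                # Let tokens of the same image see each other's later positions.
                if kinds[i] is not None and kinds[i] == kinds[j]:
                    mask[i][j] = True
    return mask


kinds = [None, 0, 0, 0, None]   # text, three tokens of image 0, text
causal = attention_mask(kinds)
mixed = attention_mask(kinds, bidirectional_images=True)
print(causal[1][3], mixed[1][3])  # False True
print(mixed[0][1])                # False: text still cannot look ahead
```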

Deep Dive 4: Arbitrary-Resolution Handling

Early VLMs force-resized every input to a fixed size (e.g., 336 x 336). That smears wide documents or small text and loses information. Recent models try to handle arbitrary resolution as is.

Qwen2-VL's Naive Dynamic Resolution

Qwen2-VL does not force a fixed size; instead it dynamically generates a number of visual tokens proportional to the original resolution. Larger images become more tokens, smaller images fewer. As a result aspect ratio and fine detail are preserved, which helps document and chart understanding.

small image (e.g. 448x448): few visual tokens

large image (e.g. 1568x1568): many visual tokens

wide document: original ratio kept, more patches along the width

In practice you cap the minimum and maximum token counts so an overly large image does not explode the context.
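
One way such a cap could work is sketched below: keep the aspect ratio, snap each side to a multiple of the patch size, and rescale when the patch count leaves the allowed range. The function name, default caps, and rounding choices are assumptions for illustration, not the preprocessing routine of Qwen2-VL or any other specific model.

```python
import math


def fit_to_budget(height, width, patch=14, min_tokens=64, max_tokens=4096):
    """Pick a resized (height, width) whose patch count lies within the caps."""
    # Snap each side to the nearest multiple of the patch size.
    h = max(patch, round(height / patch) * patch)
    w = max(patch, round(width / patch) * patch)
    tokens = (h // patch) * (w // patch)
    if tokens > max_tokens:
        # Shrink both sides by the same factor, rounding down to stay under the cap.
        scale = math.sqrt(height * width / (max_tokens * patch * patch))
        h = max(patch, math.floor(height / scale / patch) * patch)
        w = max(patch, math.floor(width / scale / patch) * patch)
    elif tokens < min_tokens:
        # Enlarge both sides by the same factor, rounding up to reach the floor.
        scale = math.sqrt(min_tokens * patch * patch / (height * width))
        h = math.ceil(height * scale / patch) * patch
        w = math.ceil(width * scale / patch) * patch
    return h, w, (h // patch) * (w // patch)


for size in [(448, 448), (1568, 1568), (784, 1568), (28, 28)]:
    print(size, "->", fit_to_budget(*size))
```

Images already inside the range pass through unchanged, while oversized ones are scaled down uniformly so the aspect ratio is approximately kept.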

M-RoPE: Multimodal Rotary Position Embedding

Position encoding is also a problem. Text has a 1D order while an image is a 2D grid. Qwen2-VL introduces M-RoPE (Multimodal Rotary Position Embedding), decomposing position into several components: time, height, width. Text tokens share the same value across these components, while image tokens have height and width components that reflect 2D coordinates.

text token: (t, t, t) three components share one position index

image token: (t, h, w) height/width components reflect grid coordinates

This lets a single model represent 1D text and 2D image positions consistently, and extends naturally to cases like video where a time axis is added.
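
The (t, h, w) assignment can be sketched for a mixed sequence. The segment format and function name are hypothetical, and the rule for where the next segment resumes (after the largest index used so far) is an assumption of this sketch; check the model's reference implementation for the exact scheme.

```python
def mrope_positions(segments):
    """Assign (t, h, w) position triples to a mixed text/image sequence.

    segments: list of ("text", n_tokens) or ("image", (rows, cols)).
    """
    positions = []
    start = 0
    for kind, spec in segments:
        if kind == "text":
            # Text: all three components share one running index.
            positions.extend((start + i,) * 3 for i in range(spec))
            start += spec
        else:
            rows, cols = spec
            # Image: constant time component, grid coordinates for height/width.
            positions.extend(
                (start, start + r, start + c)
                for r in range(rows)
                for c in range(cols)
            )
            # Assumption: the next segment resumes after the largest index used.
            start += max(rows, cols)
    return positions


pos = mrope_positions([("text", 2), ("image", (2, 3)), ("text", 1)])
print(pos[:2])   # [(0, 0, 0), (1, 1, 1)]
print(pos[2:8])  # [(2, 2, 2), (2, 2, 3), (2, 2, 4), (2, 3, 2), (2, 3, 3), (2, 3, 4)]
print(pos[8])    # (5, 5, 5)
```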

Comparison Table: Projector Design Choices

| Item | Linear / MLP Projector | Q-Former-style Compression |
| --- | --- | --- |
| Token count | preserves patch count | compresses to a fixed small number |
| Visual fidelity | high, near lossless | some loss during compression |
| LLM sequence length | long, costly | short, cost saving |
| Document/OCR fit | favorable | relatively weaker |
| Natural-image captioning | suitable | suitable |
| Implementation complexity | low | high |
| Representative case | LLaVA family | BLIP-2 Q-Former |

| Item | Fixed Resolution | Arbitrary (Dynamic) Resolution |
| --- | --- | --- |
| Input handling | resize to fixed size | preserves original ratio/resolution |
| Small text/detail | risk of loss | well preserved |
| Token count | constant | scales with image size, variable |
| Document/chart understanding | weak | strong |
| Context cost | predictable | needs cap management |

Deep Dive 5: Tracing the Token Budget With Numbers

Abstract explanations do not build a sense of cost, so let us trace concrete numbers. Assume patch size 14 and vision encoder hidden dim 1024.

case A: ordinary photo 896 x 896

patches across = 896 / 14 = 64

patches down = 896 / 14 = 64

visual tokens = 64 x 64 = 4096

case B: wide document 1568 x 784

patches across = 1568 / 14 = 112

patches down = 784 / 14 = 56

visual tokens = 112 x 56 = 6272

case C: thumbnail 224 x 224

patches across/down = 16 x 16

visual tokens = 256

One thing becomes clear. With the same model, visual tokens swing from 256 to over 6000 depending on input image size. Put several images in a multi-turn conversation and visual tokens may take most of the context. That is why many models place downsampling such as 2 x 2 patch merging after the vision encoder, cutting the tokens passed to the LLM to one quarter.

patch merge (2x2) example

vision encoder output: 64 x 64 = 4096 tokens

after 2x2 merge: 32 x 32 = 1024 tokens (merge four into one)

This downsampling sacrifices a little detail but greatly reduces the LLM sequence. When small text matters as in documents, lower the merge strength; when the big picture matters as in ordinary photos, cut more aggressively, tuning to the task.
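
The 2 x 2 merge itself can be sketched as concatenating each block of four neighboring feature vectors into one longer vector. This sketch assumes a grid with even sides and omits the learned projection that typically follows to bring the merged vector back to the target width.

```python
def merge_2x2(grid):
    """Concatenate each 2x2 block of feature vectors into one longer vector.

    grid is rows x cols x dim (nested lists); rows and cols must be even.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % 2 or cols % 2:
        raise ValueError("grid sides must be even")
    return [
        [
            grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


grid = [[[0.0] * 4 for _ in range(64)] for _ in range(64)]   # 64 x 64 tokens, toy dim 4
merged = merge_2x2(grid)
print(len(grid) * len(grid[0]), "->", len(merged) * len(merged[0]))  # 4096 -> 1024
print(len(merged[0][0]))  # 16
```

The token count drops to one quarter while each merged token carries four times the raw width, so the information is rearranged rather than discarded at this step; any loss comes from the projection that follows.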

Deep Dive 6: Choosing the ViT Encoder and Its Pretraining

The vision encoder is not just any ViT; usually you bring an encoder pretrained with large-scale image-text contrastive learning. An encoder trained paired with text already emits visual representations somewhat aligned with language, so the burden on the following projector drops.

- **Contrastive pretrained encoder**: an encoder trained to pull images and captions into the same space. Gives semantically rich representations.

- **High-resolution adaptation**: if document work matters, add a stage that further adapts the encoder to high-resolution input.

- **Last layer vs intermediate layer**: which layer's features to pass to the projector is also a choice. The last layer tends to be semantic, intermediate layers carry finer visual detail, so they are sometimes combined per task.

The encoder's quality largely sets the ceiling of the whole VLM's performance. If the encoder cannot distinguish small text, no matter how smart the following LLM is, it cannot recover that information. So if you target documents and OCR, pay special attention to the resolution and expressiveness at the encoder stage.

Deep Dive 7: Extending to Multiple Images and Video

The principle of processing a single image extends directly to multiple images and video.

- **Multiple images**: turn each image into visual tokens independently, then interleave with text in order of appearance. The model distinguishes by position which text refers to which image.

- **Video**: sample frames at a fixed interval, turn each into visual tokens, and add a time-axis position component to give order. M-RoPE's time component is naturally used here.

video processing flow

[frame1][frame2]...[frameT] --> tokenize each

assign time component t=1,2,...,T --> preserve frame order

[question text] + [frame tokens] --> LLM decoder

Video tokens grow in proportion to frame count, so the key is managing the token budget by jointly adjusting the frame sampling interval and the per-frame resolution. For long videos, sample frames sparsely; for short, detail-critical videos, sample densely. This is a trade-off to tune per task.
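
A simple way to reason about that budget is to derive the frame count from the token budget and then spread the frames evenly. The function below is an illustrative sketch with hypothetical names; real pipelines often sample by timestamp and may merge tokens across frames.

```python
def sample_frames(total_frames, tokens_per_frame, token_budget):
    """Return evenly spaced frame indices that fit within the token budget."""
    max_frames = max(1, token_budget // tokens_per_frame)
    count = min(total_frames, max_frames)
    if count == 1:
        return [0]
    step = (total_frames - 1) / (count - 1)
    return [round(i * step) for i in range(count)]


# 300 frames at 256 tokens each with a 4096-token budget: 16 frames fit.
long_clip = sample_frames(300, 256, 4096)
print(len(long_clip), long_clip[0], long_clip[-1])  # 16 0 299

# A short clip fits entirely, so every frame is kept.
print(sample_frames(10, 256, 4096))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Lowering the per-frame resolution reduces `tokens_per_frame`, which lets more frames fit in the same budget.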

Practical View: What to Watch at Inference

The first thing you hit when serving a vision LLM is visual token cost. A single high-resolution document can take thousands of tokens, so context fills faster than with text alone.

- **Token caps**: arbitrary-resolution models cap max tokens per image to control context and latency.

- **Resolution-cost trade-off**: reading small text needs higher resolution, but that means more tokens, more cost, more latency. Find the balance for your task requirements.

- **Token pruning**: pruning low-importance visual tokens during inference to shorten the sequence is an active research area. Balancing information loss against speed is the key.

- **Preprocessing consistency**: keep the same patch size, normalization, and resize rules at inference as at training to avoid quality regressions.

Visual tokens occupy the KV cache alongside text tokens during decoding. So in multi-turn conversations that repeatedly reference the same image, the cache accumulates and memory pressure grows. Prefix cache reuse or image-embedding caching can ease this.

Deep Dive 11: Additional Considerations From a Serving View

When actually serving a vision LLM, a few details beyond text-LLM serving are added.

- **Image preprocessing cost**: decoding, resizing, and normalization also take time. With many large images, preprocessing can become a bottleneck, so split it into a separate worker or cache it.

- **Variable-length batching**: image sizes vary, so sequence length differs per request. The scheduler has to handle this well under continuous batching to sustain throughput.

- **Visual token KV cache**: when the same image is repeatedly referenced across turns, you can cache its visual tokens' KV to avoid recomputation.

- **Prefill share**: visual tokens are usually processed all at once in the prefill stage. With large images, prefill lengthens, increasing the latency to the first token.

vision LLM serving pipeline

[request: image + text]

[image preprocessing] (can be a separate worker)

[vision encoder + projector] generate visual tokens in prefill

[LLM prefill + decode] bundle with continuous batching

[response streaming]

The key is that adding the image path makes prefill cost and preprocessing cost larger than text-only. Image size caps, preprocessing separation, and caching are the main knobs of serving efficiency.
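
Of these knobs, caching is the easiest to sketch. The class below caches the output of the vision path keyed by a hash of the image bytes; the class name and the `encode` callback are hypothetical, and the cache is unbounded, so a real service would add eviction. It caches visual tokens (image embeddings), not the LLM's KV entries.

```python
import hashlib


class VisualTokenCache:
    """Reuse visual tokens for byte-identical images across requests."""

    def __init__(self, encode):
        self._encode = encode          # hypothetical: bytes -> visual tokens
        self._store = {}
        self.hits = 0

    def get(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = self._encode(image_bytes)
        return self._store[key]


calls = []

def fake_encode(image_bytes):          # stand-in for vision encoder + projector
    calls.append(len(image_bytes))
    return ["v"] * 4

cache = VisualTokenCache(fake_encode)
cache.get(b"same image")
cache.get(b"same image")
print(len(calls), cache.hits)  # 1 1
```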

Pitfalls and Troubleshooting

- **Resolution mismatch**: quality can collapse on extreme aspect ratios or ultra-high resolutions unseen during training. Normalize input size to the range the model tolerates.

- **Placeholder count mismatch**: if the number of placeholder tokens disagrees with the actual visual token count during interleaving, sequence alignment breaks. Always validate the count in the preprocessing pipeline.

- **Detail loss in compression models**: heavily compressing tokens with a Q-Former-style module smears small text and table cells. If OCR precision matters, consider a non-compressing or high-resolution path.

- **Position-encoding assumption violations**: when position-component computation diverges from training assumptions on video or multi-image input, spatial reasoning falls apart. Following the model's recommended preprocessing routine is the safe path.

- **Normalization statistics mismatch**: failing to normalize with the mean/std the vision encoder expects distorts features. Check the model card's preprocessing spec.

Deep Dive 7.5: The Details of Image Preprocessing

Half of visual token quality is decided in the preprocessing before it enters the model. It is often overlooked but greatly affects the result.

- **Resize interpolation**: fitting an image to the patch grid requires resizing. The interpolation method (bilinear, bicubic, etc.) changes the sharpness of small text.

- **Normalization**: the vision encoder expects input normalized with a specific mean and standard deviation. If these statistics are off, features distort and performance drops.

- **Padding and ratio**: when handling non-square images, you choose whether to fill with padding or keep the ratio with dynamic tokens.

- **Color space**: if RGB channel order or color space differs from training, subtle errors arise.

preprocessing checklist

[ ] same resize rule as training

[ ] same normalization statistics as training (mean/std)

[ ] matching channel order (RGB)

[ ] size divisible by patch size

[ ] matching aspect-ratio handling

If any one item on this checklist is off, the model produces worse results than it did in training, even with the same weights. The maxim of suspecting preprocessing before the model when debugging holds especially well for vision LLMs.
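
Parts of the checklist can be turned into cheap runtime checks. The sketch below covers normalization, patch divisibility, and channel order; the helper names are hypothetical and the mean/std values of 0.5 are placeholders, so take the real statistics from the model card.

```python
def normalize_pixel(rgb, mean, std):
    """Scale 0-255 RGB to [0, 1], then standardize per channel."""
    return [(v / 255.0 - m) / s for v, m, s in zip(rgb, mean, std)]


def check_input(height, width, patch, channel_order, expected_order="RGB"):
    """Return a list of human-readable problems; empty means the input passes."""
    problems = []
    if height % patch or width % patch:
        problems.append(f"size {height}x{width} not divisible by patch {patch}")
    if channel_order != expected_order:
        problems.append(f"channel order {channel_order}, expected {expected_order}")
    return problems


# Placeholder statistics for illustration only.
print(normalize_pixel([255, 0, 127.5], [0.5] * 3, [0.5] * 3))  # [1.0, -1.0, 0.0]
print(check_input(224, 224, 14, "RGB"))   # []
print(check_input(225, 224, 14, "BGR"))   # two problems reported
```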

Deep Dive 8: Fusion Method — Interleaving vs Cross-Attention

So far we centered on interleaving (or decoder fusion), where visual tokens are inserted into the same sequence as text. But that is not the only way to mix vision and language. There are two main families.

method A: interleaving (decoder fusion)

insert visual tokens into the same input sequence as text tokens

the LLM's self-attention processes both together

representative: LLaVA-like, Qwen2-VL-like

method B: cross-attention fusion

insert cross-attention blocks between LLM layers

text via self-attention, image info injected via cross-attention

representative: Flamingo-style gated cross-attention structures

- **Interleaving**: simple to implement, reuses the LLM structure almost as is. But visual tokens directly occupy context length.

- **Cross-attention**: injects visual info via a separate path, so it does not grow the text sequence length. In return, extra blocks go inside the LLM, complicating the structure.

Many recent open models tend to choose interleaving for its simplicity and power. But for high-resolution and video where context cost is large, the advantage of cross-attention-style injection is sometimes revisited. There is no single right answer; it depends on the task's token budget and implementation constraints.

Deep Dive 9: Clearing Up Commonly Confused Concepts

- **Are visual tokens and text tokens in the same embedding space?**: after passing through the projector, they are in a space of the same dimension. That is why the LLM can process both as one sequence. The semantic alignment, though, is built by training.

- **Is the vision encoder the same as OCR?**: no. The vision encoder only emits visual features; it does not explicitly read characters. The ability to read characters is learned together with the LLM from training data.

- **Is compressing tokens always a loss?**: it depends on the task. When the big picture matters compression is efficient; when small text matters it is a loss.

- **Is higher resolution always good?**: detail improves but tokens and cost grow, and at ultra-high resolution outside the training distribution quality can actually drop.

- **Why does position encoding matter?**: an image's core is 2D spatial information, but flattening into tokens erases it. Position encoding restores those spatial relations.

Clarifying these concepts lets you quickly grasp which design choices a new VLM's technical report made.

Deep Dive 9.5: A Decision Guide for Model Selection

Let us connect the design axes seen so far to actual model selection. Priorities change with task nature.

recommended direction by task

receipt/document OCR -> high resolution + non-compressing (MLP) projector + arbitrary resolution

general image captioning -> medium resolution + compression allowed, cost efficiency first

chart/table analysis -> high resolution + model trained for structured output

multi-image reasoning -> emphasize token efficiency (compression or patch merge)

video understanding -> frame sampling + time position encoding support

This guide is only a starting point; in practice, evaluating a few candidate models directly on your own data is the most reliable approach. Benchmark scores show only average tendencies; how a model behaves on your documents and your images has to be measured directly.

Ask yourself three core questions. First, do small text or fine details matter? If so, prioritize resolution and the non-compressing path. Second, how many images go into one request? If many, token efficiency governs cost. Third, is the output free text or a structured format? If structured is needed, you must pick a model trained for that format.

Deep Dive 10: Redrawing the Whole Picture

Finally, let us bundle the parts so far back into one big picture.

[original image]

| preprocessing: resize/normalize (same as training)

[patch split + patch embed] B x N_patch x D_vis

| add position embedding

[ViT vision encoder: self-attention x L layers] B x N_patch x D_vis

| (optional) downsample with 2x2 patch merge

[projector: MLP or Q-Former] B x N_v x D_llm

| insert at placeholder positions

[unified sequence: text + visual tokens] interleaving

| assign positions via M-RoPE etc., causal mask

[LLM decoder: self-attention x M layers]

| next-token prediction

[output: text / coordinates / structured]

This single flow is the whole of a vision LLM. Which design you choose at each stage determines the model's character. Encoder resolution, whether the projector compresses, position-encoding method, fusion strategy: turning these four knobs produces countless variants.

Closing

The heart of a vision LLM is, in the end, the transformation that turns an image into tokens. The ViT makes the image a patch sequence, the projector aligns it into the LLM's language space, and interleaving weaves text and visual tokens into one flow. Add devices like arbitrary resolution and M-RoPE, and the model can flexibly handle everything from dense-text documents to wide charts.

Design choices are always trade-offs. Compress tokens and it gets cheaper but loses detail; raise resolution and it reads better but costs more. Knowing what your task values most makes clear which model structure to pick. The next post covers how to train this structure, and how to teach its inputs and outputs.

References

- Qwen2-VL: Enhancing Vision-Language Model's Perception (arXiv: 2409.12191) — [arxiv.org/abs/2409.12191](https://arxiv.org/abs/2409.12191)

- Attention Is All You Need (arXiv: 1706.03762) — [arxiv.org/abs/1706.03762](https://arxiv.org/abs/1706.03762)

- FlashAttention: Fast and Memory-Efficient Exact Attention (arXiv: 2205.14135) — [arxiv.org/abs/2205.14135](https://arxiv.org/abs/2205.14135)

- Qwen official repository — [github.com/QwenLM](https://github.com/QwenLM)

- Hugging Face Transformers docs — [huggingface.co/docs](https://huggingface.co/docs)

- PyTorch official docs — [pytorch.org](https://pytorch.org)

- vLLM docs (including multimodal serving) — [docs.vllm.ai](https://docs.vllm.ai)

- vLLM repository — [github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)