Chaos and Order

Introduction

The first question anyone operating LLM inference asks is "why is this so slow?" The same GPU that sustained enormous compute throughput during training emits tokens frustratingly slowly when the model actually generates text. This is not because the implementation is wrong, but because of the fundamental nature of the decode operation.

In this post we first make clear why decode is slow, then take a detailed look at speculative decoding, the representative technique for working around that limit. We then cover variants such as Medusa, n-gram, and EAGLE, system-level optimizations such as chunked prefill and prefill/decode disaggregation, and finally the latency versus throughput trade-off along with how to measure it. The goal is to develop a clear sense of "which knob makes what faster."

Why decode is slow

An LLM generates tokens one at a time, autoregressively. To produce a single token, the entire model must run one forward pass. To produce N tokens you therefore need N forwards, and since each forward depends on the immediately preceding token, they are sequential. The steps of a single sequence cannot be run in parallel.

The more fundamental problem is that each forward is memory bound.

When generating 1 token:

- Reads the entire model weights (tens of GB) from memory

- Reads the KV cache

- The actual amount of computation is only that of 1 token

Result: The GPU compute units sit idle, and memory bandwidth is the bottleneck.

No matter how high the compute capability, the memory read speed is the limit.
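A back-of-envelope estimate makes the imbalance concrete. The sketch below assumes roughly 2 FLOPs per parameter per generated token and ignores the KV cache; the function name `decode_step_times` and all hardware figures are illustrative assumptions, not measurements of any specific GPU.

```python
def decode_step_times(params: float, bytes_per_param: float,
                      mem_bandwidth: float, peak_flops: float) -> tuple[float, float]:
    """Return (memory_time, compute_time) in seconds for one single-token forward."""
    weight_bytes = params * bytes_per_param
    memory_time = weight_bytes / mem_bandwidth   # time to stream the weights once
    compute_time = 2.0 * params / peak_flops     # ~2 FLOPs per parameter per token
    return memory_time, compute_time


# Illustrative numbers only (assumptions, not measurements):
# 70e9 parameters at 2 bytes each, 2e12 B/s bandwidth, 4e14 FLOP/s peak compute.
mem_t, cmp_t = decode_step_times(70e9, 2, 2e12, 4e14)
print(f"memory: {mem_t * 1e3:.1f} ms, compute: {cmp_t * 1e3:.2f} ms")
```

Under these assumed numbers the weight read takes about two orders of magnitude longer than the arithmetic, which is what "memory bound" means in practice.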

An important insight emerges here. Since decode is memory bound, doing more useful work while you are reading memory once yields a nearly free gain. Batching (handling several requests at once) and speculative decoding (verifying several tokens at once) both exploit this principle.

How Speculative Decoding Works

The idea behind speculative decoding is simple yet clever. A small, fast "draft model" guesses several next tokens in advance, and a large, accurate "target model" verifies those guesses in parallel with a single forward.

Normal decoding (slow):

target model forward -> token 1

target model forward -> token 2

target model forward -> token 3

(3 forwards)

speculative decoding:

1) The draft model quickly guesses: [t1', t2', t3', t4']

2) The target model verifies all 4 at once with 1 forward

3) Tokens that match from the front are accepted, stopping at the first mismatch

e.g.) t1', t2' accepted, mismatch at t3' -> t3 is corrected by the target

(1 target forward confirms 3 tokens here: t1', t2', and the corrected t3)

The key is that verification is parallel. The target model can check K guessed tokens simultaneously with a single forward. Because decode is memory bound, that forward costs about the same as one that processes a single token: the weights are read once either way, and the extra tokens ride along on otherwise idle compute. When the guesses are good, several tokens are confirmed per forward.
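The accept-until-first-mismatch rule can be sketched for the greedy case as follows. This is a toy under stated assumptions: `draft` and `target` are hypothetical callables that return the greedy next token for a context, and the verification loop only emulates what a real system obtains from one batched target forward.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a context to its greedy next token


def speculative_step(context: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """One draft-then-verify round under greedy decoding; returns confirmed tokens."""
    # 1) The draft proposes k tokens autoregressively.
    guesses: List[int] = []
    for _ in range(k):
        guesses.append(draft(context + guesses))

    # 2) The target scores every prefix. A real system does this in ONE batched
    #    forward; the loop here only emulates that result.
    confirmed: List[int] = []
    for i in range(k):
        expected = target(context + guesses[:i])
        if guesses[i] != expected:
            confirmed.append(expected)  # first mismatch: take the target's token
            return confirmed
        confirmed.append(guesses[i])
    # 3) All k accepted: the same forward also yields one extra target token.
    confirmed.append(target(context + guesses))
    return confirmed


# Toy models (hypothetical): the target counts upward; the draft errs after a 2.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: 99 if ctx[-1] == 2 else ctx[-1] + 1
print(speculative_step([0], draft, target, k=4))  # -> [1, 2, 3]
```

Two guesses are accepted and the third is replaced by the target's own token, so this round confirms three tokens for the price of one target forward.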

An important guarantee is that speculative decoding does not change the output distribution. The verification step is designed to follow the target model's distribution exactly, so the result is statistically identical to what the target model would have generated on its own. In other words, you gain speed without quality loss. The speedup varies with the acceptance rate and the model pairing, but roughly 2x to 3x is commonly reported in memory-bound situations.
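For sampled (non-greedy) decoding, the usual way to get this guarantee is a rejection-sampling rule: accept a drafted token with probability min(1, p/q), where p and q are the target and draft probabilities of that token, and on rejection resample from the normalized residual max(p - q, 0). The sketch below applies that rule to a single position with made-up toy distributions; `verify_token` and the `Dist` alias are names introduced here for illustration.

```python
import random
from typing import Dict, Optional

Dist = Dict[str, float]  # token -> probability


def verify_token(token: str, p_target: Dist, q_draft: Dist,
                 rng: random.Random) -> Optional[str]:
    """Return None if the drafted token is accepted, else a replacement token."""
    p, q = p_target.get(token, 0.0), q_draft[token]
    if rng.random() < min(1.0, p / q):
        return None  # accept the draft's token
    # Rejected: resample from the normalized residual max(p - q, 0).
    residual = {t: max(p_target[t] - q_draft.get(t, 0.0), 0.0) for t in p_target}
    tokens, weights = zip(*residual.items())
    return rng.choices(tokens, weights=weights)[0]


# Empirical check with toy distributions (assumed values).
rng = random.Random(0)
p = {"a": 0.6, "b": 0.3, "c": 0.1}
q = {"a": 0.2, "b": 0.7, "c": 0.1}
counts = {t: 0 for t in p}
for _ in range(20000):
    drafted = rng.choices(list(q), weights=list(q.values()))[0]
    out = verify_token(drafted, p, q, rng) or drafted
    counts[out] += 1
print({t: round(c / 20000, 2) for t, c in counts.items()})
```

Even though the draft distribution is badly skewed toward "b", the output frequencies should land close to the target probabilities, which is the sense in which the output distribution is unchanged.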

Variants: Medusa, n-gram, EAGLE

Maintaining a separate draft model can be burdensome, so various variants have appeared. We will touch only on the concepts.

- **Medusa**: Without a separate draft model, it attaches several additional prediction heads to the target model. Each head predicts a future token simultaneously, and those candidates are verified in a tree form. The advantage is that there is no separate model to manage.

- **n-gram (lookahead family)**: Without any model, it uses n-grams that have already appeared in the text so far as a lookup table to guess the next tokens. It works well when patterns repeat strongly, as in code or repetitive text.

- **EAGLE**: An approach that makes the draft stage more sophisticated, predicting the next token at the level of the model's intermediate representation (feature) to raise the guess acceptance rate. The higher the acceptance rate, the more tokens are accepted, so the acceleration effect grows.

Their common goal is one thing: to raise the acceptance rate of the guesses so that more tokens are confirmed with a single forward of the target model.
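Of the three, the n-gram idea is simple enough to sketch in a few lines. The version below is one plausible reading of it, not any particular library's implementation: it looks for the most recent earlier occurrence of the last n tokens and proposes whatever followed. The function name `ngram_draft` and its parameters are hypothetical.

```python
from typing import List


def ngram_draft(tokens: List[str], n: int, k: int) -> List[str]:
    """Propose up to k tokens by copying what followed the last n-gram earlier."""
    if len(tokens) < n:
        return []
    key = tokens[-n:]
    # Search backwards for the most recent earlier occurrence of the last n tokens.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == key:
            return tokens[start + n:start + n + k]
    return []  # no match: fall back to normal decoding


history = ["for", "i", "in", "range", "(", "n", ")", "for", "j", "in", "range"]
print(ngram_draft(history, n=2, k=3))  # -> ['(', 'n', ')']
```

The proposed tokens still go through target verification like any other draft, so a bad lookup costs only wasted guesses, never correctness.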

System-level optimization: chunked prefill

Where speculative decoding speeds up decode itself, other optimizations work at the system level to overlap prefill and decode more effectively.

Prefill is compute bound, decode is memory bound. Mixing the two stages in the same batch therefore lets each fill the other's idle resources. The problem is that when a long prompt's prefill is processed all at once, the decode of other in-progress requests stalls for that whole time, and response latency spikes.

Chunked prefill splits a long prefill into several pieces and processes one piece together with the pending decode work at each step.

Without chunked prefill:

[long prompt prefill all at once] ... other request decode halts meanwhile

-> latency of in-progress requests spikes

With chunked prefill:

step 1: [prefill piece A] + [decode of requests]

step 2: [prefill piece B] + [decode of requests]

step 3: [prefill piece C] + [decode of requests]

-> prefill is streamed through while decode steadily proceeds

This way, even when a long prompt comes in, other users' token generation does not stall, so latency stays stable.
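The scheduling idea can be sketched with a per-step token budget. This is a simplified model under assumptions: a fixed budget per step, decode requests each consuming one token of it, and a single prompt being prefilled. Real schedulers handle many prompts and more constraints; `plan_steps` and `token_budget` are names introduced here.

```python
from typing import List, Tuple


def plan_steps(prompt_len: int, decode_requests: int,
               token_budget: int) -> List[Tuple[int, int]]:
    """Return per-step (prefill_tokens, decode_tokens) until the prompt is consumed."""
    if token_budget <= decode_requests:
        raise ValueError("budget must leave room for at least one prefill token")
    steps = []
    remaining = prompt_len
    while remaining > 0:
        # Decode goes first: every running request gets its one token per step.
        chunk = min(remaining, token_budget - decode_requests)
        steps.append((chunk, decode_requests))
        remaining -= chunk
    return steps


print(plan_steps(prompt_len=1000, decode_requests=16, token_budget=512))
# -> [(496, 16), (496, 16), (8, 16)]
```

A smaller budget keeps each step short and decode latency steady, at the cost of more steps before the new prompt produces its first token.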

prefill/decode disaggregation

Going one step further, processing prefill and decode on entirely different GPUs (or different instances) is disaggregation.

disaggregation structure:

[prefill-dedicated node] --(KV cache transfer)--> [decode-dedicated node]

prefill node: optimized for compute bound work, short and intense

decode node: optimized for memory bound work, long and continuous

Since the two stages have different resource characteristics, each can be optimized and scaled independently. If the prefill load surges, you scale only the prefill nodes; if there is a lot of long generation, you scale only the decode nodes. The downsides are the cost of transferring the KV cache between nodes and the system complexity. It is an advanced technique to consider when squeezing resource efficiency to the extreme in large-scale serving.
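The transfer cost can be estimated from the cache size. The sketch below uses the standard per-sequence accounting (keys and values for every layer, KV head, and position); the model shape and the link speed are hypothetical values chosen only for illustration.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one sequence: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape and link speed, for illustration only.
size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096)
link_bytes_per_s = 25e9
print(f"{size / 2**20:.0f} MiB, ~{size / link_bytes_per_s * 1e3:.1f} ms to transfer")
```

With these assumed numbers a single 4096-token prompt carries 512 MiB of cache, so the transfer time is far from negligible and adds directly to the time before decode can start.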

Combining with batching and quantization

The techniques so far are not mutually exclusive. On the contrary, they are most effective when combined.

- **Batching**: Since decode is memory bound, bundle several requests to process several tokens with a single weight read. It is the most basic means of raising throughput.

- **Quantization**: Storing weights and KV at lower precision reduces the amount of memory read, so decode gets faster.

- **speculative decoding**: Confirms several tokens with a single forward.

When used together, their gains tend to compound, but not without limit. For example, if the batch is already large enough that the GPU approaches compute bound, the gain from speculative decoding shrinks, because speculative decoding is most effective when memory bound (that is, when the batch is small). The combination of techniques must therefore be balanced to fit the workload.

Latency vs throughput trade-off

The most important tension in inference optimization is between latency and throughput. These two often move in opposite directions.

If you increase the batch:

throughput (total tokens per second) goes up

but individual request latency also goes up (each request waits on the large batch)

If you decrease the batch:

individual request latency goes down

but throughput also goes down (the GPU is filled less)

Which to prioritize depends on the nature of the service. If one person's response speed matters, as in real-time conversation, prioritize latency; if overall throughput matters, as in large batch processing, prioritize throughput. Since you cannot maximize both at the same time, you must first decide the service's goal and turn the knobs accordingly.
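A toy roofline model shows the shape of this trade-off. It assumes a decode step costs whichever is larger, the one-time weight read or the per-sequence compute, and it ignores queueing and KV cache reads. The cost constants are assumptions, not measurements.

```python
def step_time(batch: int, weight_time: float, per_token_compute: float) -> float:
    """Toy roofline: a step costs whichever is larger, the weight read or the math."""
    return max(weight_time, batch * per_token_compute)


# Assumed costs: 20 ms to stream the weights, 0.25 ms of compute per sequence.
for batch in (1, 8, 64, 256):
    t = step_time(batch, weight_time=0.020, per_token_compute=0.00025)
    print(f"batch={batch:3d}  per-token latency={t * 1e3:5.1f} ms  "
          f"throughput={batch / t:7.0f} tok/s")
```

In this model throughput grows almost for free while the step stays memory bound, then flattens once compute dominates, and from that point every extra request only adds latency.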

Measurement: TTFT, TPOT, throughput

To optimize, you must first measure properly. There are three core metrics for LLM serving.

TTFT (Time To First Token):

The time from sending a request until the first token comes out.

Mainly governed by prefill speed and the queue.

The perceived "speed at which a response starts" in conversational UX.

TPOT (Time Per Output Token):

After the first token, the average time it takes to produce one token.

Mainly governed by decode speed.

Determines how smooth the streaming is.

Throughput:

The total number of tokens the whole system processes per second.

Governed by batching and concurrency. A measure of cost efficiency.

These three metrics measure different things. Even if TTFT is good, a bad TPOT means the first token arrives quickly and the rest trickles out. Even if throughput is high, TTFT can be bad because of a large batch. So do not look at a single number; look at the three metrics together and judge according to what your own service prioritizes.
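Given per-token arrival timestamps, the three metrics follow directly from the definitions above. This is a minimal sketch: the function names are introduced here, and timestamps are assumed to be in seconds on a single clock.

```python
from typing import List, Tuple


def ttft_and_tpot(request_sent: float, token_times: List[float]) -> Tuple[float, float]:
    """Compute TTFT and TPOT (seconds) from one request's token arrival timestamps."""
    if not token_times:
        raise ValueError("need at least one generated token")
    ttft = token_times[0] - request_sent
    if len(token_times) == 1:
        return ttft, 0.0
    # TPOT averages the gaps after the first token, so TTFT does not leak into it.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


def throughput(total_tokens: int, wall_seconds: float) -> float:
    """System throughput: all tokens from all requests over the measured window."""
    return total_tokens / wall_seconds


print(ttft_and_tpot(0.0, [0.5, 0.75, 1.0, 1.25]))  # -> (0.5, 0.25)
```

Keeping the first token out of the TPOT average matters: otherwise a slow prefill would be misread as slow decode.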

The guess acceptance rate governs everything

The gain from speculative decoding depends almost entirely on how often the guesses are correct. Let us reason about it intuitively. If the draft guesses K tokens at a time and on average a of them are accepted, then a single forward of the target model confirms on average a+1 tokens: the a accepted guesses plus one token supplied by the target itself (the correction at the first mismatch, or one extra token when every guess is accepted).

Intuition for acceleration (conceptual):

draft guesses K -> verified with 1 target forward

if the average accepted count is a -> about (a+1) tokens confirmed per forward

high acceptance (large a) -> many tokens per forward -> large acceleration

low acceptance (small a) -> only the draft cost is paid with little gain

Here you must look at two costs together: the cost of running the draft model, and the cost of the target verifying. If the draft is too heavy, the guessing itself becomes expensive; if too light, the acceptance rate drops. That is why the draft model is usually a much smaller model than the target (e.g., tens of times smaller than the target). The guess length K is also a knob. If you make K too large, the later guesses are almost always wrong and become wasted effort; if too small, few tokens are confirmed at once.

Guess length K trade-off:

K small -> verification is cheap but few tokens per forward

K large -> many potential tokens per forward but the tail almost always misses

-> the appropriate K depends on the draft-target alignment
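A common simplification makes both knobs quantitative. Assume each guess is accepted independently with probability alpha; then the expected tokens per target forward is (1 - alpha^(K+1)) / (1 - alpha), and dividing by the relative cost of one round, K draft forwards plus one target forward, gives an estimated speedup. The independence assumption and the `draft_cost` ratio are modeling choices, so treat the output as a rough guide only.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens confirmed per target forward (accepted guesses + 1)."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Estimated speedup; draft_cost = one draft forward / one target forward."""
    return expected_tokens(alpha, k) / (k * draft_cost + 1.0)


for alpha in (0.5, 0.8):
    best_k = max(range(1, 11), key=lambda k: speedup(alpha, k, draft_cost=0.05))
    print(f"alpha={alpha}: best K={best_k}, "
          f"speedup={speedup(alpha, best_k, 0.05):.2f}x")
```

In this model a higher acceptance rate pushes the best K upward, and a low acceptance rate combined with a heavy draft drives the ratio below 1, which is a net slowdown.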

The key lesson is that speculative decoding is not a magic switch. The acceptance rate varies with the workload and model combination, and if the acceptance rate is low, it is actually a loss. The right thing to do is to measure the acceptance rate on your own traffic before turning it on.

Why it is so effective when memory bound

It is worth digging deeper into the point that speculative decoding is especially effective when the batch is small (when memory bound).

When the batch is small (memory bound):

The GPU compute units are mostly idle

-> even verifying K tokens in parallel in the target forward

the extra computation cost is barely felt (idle resources anyway)

-> the gain from speculative decoding is large

When the batch is large (close to compute bound):

The GPU compute units are already busy

-> the extra computation of verifying K in parallel becomes a real cost

-> the gain shrinks

This is important because it means that even for the same system, the value of speculative decoding changes with the traffic situation. In quiet periods (small batch) it gives a large gain, and in busy periods (large batch) it gives a small gain. Some systems exploit this to dynamically turn speculative decoding on and off depending on the batch size.
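The same toy roofline reasoning gives a rough on/off policy. The sketch assumes a step costs the larger of the weight read and the per-token compute, that verification processes K+1 tokens per sequence, and that draft cost is negligible. `spec_gain`, `should_speculate`, and every constant are hypothetical.

```python
def spec_gain(batch: int, k: int, tokens_per_forward: float,
              weight_time: float, per_token_compute: float) -> float:
    """Toy estimate of speculative speedup at a given batch size (draft cost ignored)."""
    plain = max(weight_time, batch * per_token_compute)
    # Verification pushes k drafted tokens plus the last confirmed one per sequence.
    verify = max(weight_time, batch * (k + 1) * per_token_compute)
    return tokens_per_forward * plain / verify


def should_speculate(batch: int, **kwargs) -> bool:
    """Hypothetical policy: enable speculation only when the estimate beats 1x."""
    return spec_gain(batch, **kwargs) > 1.0


cfg = dict(k=4, tokens_per_forward=3.0, weight_time=0.020, per_token_compute=0.00025)
for batch in (1, 16, 64, 256):
    print(batch, round(spec_gain(batch, **cfg), 2), should_speculate(batch, **cfg))
```

With these assumed costs the estimate stays at the full gain while verification fits under the weight-read time, then drops below 1x once the extra verification compute becomes the bottleneck.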

Batch size scheduling and the operating point

The practical way to handle the latency-throughput trade-off is to adjust the batch size and the waiting policy. Whether to process a request immediately when it arrives, or to gather requests briefly to form a larger batch, is the knob.

Waiting (batching) policy:

process immediately -> minimal latency, but fills the GPU less so throughput is low

gather briefly -> high throughput, but adds latency equal to the gathering time

Finding the operating point:

Set an acceptable upper bound on TTFT/TPOT as an SLA

-> within that, maximize throughput by growing the batch as much as possible

The key is the order: "set the SLA first, then maximize throughput within that limit." If you chase throughput alone without a latency cap, the user experience collapses; if you chase latency alone while ignoring throughput, GPU costs explode. Finding the balance point of the two metrics is the core task of serving engineering.
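Once you have measurements per batch size, the "SLA first, then throughput" rule is a small selection problem. The table below is made-up data, and `pick_batch` is a name introduced for this sketch.

```python
from typing import Dict, Optional, Tuple

# batch size -> (p95 TPOT in ms, throughput in tokens/s); hypothetical measurements
Measurements = Dict[int, Tuple[float, float]]


def pick_batch(measured: Measurements, tpot_sla_ms: float) -> Optional[int]:
    """Among batch sizes that meet the TPOT SLA, return the highest-throughput one."""
    ok = {b: thr for b, (tpot, thr) in measured.items() if tpot <= tpot_sla_ms}
    if not ok:
        return None  # no setting meets the SLA; revisit hardware or the SLA itself
    return max(ok, key=ok.get)


measured = {8: (22.0, 360.0), 32: (28.0, 1100.0), 64: (41.0, 1500.0), 128: (75.0, 1700.0)}
print(pick_batch(measured, tpot_sla_ms=50.0))  # -> 64
```

Batch 128 has the highest raw throughput but violates the 50 ms cap, so the largest batch inside the limit wins.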

Measurement tools and load testing

To measure metrics properly, you must mimic a realistic load. The most common mistake when using a synthetic load is making the input/output lengths of all requests identical. Real traffic has a wide length distribution, and this distribution governs the efficiency of continuous batching.

Realistic load testing checklist:

1) Make the input length distribution similar to reality (from short to long)

2) Vary the output length distribution as well

3) Measure while raising concurrency in stages

4) Check not only p50 but also the p95/p99 tail latency

5) Run long enough to measure the steady state after warmup
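Items 1 and 2 of the checklist can be approximated with a skewed length sampler. The log-normal shape and its parameters here are placeholders chosen for illustration; ideally you would fit them to, or replay, your own traffic logs.

```python
import random
from typing import List, Tuple


def sample_requests(n: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Draw (input_len, output_len) pairs from skewed, long-tailed distributions."""
    rng = random.Random(seed)
    requests = []
    for _ in range(n):
        # Log-normal lengths: mostly short, occasionally very long. The
        # parameters are placeholders; fit them to your own traffic logs.
        input_len = max(1, min(8192, int(rng.lognormvariate(6.0, 1.0))))
        output_len = max(1, min(2048, int(rng.lognormvariate(5.0, 0.8))))
        requests.append((input_len, output_len))
    return requests


reqs = sample_requests(1000)
inputs = sorted(r[0] for r in reqs)
print("input length p50:", inputs[500], "p99:", inputs[990])
```

Fixing the seed keeps runs comparable, so a change in the measured numbers reflects the system rather than a different draw of requests.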

The tail latency (p95, p99) is especially important. Even if the average looks good, some users may be experiencing very long response latency. The tail grows when a long prompt blocks other requests, or when the queue occasionally backs up. If you look only at the average, you miss these problems.

The order in which to read the metrics:

Look at "capacity" via throughput

-> look at the "typical experience" via the p50 of TTFT/TPOT

-> look at the "worst experience" via p95/p99

You must look at all three layers to see the true state
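A small percentile helper is enough to read the typical and worst layers side by side. This sketch uses the nearest-rank definition of a percentile; the sample latencies are invented to show how a median can hide a slow tail.

```python
import math
from typing import Dict, List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Report the typical (p50) and tail (p95, p99) latency together."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}


# 94 fast requests and 6 slow ones: the median hides the tail entirely.
samples = [100.0] * 94 + [2500.0] * 6
print(summarize(samples))  # -> {'p50': 100.0, 'p95': 2500.0, 'p99': 2500.0}
```

The p50 reports 100 ms while 6% of users wait 25 times longer, which is the kind of gap an average or median alone would not reveal.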

Pitfalls and troubleshooting

- **I turned on speculative decoding but it got slower**: The guess acceptance rate is too low, or the draft model is too heavy. Check the balance between acceptance rate and draft cost. If the batch is already large and compute bound, the gain is small.

- **TTFT is erratic**: A long prompt's prefill is blocking other requests. Consider chunked prefill.

- **I raised throughput but user complaints increased**: Growing the batch worsened latency. Look at TPOT and TTFT together and rebalance.

- **The benchmark numbers differ from reality**: The synthetic load's input/output length distribution differs from reality. Measure with real traffic samples.

- **I added disaggregation but it only got more complex**: If the scale is not large enough, the KV transfer cost and complexity offset the gain. First weigh whether you really need that scale.

Closing

Decode is slow because it is fundamentally memory bound, and almost every inference acceleration technique starts from this fact. Speculative decoding confirms several tokens with a single forward, getting more out of each pass over the weights. Medusa, n-gram, and EAGLE are variants that aim to raise the acceptance rate or cut the cost of drafting, and chunked prefill and disaggregation overlap and divide resources better at the system level.

All of this ultimately rests on the balance between latency and throughput. The right answer varies with the service's goal, and only by measuring TTFT, TPOT, and throughput together can you finally turn the right knob. Inference acceleration is not a parade of flashy techniques, but the work of seeing the bottleneck precisely and choosing the right tool for it.
