💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.

원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

Introduction

Training an LLM and serving one are completely different problems. Training only needs to be done well once, but serving must take in user requests every moment and emit tokens. Even with the same GPU, throughput can differ by several times depending on how you compose the serving stack. With a single H100, some people produce hundreds of tokens per second, while others produce thousands.

What makes that difference is the inference serving engine. As of 2026, the three most frequently discussed in practice are vLLM, SGLang, and TensorRT-LLM. In this article, we first look at why LLM inference is so tricky at a fundamental level, compare the philosophies each of the three engines uses to solve the problem, and then organize what to actually choose and how to deploy it.

The goal of this article is not to say that one thing is unconditionally the answer. That is because the right answer changes depending on the nature of the workload. Establishing the criteria for that judgment is the key.

Core principle: inference happens in two stages

The process by which an LLM generates text is split into two stages with completely different characteristics. If you do not understand this distinction, you miss half of serving optimization.

The prefill stage

Prefill is the stage that processes the entire input prompt at once. If the prompt the user sent is 1,000 tokens, these 1,000 tokens are passed through attention simultaneously to compute the KV (key/value) at each position and produce the first output token.

Prefill can perform as much computation in parallel as the number of input tokens. In other words, it can fill the GPU's compute units, so it has a strong **compute-bound** character. This is because the matrix multiplications happen in a large and dense manner.

The decode stage

Decode is the stage that generates tokens one at a time. To make one token, you must read the KV of all tokens generated so far and compute attention. Yet at each step, only a single new token is processed.

The problem lies here. To make one token, you must read the model's entire weights (tens of GB) and the KV cache accumulated so far from memory, but the actual amount of computation is only for that single token. In other words, it is **memory-bound**. Most of the GPU's compute units sit idle, and memory bandwidth becomes the bottleneck.

prefill: process N input tokens at once -> compute-bound (GPU compute units saturated)

decode: generate 1 token at a time -> memory-bound (reading weights/KV is the bottleneck)

life of a request:

[prompt] --prefill--> [first token] --decode--> [token] --decode--> ... [EOS]

This difference in nature between the two stages determines everything about serving optimization. Because decode is memory-bound, it is so important to batch multiple requests together to fill the GPU, to reduce the amount of memory reads through quantization, and to manage the KV cache efficiently.

Going deeper 1: continuous batching

Traditional batching (static batching) gathers multiple requests into one batch and processes them together, then waits until all requests in the batch finish before starting the next batch. The problem is that generation length varies wildly per request. One request makes 10 tokens, another makes 1,000. In static batching, even when short requests finish early, the GPU sits idle waiting for the long requests.

Continuous batching (also called in-flight batching) solves this problem elegantly. It dynamically reconstructs the batch at every decode step. Finished requests are immediately removed from the batch, and waiting new requests are inserted in their place.

static batching (waste occurs):

step-> 1 2 3 4 5 6 7 8

req A ■ ■ ✓ . . . . . <- finished at step 3 but the slot is held until 8

req B ■ ■ ■ ■ ■ ■ ■ ✓

req C ■ ■ ■ ✓ . . . . <- the empty slot leaves the GPU idle

continuous batching (slots reused immediately):

step-> 1 2 3 4 5 6 7 8

req A ■ ■ ✓

req D ■ ■ ■ ✓ <- D enters the spot A vacated

req B ■ ■ ■ ■ ■ ■ ■ ✓

req C ■ ■ ■ ✓

req E ■ ■ ■ ■ <- E enters the spot C vacated

As of 2026, continuous batching is a standard feature of all major serving engines. It is the most basic yet powerful technique for raising throughput.

Going deeper 2: paged KV cache

In the decode stage, the KV cache keeps growing as the sequence gets longer. The traditional approach reserves a contiguous block of memory equal to the "maximum length" in advance for each request. But the actual generation length cannot be known ahead of time. If you reserved up to 2,048 tokens but only generated 50 tokens in reality, the remaining space is wasted wholesale.

PagedAttention applies the operating system's virtual memory concept to the KV cache. It divides the KV cache into small fixed-size blocks (pages) and allocates blocks as needed. They may be logically contiguous but physically scattered. A block table maps logical positions to physical positions.

traditional approach (contiguous pre-allocation):

req A: [■■■□□□□□□□□□□□□□] <- reserve 16 slots, use only 3, waste 13

req B: [■■■■■□□□□□□□□□□□] <- reserve 16 slots, use only 5

paged approach (dynamic allocation per block):

block pool: [b0][b1][b2][b3][b4][b5]...

req A block table: b0 -> b3 (only as much as needed)

req B block table: b1 -> b2 -> b4 (physically scattered is OK)

This approach has two effects. First, memory fragmentation nearly disappears, so the same GPU memory can handle far more concurrent requests. Second, because management is per block, when multiple requests share the same prefix, that block can be shared. PagedAttention is the technique vLLM first popularized, and it has now effectively become standard.

Framework comparison

Now let us look at the three engines one by one. Each has a different starting point and strengths.

vLLM

vLLM is the project that introduced PagedAttention to the world, and it is the most general-purpose choice. It quickly supports a wide range of model architectures and runs on diverse hardware (not only NVIDIA but also AMD and other accelerators). It ships an OpenAI-compatible API server by default, making integration easy.

Its hallmark is "balance." Rather than being extremely optimized for one particular workload, it delivers good performance broadly across many situations. When a new model comes out, it is often the first to support it, so it is reliable when you need to quickly stand up the latest model. Compared to a well-tuned TensorRT-LLM, its throughput may be somewhat lower in certain scenarios, but that difference is usually not large.

TensorRT-LLM

TensorRT-LLM is a compilation-based engine made by NVIDIA. It pre-compiles the model into an engine optimized for NVIDIA GPUs, pushing kernel fusion and precision optimization to the extreme. As a result, with well-supported models and properly tuned settings, it has been reported to show roughly 15-30% higher throughput than vLLM on NVIDIA GPUs such as the H100 (the variance is large depending on model and settings).

The price is flexibility. A step to compile the engine is required, and support for new models or unusual configurations may not be as fast as vLLM. It is also tied to the NVIDIA ecosystem. If your goal is "to run a fixed model at maximum efficiency on NVIDIA GPUs for a long time," it is a strong choice.

SGLang

SGLang maximizes prefix cache reuse with a technique called RadixAttention. When multiple requests share a common front portion (a system prompt, few-shot examples, multi-turn conversation history), it reuses the KV computation for that portion. RadixAttention manages prefixes in a radix tree to automatically find the shareable portions.

So SGLang shines especially in multi-turn conversations or agent workloads that repeatedly use the same system prompt. In such environments, thanks to prefix reuse, there are cases where you gain roughly 10-20% additional benefit (it depends heavily on the proportion of shared prefix). It is also strong at structured generation and complex prompt programs.

Comparison table

| --- | --- | --- | --- |

Selection guide

Looking at the table alone can be daunting, so let us organize it by real situations.

- **If you need to stand up quickly and experiment with diverse models**: vLLM. Integration is easy and almost all models run right away.

- **If your model is fixed and you must squeeze out maximum cost efficiency on NVIDIA GPUs**: TensorRT-LLM. It is worth bearing the compilation cost.

- **If prefixes overlap heavily, as in a multi-turn chatbot or agent**: SGLang. The effect of RadixAttention is large.

- **If you are unsure and just need to start**: Begin with vLLM, and once the bottleneck becomes clear, benchmark other engines.

The important thing is to measure directly with your own workload. Someone else's benchmark numbers are only a starting point; results change greatly depending on the input/output length distribution and the concurrency level.

In practice: deployment config examples

The following is an example of standing up vLLM's OpenAI-compatible server.

python -m vllm.entrypoints.openai.api_server \

--model meta-llama/Llama-3.1-8B-Instruct \

--max-model-len 8192 \

--gpu-memory-utilization 0.90 \

--max-num-seqs 256 \

--enable-chunked-prefill

If you deploy to Kubernetes, it takes the form of the following Deployment.

apiVersion: apps/v1

kind: Deployment

metadata:

name: vllm-llama

spec:

replicas: 2

selector:

matchLabels:

app: vllm-llama

template:

metadata:

labels:

app: vllm-llama

spec:

containers:

- name: vllm

image: vllm/vllm-openai:latest

args:

- "--model"

- "meta-llama/Llama-3.1-8B-Instruct"

- "--gpu-memory-utilization"

- "0.90"

- "--max-num-seqs"

- "256"

resources:

limits:

nvidia.com/gpu: "1"

ports:

- containerPort: 8000

The SGLang server can be stood up similarly.

python -m sglang.launch_server \

--model-path meta-llama/Llama-3.1-8B-Instruct \

--mem-fraction-static 0.85 \

--context-length 8192

There are three key tuning parameters in common. The GPU memory utilization ratio (the higher it is, the more KV cache space there is, increasing concurrency), the maximum number of concurrent sequences, and the maximum context length. If you raise the memory utilization ratio too high, a sudden long request can cause an OOM, so you must leave headroom.

Going deeper 3: quantization and serving

Recalling once more that decode is memory-bound, every technique that reduces the amount of data read from memory leads directly to a speed improvement. Quantization is exactly that. If you store weights at a lower precision instead of FP16, the memory bandwidth required to read the same weights decreases.

weight memory by precision (assuming an 8B model, approximate):

FP16 : 2 bytes per element -> about 16 GB

INT8 : 1 byte per element -> about 8 GB

INT4 : 0.5 bytes per element -> about 4 GB

since decode reads weights at every token,

if the data to read is halved, the memory bottleneck is eased by half.

As of 2026, the precisions frequently used in serving are FP8 and INT4. FP8 receives hardware support from H100-and-later generation GPUs, so it has small accuracy loss yet a large throughput gain. INT4 saves the most memory, but its precision loss is greater, so you must validate quality depending on the task.

intuition for choosing quantization:

quality first -> FP16 or FP8

memory/cost first -> INT4 (quality validation required)

balance -> FP8 (advantageous when hardware-supported)

An important point is that quantization applies not only to weights but also to the KV cache. With long contexts and high concurrency, the KV cache takes up a large share of memory, so storing KV in FP8 can push concurrency even higher. However, weight quantization and KV quantization are separate settings, and turning both on tends to make the effects accumulate.

Going deeper 4: parallelism — when the model does not fit on a single GPU

If the model is so large that it does not fit in the memory of a single GPU, you must split the model across multiple GPUs. The two methods frequently used in serving are tensor parallel and pipeline parallel.

tensor parallel (TP):

split a layer's weight matrix horizontally across multiple GPUs.

each GPU does a partial computation and combines the results (all-reduce communication).

GPU-to-GPU communication is frequent, so a fast interconnect (NVLink) matters.

[GPU0: half the weights] --combine-- [GPU1: half the weights]

the two split and compute the same layer

pipeline parallel (PP):

divide the layers into groups and assign different layers to each GPU.

GPU0 handles the front layers, GPU1 handles the back layers.

communication is small but a pipeline bubble (idle region) can occur.

[GPU0: layers 1-16] --pass--> [GPU1: layers 17-32]

The practical intuition is as follows. When the interconnect is fast, as among GPUs within a single node, tensor parallel is advantageous. Even though communication is frequent, a fast link supports it. When the link is slow, as when crossing nodes, consider pipeline parallel with little communication or a combination of the two methods. Most serving engines let you specify the tensor parallel degree (e.g., TP=2, TP=4) in a single config line.

serve a large model with tensor parallel 4 in vLLM

python -m vllm.entrypoints.openai.api_server \

--model meta-llama/Llama-3.1-70B-Instruct \

--tensor-parallel-size 4 \

--gpu-memory-utilization 0.90

Parallelism is not free. Communication overhead is added, so if the model fits on a single card, it is faster not to split it. Parallelism is a tool for "when it is unavoidably large and must be split" or "when adding more GPUs to reduce latency."

In practice: autoscaling and observability

Once you put serving into production, single-instance performance is not the only concern. Traffic fluctuates by time of day, and you must scale instances up and down accordingly to control cost.

signals that serve as the basis for autoscaling:

- GPU utilization / KV cache occupancy

- queue length (number of pending requests)

- whether TTFT (time to first token) exceeds the SLA

example scale-out trigger:

KV cache occupancy > 85% sustained for a certain period -> instances +1

The point to be careful about here is that an LLM instance takes time to start up. Loading model weights into memory and initializing the engine can take from tens of seconds to several minutes. Therefore, scaling after traffic has surged is too late. You need a design with predictive scaling or sufficient headroom.

On the observability side, you must continuously track TTFT, time per token, and throughput mentioned earlier on a dashboard. You should also look at the request failure rate, OOM occurrences, and queue backlog together to catch problems early.

key panels of a serving dashboard:

1) TTFT distribution (p50, p95, p99)

2) time per output token (TPOT) distribution

3) throughput (tokens per second)

4) KV cache occupancy / GPU utilization

5) queue length / failure rate

Workload sizing: how to estimate throughput

Once you have chosen an engine and decided on a configuration, you need to estimate how much you can actually accept. Exact numbers can only be known through measurement, but a way of thinking that estimates a rough upper bound is useful.

intuition for the concurrency upper bound:

usable KV memory / KV memory per request = upper bound on concurrent requests

e.g.) usable KV memory = 40 GB

average KV per request (4K context basis) = 0.5 GB

-> upper bound on concurrency about 80

if you extend the context to 32K, KV per request becomes 8x -> about 10 concurrent

Such an estimate is not accurate, but it lets you feel the trade-off "extending the context sharply reduces concurrency" as a number. In real operations, the standard practice is to gradually raise concurrency through load testing and find the inflection point of TTFT and throughput. If you push concurrency beyond the inflection point, throughput stalls and only latency gets worse.

operating point found through load testing:

measure while raising concurrency up

-> throughput climbs up to a certain point, then stalls

-> beyond that point, only latency worsens

-> just before the stall is the efficient operating point

Pitfalls and troubleshooting

- **OOM (out of memory)**: It is common to set gpu-memory-utilization too high or to make max-model-len excessive. Assume the worst case where all concurrent requests decode to the maximum length, and leave headroom.

- **Throughput is good but latency is bad**: This is when the batch is too large and individual requests respond slowly. Remember that throughput and latency are in a trade-off relationship.

- **TensorRT-LLM compilation fails**: The model structure or precision setting may be outside the supported range. Check the support matrix first.

- **Using SGLang but the prefix cache has no effect**: If requests do not actually share a prefix, there is no benefit from RadixAttention. Inspect the workload structure, such as unifying the system prompt.

- **The benchmark differs from reality**: This is when the input/output distribution of the synthetic load differs from real traffic. Measure with real traffic samples as much as possible.

Closing

The core of LLM inference serving ultimately boils down to two things. First, accepting the fact that decode is memory-bound and attacking the memory bottleneck with batching, quantization, and KV cache optimization. Second, choosing an engine that fits the nature of your workload.

vLLM is a balanced general-purpose default, TensorRT-LLM is for extreme throughput of a fixed model, and SGLang is the strong one for prefix-sharing workloads. None is the absolute answer, and the numbers measured with your own traffic should be the basis for the final judgment. Inference serving is a rapidly evolving field, so we recommend keeping up with the official documentation consistently.

References

- [vLLM official docs](https://docs.vllm.ai/)

- [vLLM GitHub](https://github.com/vllm-project/vllm)

- [SGLang GitHub](https://github.com/sgl-project/sglang)

- [TensorRT-LLM GitHub](https://github.com/NVIDIA/TensorRT-LLM)

- [Hugging Face docs](https://huggingface.co/docs)

- [PyTorch](https://pytorch.org/)

- [Attention Is All You Need (arXiv:1706.03762)](https://arxiv.org/abs/1706.03762)

- [FlashAttention (arXiv:2205.14135)](https://arxiv.org/abs/2205.14135)