- 1. The Unique Nature of LLM Inference
- 2. KV Cache Mechanism and Memory Issues
- 3. Detailed Analysis of PagedAttention
- 4. Continuous Batching vs Static Batching
- 5. vLLM Installation and OpenAI-Compatible API Server Setup
- 6. Key vLLM Features
- 7. TensorRT-LLM Overview and Model Conversion
- 8. Quantization Technique Comparison
- 9. Benchmark Comparison: Throughput, Latency, TTFT
- 10. Brief Comparison with SGLang
- 11. Conclusion
- References
1. The Unique Nature of LLM Inference
Large Language Model (LLM) inference is fundamentally different from traditional deep learning model inference. Understanding this difference is the starting point for optimization.
1.1 Autoregressive Decoding
LLMs generate tokens in an autoregressive manner. That is, they use all previously generated tokens as input to predict the next single token, repeating this process until an End-of-Sequence (EOS) token is produced. This is fundamentally different from models like image classification or object detection, where a single forward pass on the input produces the result.
```text
Input: "The weather today"
Step 1: "The weather today" → "is"
Step 2: "The weather today is" → "nice"
Step 3: "The weather today is nice" → "."
Step 4: "The weather today is nice." → [EOS]
```
Due to this repetitive nature, LLM inference becomes a Memory-Bound operation. At each step, billions of model parameters must be read from GPU memory, but the actual computation (matrix multiplication for a single token) is relatively small. The bottleneck is GPU Memory Bandwidth rather than computational power (FLOPS).
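The loop above can be sketched in a few lines of Python. The `next_token` function here is a hypothetical stand-in for a real model's forward pass, which would return a probability distribution over the vocabulary rather than a lookup result:

```python
# Minimal sketch of the autoregressive decoding loop.
# `next_token` is a toy stand-in for one forward pass of an LLM.

EOS = "[EOS]"

def next_token(tokens: list) -> str:
    # Hypothetical lookup in place of a real model's forward pass.
    continuations = {
        ("The", "weather", "today"): "is",
        ("The", "weather", "today", "is"): "nice",
        ("The", "weather", "today", "is", "nice"): ".",
    }
    return continuations.get(tuple(tokens), EOS)

def generate(prompt: list) -> list:
    tokens = list(prompt)
    while True:
        tok = next_token(tokens)   # the full context is fed back in each step
        if tok == EOS:
            return tokens
        tokens.append(tok)

print(generate(["The", "weather", "today"]))
# → ['The', 'weather', 'today', 'is', 'nice', '.']
```

Each iteration re-reads the entire context (and, in a real model, all weights), which is exactly why the decode loop is memory-bound.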
1.2 Prefill Phase vs Decode Phase
LLM inference is divided into two clearly distinct phases.
Prefill Phase (Prompt Processing)
This phase processes the entire user-input prompt at once. Since input tokens can be processed in parallel, it exhibits Compute-Bound characteristics. During this phase, Key-Value (KV) vectors are computed for all input tokens and stored in cache. The TTFT (Time To First Token) — the latency until the user receives the first token — is primarily determined by the performance of this phase.
Decode Phase (Token Generation)
After prefill, tokens are generated one at a time sequentially. Since only one token is processed per step, the computation is minimal, and the cost of loading model weights from memory is dominant, making it a Memory-Bound operation. The performance of this phase determines the overall Throughput and TPOT (Time Per Output Token).
The fact that these two phases have completely different computational characteristics is the core factor that makes LLM inference optimization challenging. Prefill must maximize GPU compute unit utilization, while Decode must maximize memory bandwidth utilization.
2. KV Cache Mechanism and Memory Issues
2.1 What is KV Cache
In autoregressive decoding, recomputing the attention for all previous tokens from scratch at every step would create enormous redundant computation. To prevent this, Key and Value vectors computed in previous steps are cached in GPU memory — this is the KV Cache.
In Self-Attention at each Transformer layer, three vectors are used: Query (Q), Key (K), and Value (V). When generating a new token, only the Q vector for that token is newly computed, while K and V vectors from previous tokens are retrieved from the cache. This reduces the attention cost of each decoding step from O(n^2) to O(n) in the sequence length.
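The caching logic can be illustrated with a toy single-head attention in pure Python (real implementations operate on batched tensors on the GPU). The point is that each decode step computes K and V only for the new token, yet the cached path produces exactly the same output as recomputing attention from scratch:

```python
import math

def attend(q, keys, values):
    # Scaled dot-product attention for a single query vector.
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / z
            for i in range(len(values[0]))]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Only the new token's K/V are appended; earlier ones are reused.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

cache = KVCache()
qkv = [([1.0, 0.0], [0.5, 0.5], [1.0, 2.0]),
       ([0.0, 1.0], [0.2, 0.8], [3.0, 4.0])]
for q, k, v in qkv:
    out_cached = cache.step(q, k, v)

# Recomputing attention from scratch over all tokens gives the same result.
out_full = attend(qkv[-1][0], [t[1] for t in qkv], [t[2] for t in qkv])
assert all(abs(a - b) < 1e-12 for a, b in zip(out_cached, out_full))
```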
2.2 Memory Consumption of KV Cache
The memory usage of KV Cache is calculated as follows:
```text
KV Cache Memory = 2 x num_layers x num_heads x head_dim x seq_len x batch_size x dtype_size
```
For example, for the Llama-2-70B model (80 layers, 64 heads, head dim 128) at FP16 precision, the KV Cache for a single sequence (seq_len=2048) is:

```text
2 x 80 x 64 x 128 x 2048 x 2 bytes = 5,368,709,120 bytes ≈ 5.37 GB
```

Over 5GB is needed for just a single sequence. (This treats all 64 heads as KV heads; Llama-2-70B actually uses Grouped-Query Attention with only 8 KV heads, which shrinks its cache 8x, but the full-MHA figure conveys the scale of the problem.) Since this value scales linearly with Batch Size, KV Cache becomes the largest consumer of GPU memory. On 80GB A100 GPUs, a 70B model in FP16 needs about 140GB for the weights alone; sharded across 4 GPUs, that leaves roughly 35GB of weights and only about 40GB for KV Cache per GPU, which directly limits the number of concurrent requests (Batch Size) that can be processed.
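The formula translates directly into a few lines of code; note that the product of the example dimensions above comes to 5,368,709,120 bytes, roughly 5.37 GB:

```python
def kv_cache_bytes(num_layers, num_heads, head_dim,
                   seq_len, batch_size, dtype_size=2):
    # The leading factor 2 covers both the Key and the Value tensor;
    # num_heads should be the KV head count for models using GQA.
    return (2 * num_layers * num_heads * head_dim
            * seq_len * batch_size * dtype_size)

# Llama-2-70B-style dimensions from the text, all 64 heads treated as KV heads
b = kv_cache_bytes(num_layers=80, num_heads=64, head_dim=128,
                   seq_len=2048, batch_size=1)
print(f"{b / 1e9:.2f} GB")  # → 5.37 GB
```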
2.3 Memory Waste in Existing Systems
Existing LLM serving systems pre-allocated contiguous memory equal to the maximum length of a sequence for KV Cache. This approach has three severe inefficiencies:
- Internal Fragmentation: When the actual sequence length is shorter than the maximum length, the remaining space is wasted.
- External Fragmentation: As memory blocks of various sizes are allocated and freed, usable but non-contiguous memory fragments appear.
- Reservation Waste: Until generation is complete, memory equal to the maximum length is reserved, locking up actually unused memory.
According to the vLLM paper, 60-80% of KV Cache memory was wasted in existing systems. This is the core problem that PagedAttention set out to solve.
3. Detailed Analysis of PagedAttention
3.1 Core Idea: Operating System Virtual Memory
PagedAttention is a technique proposed in the paper "Efficient Memory Management for Large Language Model Serving with PagedAttention" by Woosuk Kwon et al. at UC Berkeley, presented at SOSP 2023. The core idea is to apply the OS Virtual Memory and Paging technique to KV Cache management.
In an OS, a process's virtual address space is contiguous, but in actual physical memory (RAM), it is stored in fixed-size Pages distributed throughout memory. A Page Table maps virtual addresses to physical addresses. PagedAttention applies this concept directly to KV Cache.
3.2 Block-Based KV Cache Management
PagedAttention divides the KV Cache into fixed-size Blocks. Each Block stores Key and Value vectors for a fixed number of tokens (e.g., 16). The core mechanism is as follows:
- Non-contiguous Memory Allocation: Blocks can be located anywhere in GPU memory. No contiguous memory space is needed.
- Block Table: Each sequence has a Block Table that maps its logical Block numbers to physical Block locations. This is the same concept as an OS Page Table.
- Dynamic Allocation: Blocks are allocated one at a time as needed when tokens are generated. There is no need to pre-reserve the maximum sequence length.
- Immediate Release: When sequence generation is complete, the corresponding Blocks are immediately returned to the Free List.
```text
Sequence "The weather today is nice.":

Logical Block 0 → Physical Block 7 (tokens: "The", "weather", "today")
Logical Block 1 → Physical Block 2 (tokens: "is", "nice", ".")

Block Table:
[0] → 7
[1] → 2
```
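A toy allocator illustrates the bookkeeping behind the four mechanisms above. Physical block ids stand in for real GPU memory, and vLLM's actual block manager tracks considerably more state:

```python
class PagedKVAllocator:
    # Toy model of PagedAttention's block-table bookkeeping.
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free list of physical blocks
        self.tables = {}                      # sequence id -> block table
        self.lengths = {}                     # sequence id -> tokens stored

    def append_token(self, seq):
        n = self.lengths.get(seq, 0)
        if n % self.block_size == 0:
            # Last block is full: allocate one new block, on demand only.
            self.tables.setdefault(seq, []).append(self.free.pop(0))
        self.lengths[seq] = n + 1

    def release(self, seq):
        # Sequence finished: return its blocks to the free list immediately.
        self.free.extend(self.tables.pop(seq, []))
        self.lengths.pop(seq, None)

alloc = PagedKVAllocator(num_blocks=8, block_size=3)
for tok in ["The", "weather", "today", "is", "nice", "."]:
    alloc.append_token("seq0")
print(alloc.tables["seq0"])   # → [0, 1]  (two blocks of 3 tokens each)
alloc.release("seq0")
print(len(alloc.free))        # → 8
```

Because blocks are handed out one at a time and returned immediately, no memory is reserved for a maximum length that may never be reached.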
3.3 Memory Efficiency Improvement
PagedAttention virtually eliminates memory waste compared to existing systems.
- Internal Fragmentation: Only occurs in the last Block, averaging less than half the Block Size.
- External Fragmentation: Completely eliminated since all Blocks are the same size.
- Reservation Waste: Eliminated since Blocks are allocated only when needed.
According to the vLLM official documentation, PagedAttention enables near-optimal usage of KV Cache memory, which has the effect of increasing the processable Batch Size by 2-4x or more on the same GPU. An increase in Batch Size directly translates to increased Throughput.
3.4 Memory Sharing
Another advantage of PagedAttention is memory sharing. In Parallel Sampling or Beam Search, when multiple sequences share the same Prompt, the KV Cache Blocks corresponding to the Prompt can be shared at the Block Table level without copying. Using a Copy-on-Write (CoW) approach, new Blocks are allocated only at the point where sequences diverge. This can reduce memory usage by up to 55% in Beam Search.
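The sharing scheme can be sketched with simple reference counts; the structure below is illustrative, not vLLM's actual implementation:

```python
class SharedBlocks:
    # Toy copy-on-write bookkeeping: sequences forked from a common prompt
    # share physical blocks via refcounts; a shared block is copied only
    # when one branch needs to write into it.
    def __init__(self):
        self.refcount = {}          # physical block id -> reference count
        self.next_id = 0

    def alloc(self):
        bid, self.next_id = self.next_id, self.next_id + 1
        self.refcount[bid] = 1
        return bid

    def fork(self, table):
        # A new sample/beam reuses the parent's blocks; just bump refcounts.
        for bid in table:
            self.refcount[bid] += 1
        return list(table)

    def write(self, table, i):
        bid = table[i]
        if self.refcount[bid] > 1:  # shared block: copy-on-write
            self.refcount[bid] -= 1
            table[i] = self.alloc()
        return table[i]

pool = SharedBlocks()
parent = [pool.alloc(), pool.alloc()]   # prompt occupies blocks 0 and 1
child = pool.fork(parent)               # second sample shares both blocks
pool.write(child, 1)                    # branches diverge in the last block
print(parent, child)  # → [0, 1] [0, 2]
```

Until the divergence point, both sequences pay for the prompt's KV Cache only once.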
4. Continuous Batching vs Static Batching
4.1 Limitations of Static Batching
Traditional Static Batching groups multiple requests into a single Batch for simultaneous processing, but waits until all sequences in the Batch are complete before processing the next Batch. The problem is that LLM response lengths vary significantly between requests.
For example, when processing with Batch Size 4, if one request generates only 10 tokens while another generates 500 tokens, the GPU resources for the completed slot remain idle even after the 10-token request finishes, waiting until the 500-token request completes. This inefficiency significantly degrades Throughput in production environments.
4.2 Continuous Batching (Iteration-Level Scheduling)
Continuous Batching performs scheduling at the step (iteration) level. The core operating mechanism is as follows:
- Incoming requests are placed in a Waiting Queue.
- The Scheduler checks the Waiting Queue at every Forward Pass (Generation Step).
- If there is spare capacity in the currently running Batch, new waiting requests are added to the Batch.
- The GPU performs one step of Decoding for all sequences in the current Batch.
- When a sequence completes generation (reaches EOS), its slot resources are released immediately.
- In the next step, new requests fill the empty slots.
Static Batching:

```text
Step 1: [A, B, C, D] ← 4 processed simultaneously
Step 2: [A, B, C, D]
Step 3: [A, _, C, D] ← B done, but slot is empty
Step 4: [A, _, _, D] ← C done, still waiting
Step 5: [_, _, _, D] ← Wait until D finishes, then start new Batch
```

Continuous Batching:

```text
Step 1: [A, B, C, D]
Step 2: [A, B, C, D]
Step 3: [A, E, C, D] ← B done, E immediately inserted
Step 4: [A, E, F, D] ← C done, F immediately inserted
Step 5: [G, E, F, D] ← A done, G immediately inserted
```
According to Anyscale's benchmarks, Continuous Batching can achieve up to 23x Throughput improvement over Static Batching, because the GPU is maintained at maximum sequence processing capacity at all times.
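A small simulation makes the iteration-level policy concrete. Each request is reduced to just the number of tokens it will generate, which is of course unknown in advance in a real system:

```python
import collections

def continuous_batching(jobs, max_batch):
    # jobs: request id -> number of tokens it will generate.
    waiting = collections.deque(jobs)
    running, remaining, trace = [], dict(jobs), []
    while running or waiting:
        # Fill free slots from the waiting queue at every iteration.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        trace.append(list(running))
        for r in list(running):      # one decode step for every sequence
            remaining[r] -= 1
            if remaining[r] == 0:    # EOS: release the slot immediately
                running.remove(r)
    return trace

steps = continuous_batching({"A": 5, "B": 2, "C": 3, "D": 5, "E": 2},
                            max_batch=4)
print(len(steps))   # → 5
print(steps[2])     # → ['A', 'C', 'D', 'E']  (E fills B's freed slot)
```

With the same five jobs, static batching would run the first batch for 5 steps (the length of its longest member) before even starting E, for 7 steps in total; the iteration-level scheduler finishes everything in 5.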
4.3 Synergy with PagedAttention
For Continuous Batching to work effectively, KV Cache memory must be managed flexibly in situations where sequences of various lengths dynamically enter and leave. PagedAttention provides fine-grained, Block-level memory allocation/deallocation, perfectly combining with Continuous Batching's dynamic scheduling. The combination of these two technologies is vLLM's core competitive advantage.
5. vLLM Installation and OpenAI-Compatible API Server Setup
5.1 Installation
According to the vLLM official documentation, installation is simple via pip. Python 3.9 or higher and CUDA 12.x are required.
```bash
# Installation via pip (recommended)
pip install vllm

# Environment management using uv (officially recommended)
uv pip install vllm
```
For Docker, you can use the official image provided by vLLM:
```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
```
5.2 Offline Inference (Batch Processing)
Simple batch inference uses the LLM class:
```python
from vllm import LLM, SamplingParams

# Load model
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Set sampling parameters
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

# Run batch inference
prompts = [
    "The core techniques for LLM inference optimization are",
    "Explaining the principle of PagedAttention",
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {output.outputs[0].text}")
```
5.3 OpenAI-Compatible API Server
vLLM provides an HTTP server compatible with the OpenAI API. It can be started with the vllm serve command:
```bash
# Start server
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --dtype auto \
  --api-key token-abc123 \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9
```
Once the server is running, you can use the OpenAI Python client library as-is:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

# Chat Completions API
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What is PagedAttention?"},
    ],
    temperature=0.7,
    max_tokens=512,
)

print(response.choices[0].message.content)
```
vLLM's OpenAI-compatible server supports endpoints such as /v1/completions, /v1/chat/completions, and /v1/embeddings, offering the significant advantage of replacing the backend without changing existing application code that uses the OpenAI API.
5.4 Key Server Configuration Options
| Option | Description | Default |
|---|---|---|
| --model | HuggingFace model name or path | Required |
| --dtype | Data type (auto, float16, bfloat16) | auto |
| --max-model-len | Maximum sequence length | Model config |
| --gpu-memory-utilization | GPU memory usage ratio | 0.9 |
| --tensor-parallel-size | Number of GPUs for Tensor Parallelism | 1 |
| --pipeline-parallel-size | Number of Pipeline Parallelism stages | 1 |
| --quantization | Quantization method (awq, gptq, fp8, etc.) | None |
| --enable-prefix-caching | Enable Prefix Caching | False |
| --max-num-seqs | Maximum concurrent sequence count | 256 |
6. Key vLLM Features
6.1 Speculative Decoding
Speculative Decoding is a technique where a small, fast Draft Model speculatively generates multiple tokens in advance, and a large Target Model verifies them all at once. Tokens that pass verification are adopted as-is, and regeneration starts from the first incorrect token. This can reduce Inter-Token Latency while maintaining generation quality.
According to the vLLM official documentation, the following Speculative Decoding methods are supported:
- Draft Model: Uses a small model (e.g., Llama-68M) as a draft to speculatively generate multiple tokens
- EAGLE: An efficient neural network-based draft generation method
- MLP Draft Model: A lightweight MLP-based draft generation
- N-gram Speculation: Token prediction based on N-gram patterns in the input text
- Suffix Decoding: A suffix-based prediction strategy
```bash
# Running a Speculative Decoding server with a Draft Model
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5 \
  --use-v2-block-manager
```
Speculative Decoding is particularly effective in Memory-Bound workloads with low to moderate QPS (Queries Per Second). In high QPS environments, the additional computational overhead from the Draft Model may actually degrade performance.
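The draft-and-verify loop described at the start of this section can be sketched with a toy greedy variant. Real implementations verify proposals against the target model's probability distribution, and the verification of all k tokens happens in a single batched forward pass; the two lambda "models" below are purely hypothetical:

```python
def speculative_step(target, context, draft, k):
    # One round of draft-and-verify: the draft proposes k tokens, the
    # target keeps the longest matching prefix and adds one of its own.
    proposed, ctx = [], list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(context)
    for tok in proposed:
        if target(ctx) != tok:     # first mismatch: stop accepting
            break
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target(ctx))   # the target always contributes one token
    return accepted

# Hypothetical models: the draft agrees with the target except at one spot.
target_text = ["The", "cat", "sat", "on", "the", "mat"]
draft_text  = ["The", "cat", "sat", "in", "the", "mat"]
target = lambda ctx: target_text[len(ctx)]
draft  = lambda ctx: draft_text[len(ctx)]

print(speculative_step(target, ["The"], draft, k=3))
# → ['cat', 'sat', 'on']  (three tokens from one verification round)
```

When the draft agrees with the target often, each target pass yields several tokens instead of one, cutting Inter-Token Latency.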
6.2 Automatic Prefix Caching (APC)
Automatic Prefix Caching is a feature that reuses the KV Cache of identical Prefixes when repeated requests share the same Prefix. For example, when thousands of requests come in with the same System Prompt, the KV Cache for the System Prompt is computed only once and reused for subsequent requests.
In vLLM, Prefix Caching operates at the Block level. When a request arrives, the Prompt's tokens are divided into Blocks, existing cached Blocks are reused, and new Blocks are created only from the point where new tokens appear.
```bash
# Enable Prefix Caching
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching
```
This feature dramatically reduces TTFT in scenarios where the same Prefix is repeated, such as RAG (Retrieval-Augmented Generation), Few-shot Learning, and Multi-turn conversations.
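The block-level matching can be sketched with content hashes. The key detail is that a block's cache key must cover its entire prefix, not just its own tokens; the block size and hashing scheme here are simplified stand-ins for vLLM's:

```python
BLOCK = 4   # tokens per block (vLLM's actual block size is larger, e.g. 16)

def block_keys(tokens):
    # A block is reusable only when everything before it also matches,
    # so each key hashes all tokens up to and including that block.
    keys = []
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        keys.append(hash(tuple(tokens[:i + BLOCK])))
    return keys

cache = {}

def prefill(tokens):
    hits = 0
    for key in block_keys(tokens):
        if key in cache:
            hits += 1                 # KV block reused, no recomputation
        else:
            cache[key] = object()     # stand-in for the computed KV block
    return hits

sys_prompt = list(range(8))           # shared system prompt: 2 full blocks
assert prefill(sys_prompt + [100, 101, 102, 103]) == 0
assert prefill(sys_prompt + [200, 201, 202, 203]) == 2   # prefix blocks hit
```

The second request recomputes only the blocks past the shared prefix, which is exactly why TTFT drops sharply for repeated system prompts.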
6.3 LoRA Serving
vLLM can dynamically load and serve multiple LoRA (Low-Rank Adaptation) Adapters on a single Base Model. This allows handling various fine-tuned model variants from a single server instance.
```bash
# Running server with LoRA Adapter
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules my-adapter=./path/to/lora/adapter \
  --max-lora-rank 64
```
When a request specifies the LoRA Adapter name in the model parameter, the result with that Adapter applied is returned. In Tensor Parallelism environments, only half of the LoRA operations are distributed by default, but using the --fully-sharded-loras option provides better performance for long sequences or high ranks.
7. TensorRT-LLM Overview and Model Conversion
7.1 What is TensorRT-LLM
TensorRT-LLM is an LLM inference optimization library developed by NVIDIA, built by specializing the TensorRT engine for LLMs. It defines LLM models through a Python API and compiles them into optimized inference engines for NVIDIA GPUs. Key features include:
- NVIDIA GPU-Specific Optimization: CUDA Kernels optimized for NVIDIA architectures
- FP8/NVFP4 Quantization: Hardware-accelerated quantization support for Hopper (H100), Ada Lovelace, and Blackwell (B200) architectures
- In-flight Batching: TensorRT-LLM implementation of Continuous Batching
- Tensor/Pipeline Parallelism: Multi-GPU distributed inference support
- Paged KV Cache: Memory management similar to PagedAttention
- EAGLE-3 Speculative Decoding: Support for the latest Speculative Decoding techniques
7.2 Model Conversion Workflow
TensorRT-LLM model conversion consists of two steps.
Step 1: Checkpoint Conversion
Convert checkpoints from frameworks like HuggingFace to TensorRT-LLM format.
```bash
# Llama model Checkpoint conversion
python convert_checkpoint.py \
  --model_dir /path/to/Llama-3.1-8B-Instruct \
  --output_dir ./tllm_checkpoint_1gpu \
  --dtype float16

# With Tensor Parallelism
python convert_checkpoint.py \
  --model_dir /path/to/Llama-3.1-70B-Instruct \
  --output_dir ./tllm_checkpoint_4gpu_tp4 \
  --dtype float16 \
  --tp_size 4
```
Step 2: TensorRT Engine Build
Compile the converted checkpoint into an optimized TensorRT engine using the trtllm-build command.
```bash
trtllm-build \
  --checkpoint_dir ./tllm_checkpoint_1gpu \
  --output_dir ./trt_engines/llama-8b/fp16/1-gpu \
  --gemm_plugin auto \
  --max_batch_size 64 \
  --max_input_len 2048 \
  --max_seq_len 4096
```
7.3 Building via Python API
You can also perform conversion and building programmatically using the Python API.
```python
import tensorrt_llm
from tensorrt_llm import BuildConfig

# Direct conversion and build from HuggingFace model
llama = tensorrt_llm.LLaMAForCausalLM.from_hugging_face(
    model_dir="/path/to/Llama-3.1-8B-Instruct",
    dtype="float16",
)

# Engine build
build_config = BuildConfig(max_batch_size=64)
engine = tensorrt_llm.build(llama, build_config)

# Save engine
engine.save("./trt_engines/llama-8b")
```
In this process, CLI tools like convert_checkpoint.py use internal APIs, so you must ensure the TensorRT-LLM version matches the script version in the examples folder.
8. Quantization Technique Comparison
Quantization is a technique that reduces memory usage and computation by representing model weights and activation values at lower precision. Here we compare the most widely used quantization techniques for LLM inference.
8.1 GPTQ (Post-Training Quantization)
GPTQ is a Post-Training Quantization (PTQ) technique that minimizes quantization error by leveraging Hessian information on a per-layer basis.
- Principle: When quantizing weights of each layer, it performs compensation to minimize changes in output activations. The compensation process approximates the inverse of the Hessian matrix to find optimal quantization values.
- Precision: Mainly INT4 Weight-Only (W4A16)
- Advantage: Only requires a calibration dataset, enabling quantization without training
- Disadvantage: Calibration can take tens of minutes, and slightly lower accuracy than AWQ has been reported
8.2 AWQ (Activation-Aware Weight Quantization)
AWQ is a quantization technique that determines weight importance based on activation distributions, protecting important weights.
- Principle: Not all weights are equally important — weights connected to channels with large activation magnitudes have a greater impact on model performance. This small fraction of important weights (about 1%) is maintained at higher precision while the rest are quantized more aggressively.
- Precision: INT4 Weight-Only (W4A16) or W4A8
- Advantage: Higher accuracy retention than GPTQ (about 95% Quality Retention), hardware-efficient
- Disadvantage: Requires calibration process
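Both GPTQ and AWQ build on the same W4A16 primitive: weights are stored as 4-bit integers plus a per-group scale and dequantized on the fly. A plain round-to-nearest version of that primitive, without either method's refinements (Hessian-based compensation, activation-aware scaling), looks like this:

```python
def quantize_group(weights, bits=4):
    # Symmetric round-to-nearest quantization for one weight group.
    qmax = 2 ** (bits - 1) - 1            # 7 for INT4
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    # Done on the fly at inference time (W4A16: activations stay FP16).
    return [qi * scale for qi in q]

w = [0.12, -0.51, 0.33, 0.02, -0.27, 0.44, -0.08, 0.19]
q, scale = quantize_group(w)
w_hat = dequantize_group(q, scale)
err = max(abs(a - b) for a, b in zip(w, w_hat))
assert err <= scale / 2 + 1e-12   # RTN error is bounded by half a step
```

GPTQ and AWQ both exist because this naive rounding error, while bounded per weight, degrades model quality noticeably at 4 bits; their contributions are smarter choices of which weights to round where.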
8.3 SqueezeLLM
SqueezeLLM is a technique proposed by UC Berkeley that combines non-uniform quantization with Dense-and-Sparse decomposition.
- Principle: Assigns a separate Lookup Table to each output channel for channel-wise non-uniform quantization. Also separates outlier weights into a Sparse matrix for separate processing. Unlike GPTQ/AWQ which minimize changes in individual layer outputs, SqueezeLLM optimizes to minimize changes in the final model output.
- Precision: INT3, INT4 level
- Advantage: Maintains high accuracy even at extremely low-bit quantization
- Disadvantage: High implementation complexity, Lookup Table computation overhead during inference
8.4 FP8 Quantization
FP8 (8-bit Floating Point) is a quantization format supported at the hardware level in NVIDIA Hopper (H100) and later architectures.
- Principle: Converts FP16/BF16 weights and activations to 8-bit floating point (E4M3 or E5M2). Since Tensor Cores natively support FP8 operations, there is virtually no separate dequantization overhead unlike INT8.
- Precision: W8A8 (both weights and activations at 8-bit)
- Advantage: Highest accuracy retention, fastest calibration (minutes), hardware acceleration on H100/B200
- TensorRT-LLM Performance: On Llama-v2-7B, 1.51x speedup at Batch Size 1 and 1.40x at Batch Size 8 compared to FP16 (NVIDIA official benchmark)
8.5 Quantization Selection Guide
The recommended guidelines from TensorRT-LLM official documentation are as follows:
| Scenario | Recommended Method | Reason |
|---|---|---|
| Small Batch (BS 4 or less) | Weight-Only (W4A16, W8A16) | Memory Bandwidth is the bottleneck, so reducing weight size is effective |
| Large Batch (BS 16 or more) | FP8 (W8A8) preferred | Higher computation makes quantizing both weights and activations beneficial |
| Highest accuracy needed | FP8 | Least accuracy loss |
| Maximum compression needed | INT4 AWQ/GPTQ | Reduces model size by about 75% with 4-bit |
| Using H100/B200 | FP8 or NVFP4 | Native hardware support for best performance |
Using quantized models in vLLM:
```bash
# Serving AWQ quantized model
vllm serve TheBloke/Llama-2-7B-AWQ \
  --quantization awq \
  --dtype auto

# Serving GPTQ quantized model
vllm serve TheBloke/Llama-2-7B-GPTQ \
  --quantization gptq \
  --dtype auto

# FP8 quantization serving (H100 or above)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --dtype auto
```
9. Benchmark Comparison: Throughput, Latency, TTFT
9.1 Key Performance Metrics
The core metrics used to evaluate LLM inference performance are as follows:
- Throughput: Number of tokens (tokens/second) or requests (requests/second) that can be processed per second
- Latency: Total elapsed time from request start to response completion
- TTFT (Time To First Token): Time from request start until the first token is generated. Directly affects perceived user response speed
- TPOT (Time Per Output Token): Average time for generating each token after the first. Directly affects streaming output speed
- ITL (Inter-Token Latency): Time interval between token generations
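Given per-token arrival timestamps, these metrics reduce to a few subtractions. A minimal helper (the names are illustrative, not taken from any particular benchmark tool):

```python
def latency_metrics(request_start, token_times):
    # token_times: wall-clock timestamps at which each output token arrived.
    ttft = token_times[0] - request_start           # Time To First Token
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(itls) / len(itls) if itls else 0.0   # mean Inter-Token Latency
    e2e = token_times[-1] - request_start           # end-to-end latency
    return {"ttft": ttft, "tpot": tpot, "e2e_latency": e2e}

m = latency_metrics(0.0, [0.25, 0.30, 0.35, 0.40])
print(m)  # ttft = 0.25 s, tpot ≈ 0.05 s, e2e_latency = 0.40 s
```

Throughput is then measured across requests (total tokens or requests completed per second), not per request.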
9.2 vLLM vs TensorRT-LLM Comparison
Synthesizing benchmark results from 2025, the two frameworks show different strengths.
Throughput
TensorRT-LLM generally records the highest Throughput, thanks to CUDA Kernels optimized for NVIDIA GPUs and TensorRT engine compilation optimizations. It achieves approximately 180-220 req/sec, while vLLM achieves approximately 120-160 req/sec as the second highest. However, these numbers vary significantly depending on model, GPU, Batch Size, sequence length, and other factors.
TTFT (Time To First Token)
vLLM shows excellent performance in TTFT, stably maintaining 50-80ms levels even as concurrent users increase. TensorRT-LLM is faster at 35-50ms at Low Concurrency, but some reports indicate inferior scaling characteristics at High Concurrency compared to vLLM.
Latency
TensorRT-LLM has an edge in Per-Token Latency at Low Concurrency. Especially on B200 GPUs, TensorRT-LLM outperforms SGLang and vLLM across all metrics.
Overall Comparison
| Metric | vLLM | TensorRT-LLM |
|---|---|---|
| Throughput | High (2nd) | Very High (1st) |
| TTFT (Low QPS) | Good | Excellent |
| TTFT (High QPS) | Stable | Variable |
| Setup Difficulty | Low | High (build process needed) |
| Model Compatibility | Very Wide | NVIDIA GPU only |
| Flexibility | High (Python API) | Medium (engine rebuild needed) |
| Community/Ecosystem | Very Active | NVIDIA-led |
9.3 Framework Selection Criteria
- Rapid prototyping and flexible serving: vLLM is suitable. You can start with a single pip install command and load HuggingFace models directly for immediate serving.
- Maximum performance for production: TensorRT-LLM is suitable. Although model conversion and engine building take time, the optimized engine performance is superior. The performance gap is maximized especially when leveraging FP8/NVFP4 on the latest NVIDIA GPUs (H100, B200).
- Diverse models and rapid experimentation: vLLM is advantageous. Swapping LoRA Adapters, changing quantization methods, etc. are possible with just a server restart.
- Maximizing latest NVIDIA GPU utilization: TensorRT-LLM is advantageous. It leverages hardware features of Hopper and Blackwell architectures earliest and deepest.
10. Brief Comparison with SGLang
SGLang (Structured Generation Language) is an LLM inference framework developed by UC Berkeley LMSYS, gaining attention following vLLM and TensorRT-LLM.
10.1 Key Differentiator: RadixAttention
SGLang's core technology is RadixAttention. This manages Prefix Caching using a Radix Tree data structure, enabling finer token-level Prefix matching and reuse compared to vLLM's Block-level Prefix Caching.
Since the Radix Tree shares Prefixes from multiple requests in a tree structure, it can efficiently reuse KV Cache from previous turns in Multi-turn conversations. In benchmarks, RadixAttention showed approximately 10% performance improvement over vLLM in large Multi-turn conversations.
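The token-granularity matching can be sketched with a plain prefix tree. A real radix tree compresses chains of single-child nodes, and each node would hold actual KV blocks rather than a counter:

```python
class RadixCache:
    # Toy token-level prefix tree: each path from the root represents a
    # cached token sequence whose KV state could be reused.
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def match_prefix(self, tokens):
        # Longest cached prefix, at token granularity (vs block granularity).
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n

cache = RadixCache()
cache.insert(["sys", "prompt", "turn1", "answer1"])
# A second turn reuses the entire first-turn history, token by token:
print(cache.match_prefix(["sys", "prompt", "turn1", "answer1", "turn2"]))
# → 4
```

Because matching stops at the exact token where requests diverge, nothing is recomputed for the shared conversation history, which is the source of SGLang's Multi-turn advantage.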
10.2 Performance Comparison
Synthesizing 2025 benchmark results:
- Throughput: SGLang achieved approximately 16,200 tokens/sec through RadixAttention, approximately 29% higher Throughput compared to vLLM's approximately 12,500 tokens/sec
- Total Processing Time: For processing 500 Prompts, SGLang 54.2 seconds vs vLLM 58.9 seconds, SGLang about 8% faster
- Multi-turn Scenarios: SGLang shows clear advantage in KV Cache reuse efficiency
10.3 Use Scenarios
| Scenario | Recommended Framework |
|---|---|
| High concurrency, single-turn | vLLM |
| Multi-turn conversation, structured output | SGLang |
| Maximum NVIDIA GPU performance | TensorRT-LLM |
| Quick start, wide model compatibility | vLLM |
| Complex generation logic | SGLang |
SGLang particularly excels in workloads requiring complex Multi-turn interactions, structured outputs (JSON Schema, etc.), and sophisticated generation control. In contrast, vLLM is well-suited for single-turn processing at high concurrency and maximizing Throughput with limited resources.
11. Conclusion
LLM inference optimization is achieved through a combination of multiple technologies, not a single technique. PagedAttention eliminates KV Cache memory waste, Continuous Batching maximizes GPU utilization, and quantization reduces model size and computation. Speculative Decoding improves Per-Token Latency, and Prefix Caching shortens TTFT for repeated Prompts.
vLLM integrates these technologies into an easy-to-use interface, enabling rapid prototyping and flexible serving, while TensorRT-LLM pursues maximum performance through deep optimizations specific to NVIDIA GPUs. SGLang provides differentiated performance in Multi-turn scenarios through efficient KV Cache reuse via RadixAttention.
Framework selection should be made by comprehensively considering service requirements, hardware environment, and operational complexity. Regardless of which framework is chosen, understanding the principles of the core technologies covered in this article will serve as the foundation for proper configuration and tuning.
References
- vLLM Official Documentation
- vLLM PagedAttention Design Document
- vLLM Speculative Decoding Documentation
- vLLM Automatic Prefix Caching Documentation
- vLLM OpenAI Compatible Server Documentation
- vLLM Quickstart Guide
- vLLM GitHub Repository
- Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023)
- NVIDIA TensorRT-LLM Official Documentation
- TensorRT-LLM Build Workflow
- TensorRT-LLM Quantization Guide
- TensorRT-LLM FP8 Quantization
- TensorRT-LLM Checkpoint Format
- TensorRT-LLM GitHub Repository
- NVIDIA TensorRT-LLM Docs (NVIDIA)
- TensorRT-LLM Quantization Examples
- SqueezeLLM: Dense-and-Sparse Quantization (arXiv)
- Continuous Batching and LLM Inference (Anyscale Blog)
- SGLang GitHub Repository
- LLM Inference Benchmarking with TensorRT-LLM (NVIDIA Blog)
- vLLM v0.6.0 Performance Update (vLLM Blog)