The Complete Guide to LLM Inference Optimization: vLLM, TensorRT-LLM, Speculative Decoding
- Introduction
- Understanding the LLM Inference Pipeline
- KV Cache Optimization: PagedAttention and FlashAttention
- vLLM: High-Performance LLM Serving Engine
- TensorRT-LLM: NVIDIA-Optimized Inference
- Speculative Decoding: Draft-and-Verify Acceleration
- Inference Engine Comparison (vLLM vs TGI vs TensorRT-LLM)
- Quantization and Inference Optimization (AWQ, GPTQ, FP8)
- Operational Considerations and Troubleshooting
- Failure Scenarios and Recovery Procedures
- Conclusion
- References

Introduction
Optimizing inference is just as critical as training when it comes to Large Language Models (LLMs). No matter how capable a model is, it becomes impractical for production use if inference costs are excessive or response latency is too high. Serving models with 70B or more parameters in production requires a comprehensive approach combining GPU memory management, batching strategies, decoding acceleration, quantization, and other optimization techniques.
Since 2025, the field of LLM inference optimization has undergone rapid advancement. vLLM's PagedAttention has reduced KV cache memory waste to under 4%, TensorRT-LLM 1.0 has achieved peak performance on NVIDIA GPUs with a stabilized PyTorch-based architecture and FP8/NVFP4 quantization, and Speculative Decoding has delivered 2-3x speedups without any loss in output quality.
In this post, we analyze the core bottlenecks of LLM inference and provide a comparative analysis of key technologies — vLLM, TensorRT-LLM, Speculative Decoding, and KV Cache optimization — through practical code examples and benchmarks. We also cover operational best practices and troubleshooting scenarios for production environments, aiming to paint a complete picture of LLM inference optimization.
Understanding the LLM Inference Pipeline
Prefill and Decode Phases
LLM inference can be broadly divided into two phases.
Prefill Phase (Prompt Processing): All tokens from the input prompt are processed in parallel at once to generate the KV cache. This phase is a compute-bound operation, where GPU computational power is the key factor.
Decode Phase (Token Generation): Tokens are generated one at a time in an autoregressive manner. Since the entire KV cache must be read at each step, this is a memory-bound operation. It accounts for the majority of total inference time.
# Conceptual representation of the two phases of the LLM inference pipeline
import torch
import time
def llm_inference_pipeline(model, tokenizer, prompt, max_new_tokens=128):
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)

    # Phase 1: Prefill - process the entire input prompt in parallel
    prefill_start = time.time()
    with torch.no_grad():
        outputs = model(input_ids, use_cache=True)
    past_key_values = outputs.past_key_values  # KV cache generation
    next_token_logits = outputs.logits[:, -1, :]
    prefill_time = time.time() - prefill_start

    # Phase 2: Decode - generate tokens one by one autoregressively
    decode_start = time.time()
    generated_tokens = []
    for step in range(max_new_tokens):
        next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)
        generated_tokens.append(next_token.item())
        if next_token.item() == tokenizer.eos_token_id:
            break  # stop before wasting a forward pass on EOS
        with torch.no_grad():
            outputs = model(
                next_token,
                past_key_values=past_key_values,  # reuse the cache
                use_cache=True,
            )
        past_key_values = outputs.past_key_values
        next_token_logits = outputs.logits[:, -1, :]
    decode_time = time.time() - decode_start
    tokens_per_sec = len(generated_tokens) / decode_time

    print(f"Prefill time: {prefill_time:.3f}s (input {input_ids.shape[1]} tokens)")
    print(f"Decode time: {decode_time:.3f}s ({len(generated_tokens)} tokens generated)")
    print(f"Decode speed: {tokens_per_sec:.1f} tokens/s")
    return tokenizer.decode(generated_tokens)
Key Performance Metrics
The following metrics are essential when evaluating LLM inference performance.
| Metric | Description | Influencing Factors |
|---|---|---|
| TTFT (Time to First Token) | Latency until the first token | Prefill speed, queue wait time |
| TPOT (Time Per Output Token) | Interval between output tokens | Decode speed, batch size |
| Throughput (tokens/s) | Tokens processed per second | Batching, parallelism, quantization |
| GPU Memory Utilization | GPU memory usage efficiency | KV Cache management, quantization |
| Latency P99 | 99th percentile latency | Overall system stability |
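TTFT, TPOT, and throughput can all be derived on the client side from per-token arrival timestamps when streaming. A minimal sketch (the timestamps below are fabricated for illustration; in practice they would come from a streaming API client):

```python
import statistics

def latency_metrics(token_times, request_start):
    """Derive TTFT, TPOT, and throughput from output-token arrival times."""
    ttft = token_times[0] - request_start             # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = statistics.mean(gaps) if gaps else 0.0     # mean inter-token gap
    throughput = len(token_times) / (token_times[-1] - request_start)
    return {"ttft": ttft, "tpot": tpot, "throughput_tok_s": throughput}

# Fabricated example: first token after 80 ms, then one token every 20 ms
times = [0.08 + 0.02 * i for i in range(10)]
m = latency_metrics(times, 0.0)
print(f"TTFT={m['ttft']:.3f}s TPOT={m['tpot']:.3f}s "
      f"throughput={m['throughput_tok_s']:.1f} tok/s")
```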
Static Batching vs Continuous Batching
Traditional static batching requires all requests to wait until the longest sequence in the batch completes. This leads to severe GPU resource waste.
Continuous Batching (also known as iteration-level batching) removes completed requests and adds new ones at each decoding step. This significantly improves GPU utilization. Modern serving engines such as vLLM, TGI, and TensorRT-LLM all support continuous batching by default.
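The utilization gap can be illustrated with a toy step-count model (a sketch under the simplifying assumption that each decode step processes one token per active slot; the request lengths are made up):

```python
import math

def static_batch_steps(lengths, batch_size):
    # Static batching: every request in a group occupies its slot until the
    # longest request in that group finishes.
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batch_steps(lengths, batch_size):
    # Continuous batching: a finished request frees its slot immediately,
    # so total work is just the total token count spread over the slots.
    return math.ceil(sum(lengths) / batch_size)

lengths = [32, 512, 64, 512, 16, 128, 256, 48]  # output tokens per request
print(static_batch_steps(lengths, 4))       # 768 steps, many slots idle
print(continuous_batch_steps(lengths, 4))   # 392 steps, near-full slots
```

The gap widens as output lengths become more skewed, which is exactly the situation in real chat traffic.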
KV Cache Optimization: PagedAttention and FlashAttention
The Memory Problem of KV Cache
The attention mechanism in Transformer models requires a KV cache that stores the Key and Value vectors of previous tokens. The size of this cache grows proportionally with sequence length, and in large models it can occupy a significant portion of GPU memory.
For example, with the Llama 3.1 70B model in FP16, the KV cache for a single request alone can consume several gigabytes of memory. In conventional approaches, memory is pre-allocated for the maximum sequence length for each request, resulting in 60-80% memory waste relative to actual usage.
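The gigabyte figure follows from simple arithmetic. Assuming Llama 3.1 70B's published configuration (80 layers, 8 grouped-query KV heads, head dimension 128) and FP16 storage:

```python
# KV cache bytes per token = layers * kv_heads * head_dim * 2 (K and V) * 2 (FP16)
layers, kv_heads, head_dim, bytes_fp16 = 80, 8, 128, 2
bytes_per_token = layers * kv_heads * head_dim * 2 * bytes_fp16

print(f"{bytes_per_token / 1024:.0f} KiB per token")              # 320 KiB
print(f"{bytes_per_token * 8192 / 2**30:.1f} GiB at 8K context")  # 2.5 GiB
```

Pre-allocating that 2.5 GiB for every request up front, even when most responses stop after a few hundred tokens, is precisely the waste PagedAttention targets.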
PagedAttention: An Innovation Inspired by Virtual Memory
PagedAttention, introduced by vLLM, applies the operating system's virtual memory paging technique to KV cache management. The key ideas are as follows.
- Block-level Management: The KV cache is divided into fixed-size blocks and stored in non-contiguous physical memory.
- Block Table: A block table manages the mapping between logical blocks and physical blocks.
- Dynamic Allocation: Blocks are allocated only when actually needed, reducing memory waste to under 4%.
- Memory Sharing: During beam search or parallel sampling, KV cache blocks from the same prompt can be shared using a copy-on-write mechanism.
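The block-table idea can be sketched in a few lines of Python (illustrative only; vLLM's actual block manager and CUDA kernels are far more involved, and the class and method names here are invented):

```python
class BlockTable:
    """Toy PagedAttention-style bookkeeping: physical KV blocks are allocated
    lazily, one at a time, as each request's logical sequence grows."""

    def __init__(self, num_physical_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_physical_blocks))  # free physical block ids
        self.tables = {}   # request id -> list of physical block ids
        self.lengths = {}  # request id -> tokens written so far

    def append_token(self, req_id):
        blocks = self.tables.setdefault(req_id, [])
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:       # current block full: allocate one
            blocks.append(self.free.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        # Finished request: its blocks return to the pool immediately
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

bt = BlockTable(num_physical_blocks=64, block_size=16)
for _ in range(40):            # 40 tokens -> ceil(40/16) = 3 blocks
    bt.append_token("req-A")
print(bt.tables["req-A"])      # three non-contiguous physical block ids
bt.release("req-A")
```

Because a request only ever holds ceil(tokens / block_size) blocks, per-request waste is bounded by one partially filled block, which is where the under-4% figure comes from.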
FlashAttention: IO-Optimized Attention
FlashAttention optimizes attention computation by taking GPU memory hierarchy into account.
- Tiling: The attention matrix is divided into small blocks and processed in SRAM
- Kernel Fusion: Softmax and matrix multiplication are merged into a single CUDA kernel
- Recomputation: Intermediate results are not stored but recomputed when needed, minimizing HBM access
FlashAttention-2 achieved approximately 2x performance improvement over the original version, and FlashAttention-3 added FP8 support and asynchronous execution on the Hopper architecture (H100).
# Memory usage comparison between FlashAttention and standard attention
import torch
from flash_attn import flash_attn_func
# Config: batch=4, heads=32, seq_len=4096, head_dim=128
batch_size, num_heads, seq_len, head_dim = 4, 32, 4096, 128
q = torch.randn(batch_size, seq_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(batch_size, seq_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(batch_size, seq_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
# Standard attention: requires O(N^2) memory (stores entire attention matrix)
# 4096 * 4096 * 32 * 4 * 2 bytes = ~4 GB
# FlashAttention: requires only O(N) memory (processed in tiles)
output = flash_attn_func(q, k, v, causal=True)
# Memory usage grows linearly with sequence length
print(f"FlashAttention output shape: {output.shape}")
vLLM: High-Performance LLM Serving Engine
vLLM Overview and Architecture
vLLM is a high-performance LLM inference and serving engine developed at UC Berkeley. As of 2025, the latest version is v0.17.x, and it integrates various optimization techniques with PagedAttention at its core.
Key features include:
- PagedAttention-based KV cache management
- High GPU utilization through Continuous Batching
- Tensor Parallelism and Pipeline Parallelism support
- Built-in OpenAI-compatible API server
- Support for various quantization formats including AWQ, GPTQ, and FP8
- Prefix Caching for common prompt optimization
vLLM Installation and Basic Usage
# Install vLLM (requires CUDA 12.1 or higher)
pip install vllm
# Launch OpenAI-compatible API server
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--enable-prefix-caching \
--dtype auto \
--port 8000
# Test request
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-70B-Instruct",
"prompt": "The key to LLM inference optimization is",
"max_tokens": 256,
"temperature": 0.7
}'
Using the vLLM Python API
from vllm import LLM, SamplingParams

# Load model (PagedAttention is automatically applied)
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.90,
    max_model_len=4096,
    enable_prefix_caching=True,
    quantization="awq",  # when using an AWQ quantized model
    # enforce_eager=True,  # disable CUDA Graph (for debugging)
)

# Configure sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
    repetition_penalty=1.1,
)

# Batch inference - process multiple prompts simultaneously
prompts = [
    "Explain how to optimize GPU node scheduling in Kubernetes.",
    "What are the pros and cons of asynchronous programming in Python?",
    "Compare the advantages and disadvantages of microservices architecture.",
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    tokens_count = len(output.outputs[0].token_ids)
    print(f"Prompt: {prompt[:50]}...")
    print(f"Generated tokens: {tokens_count}")
    print(f"Response: {generated[:200]}...\n")
vLLM Production Configuration Guide
# vllm-deployment.yaml - Kubernetes deployment example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-70b
  labels:
    app: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.17.1
          command:
            - python
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model
            - meta-llama/Llama-3.1-70B-Instruct
            - --tensor-parallel-size
            - '4'
            - --gpu-memory-utilization
            - '0.90'
            - --max-model-len
            - '8192'
            - --enable-prefix-caching
            - --max-num-seqs
            - '256'
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: '4'
            requests:
              nvidia.com/gpu: '4'
              memory: '64Gi'
              cpu: '16'
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
      nodeSelector:
        nvidia.com/gpu.product: A100-SXM4-80GB
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-server
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP
TensorRT-LLM: NVIDIA-Optimized Inference
Key Changes in TensorRT-LLM 1.0
TensorRT-LLM 1.0 brought two significant changes. First, the PyTorch-based architecture was stabilized and became the default experience. Second, the LLM API was stabilized. The complex engine build process required in previous versions has been greatly simplified.
Key optimization features include:
- FP8 and NVFP4 quantization support
- Disaggregated Serving
- Wide Expert Parallelism (EP) and other parallelization techniques
- EAGLE-3 and Multi-Token Prediction-based Speculative Decoding
- DeepSeek V3/R1 model support
TensorRT-LLM Installation and Inference
# Install TensorRT-LLM (Docker recommended)
docker pull nvcr.io/nvidia/tensorrt-llm:latest
# Or install via pip
pip install tensorrt-llm
# Llama model checkpoint conversion and engine build
# Step 1: Convert HuggingFace model to TensorRT-LLM format
python convert_checkpoint.py \
--model_dir /models/Llama-3.1-70B-Instruct \
--output_dir /engines/llama-70b-ckpt \
--dtype float16 \
--tp_size 4
# Step 2: Build TensorRT engine
trtllm-build \
--checkpoint_dir /engines/llama-70b-ckpt \
--output_dir /engines/llama-70b-engine \
--gemm_plugin float16 \
--max_batch_size 64 \
--max_input_len 4096 \
--max_seq_len 8192 \
--paged_kv_cache enable \
--use_paged_context_fmha enable \
--multiple_profiles enable
# Step 3: Serve with Triton Inference Server
docker run --gpus all -p 8000:8000 \
-v /engines:/engines \
nvcr.io/nvidia/tritonserver:latest \
tritonserver --model-repository=/engines/model_repo
Using the TensorRT-LLM Python API
from tensorrt_llm import LLM, SamplingParams, KvCacheConfig

# KV Cache configuration
kv_cache_config = KvCacheConfig(
    free_gpu_memory_fraction=0.85,
    enable_block_reuse=True,
)

# Load model (can build directly from HuggingFace models)
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    kv_cache_config=kv_cache_config,
)

# Run inference
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

prompts = [
    "Explain the most important factors in LLM inference optimization.",
    "Compare the advantages and limitations of TensorRT-LLM.",
]
outputs = llm.generate(prompts, sampling_params=sampling_params)
for output in outputs:
    print(f"Generated result: {output.outputs[0].text[:200]}")
Speculative Decoding: Draft-and-Verify Acceleration
How It Works
Speculative Decoding is a technique proposed in a 2022 Google paper that achieves faster inference speeds without any loss in output quality.
The core idea is as follows.
- Draft Model: A small, fast model generates (speculates) K tokens in advance.
- Target Model: A large, accurate model verifies all K tokens in a single forward pass in parallel.
- Accept/Reject Decision: By comparing against the target model's probability distribution, matching tokens are accepted, and from the point of mismatch, the target model generates the correct token.
The key insight of this approach is that verification is faster than generation. Verifying K tokens requires only a single forward pass of the target model, so higher acceptance rates lead to greater speedups.
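A sketch of the verification step makes this concrete. The version below is the greedy (temperature-0) special case, where verification reduces to comparing draft tokens against the target's argmax; the original paper uses rejection sampling so that sampled outputs exactly match the target distribution. All tokens and logits here are fabricated toy values:

```python
def argmax(row):
    return max(range(len(row)), key=row.__getitem__)

def verify_greedy(draft_tokens, target_logits):
    """Accept draft tokens while they match the target's argmax; on the first
    mismatch, substitute the target's own token and stop.

    draft_tokens: K tokens proposed by the draft model
    target_logits: K+1 rows of target logits, all from ONE forward pass
    """
    target_choices = [argmax(row) for row in target_logits]
    accepted = []
    for k, tok in enumerate(draft_tokens):
        if tok == target_choices[k]:
            accepted.append(tok)                # draft token verified
        else:
            accepted.append(target_choices[k])  # correction token
            return accepted
    # All K drafts accepted: position K+1 yields one free "bonus" token
    accepted.append(target_choices[len(draft_tokens)])
    return accepted

# Toy vocab of 10; the target's preferred tokens are [5, 7, 9, 2]
logits = [[0.0] * 10 for _ in range(4)]
for pos, tok in enumerate([5, 7, 9, 2]):
    logits[pos][tok] = 1.0

print(verify_greedy([5, 7, 3], logits))  # [5, 7, 9] - mismatch at position 2
print(verify_greedy([5, 7, 9], logits))  # [5, 7, 9, 2] - all accepted + bonus
```

Either way, one target forward pass yields between 1 and K+1 tokens, which is why the acceptance rate directly determines the speedup.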
Configuring Speculative Decoding in vLLM
# Enable Speculative Decoding in vLLM
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--speculative-model meta-llama/Llama-3.1-8B-Instruct \
--num-speculative-tokens 5 \
--speculative-max-model-len 4096 \
--use-v2-block-manager \
--port 8000
# Script to measure the effect of Speculative Decoding
import time
import requests
import statistics

API_URL = "http://localhost:8000/v1/completions"

def measure_latency(prompt, max_tokens=256, n_requests=10):
    """Measure latency and throughput across multiple requests"""
    latencies = []
    total_tokens = []
    for i in range(n_requests):
        payload = {
            "model": "meta-llama/Llama-3.1-70B-Instruct",
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": 0.7,
        }
        start = time.time()
        response = requests.post(API_URL, json=payload)
        elapsed = time.time() - start
        result = response.json()
        completion_tokens = result["usage"]["completion_tokens"]
        latencies.append(elapsed)
        total_tokens.append(completion_tokens)
    avg_latency = statistics.mean(latencies)
    # Note: with only 10 samples this is simply the slowest request;
    # use far more samples for a meaningful P99 in practice.
    p99_latency = sorted(latencies)[min(int(0.99 * len(latencies)), len(latencies) - 1)]
    avg_throughput = statistics.mean(total_tokens) / avg_latency
    print(f"Average latency: {avg_latency:.3f}s")
    print(f"P99 latency: {p99_latency:.3f}s")
    print(f"Average throughput: {avg_throughput:.1f} tokens/s")
    return avg_latency, p99_latency, avg_throughput

# Test prompt
test_prompt = "Explain the following topic in detail: microservices architecture"
print("=== Speculative Decoding Benchmark ===")
measure_latency(test_prompt)
Latest Speculative Decoding Techniques
Significant advances have been made in Speculative Decoding between 2025 and 2026.
- EAGLE-3: An advanced speculation technique integrated into TensorRT-LLM that predicts tokens by leveraging the hidden states of the target model itself, without a separate draft model. This eliminates the need for additional draft model memory.
- Multi-Token Prediction (MTP): A method that predicts multiple tokens simultaneously in a single step, adopted by DeepSeek V3.
- TurboSpec: A closed-loop control system that dynamically adjusts speculation parameters at runtime, adapting to workload and hardware conditions.
- Heterogeneous Vocabulary Support: Algorithms have been developed that allow draft and target models to operate without sharing the same vocabulary, broadening the range of draft model choices. Empirical results have reported speedups of up to 2.8x.
Inference Engine Comparison (vLLM vs TGI vs TensorRT-LLM)
Comprehensive Comparison Table
| Category | vLLM (v0.17.x) | TGI (v3.x) | TensorRT-LLM (v1.0) |
|---|---|---|---|
| Developer | UC Berkeley / vLLM Community | Hugging Face | NVIDIA |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 |
| Core Technology | PagedAttention | FlashAttention-2/3, FlashInfer | TensorRT engine optimization |
| Throughput (req/s) | 120-160 | 100-140 | 180-220 |
| TTFT | 50-80ms | 60-90ms | 35-50ms |
| Setup Difficulty | Low (pip install) | Low (Docker) | High (engine build required) |
| Continuous Batching | Supported | Supported | Supported |
| Quantization | AWQ, GPTQ, FP8 | AWQ, GPTQ, BitsAndBytes | FP8, NVFP4, INT8, INT4 |
| Speculative Decoding | Supported | Supported (limited) | EAGLE-3, MTP support |
| Tensor Parallelism | Supported | Supported | Supported |
| Pipeline Parallelism | Supported | Not supported | Supported |
| Prefix Caching | Supported (automatic) | Supported | Supported |
| Long Context | Moderate | Excellent (13x faster in TGI v3) | Excellent |
| Model Compatibility | Very broad | Broad | NVIDIA GPU only |
| API Compatibility | OpenAI-compatible | Custom API + OpenAI-compatible | Triton-based |
| Community Activity | Very high | High | High |
Recommended Engine by Use Case
When to choose vLLM:
- For rapid prototyping or testing in development environments
- When supporting a wide variety of models (direct loading of HuggingFace models)
- When stable latency is needed under high concurrent connections
- When an OpenAI-compatible API is required
When to choose TGI:
- When tight integration with the Hugging Face ecosystem is needed
- When processing ultra-long contexts of 200K+ tokens (13x speedup in TGI v3)
- When simple Docker-based deployment is preferred
When to choose TensorRT-LLM:
- When absolute peak performance on NVIDIA GPUs is required
- For real-time services where TTFT (Time to First Token) is critically important
- In enterprise environments integrated with NVIDIA Triton Inference Server
- When leveraging the latest quantization techniques such as FP8/NVFP4
Quantization and Inference Optimization (AWQ, GPTQ, FP8)
Quantization Technique Comparison
Quantization reduces memory usage and increases inference speed by lowering the precision of model weights. It is particularly effective in memory-bound environments (small batches).
| Technique | Bits | Memory Savings | Quality Loss | Speedup | Characteristics |
|---|---|---|---|---|---|
| FP16 (base) | 16-bit | - | - | - | Baseline |
| FP8 (W8A8) | 8-bit | ~50% | -2.7% (long context) | 1.5-2x | Native H100 support, no training required |
| AWQ (INT4) | 4-bit | ~75% | -0.2% (long context) | 2-3x | Activation-aware, fast quantization |
| GPTQ (INT4) | 4-bit | ~75% | -1.8% (long context) | 2-3x | Hessian-based optimization, data required |
| NVFP4 | 4-bit | ~75% | Low | 2-3x | TensorRT-LLM only, Blackwell optimized |
Practical Quantization Code
# Example of using AWQ quantized models with vLLM
from vllm import LLM, SamplingParams

# Note: in practice, load one model per process; keeping all three
# 70B variants resident at once would exhaust GPU memory.

# Load AWQ quantized model (use AWQ models directly from HuggingFace)
llm_awq = LLM(
    model="TheBloke/Llama-2-70B-Chat-AWQ",
    quantization="awq",
    tensor_parallel_size=2,  # 4-bit, so 2 GPUs are sufficient
    gpu_memory_utilization=0.90,
    max_model_len=4096,
)

# Load GPTQ quantized model
llm_gptq = LLM(
    model="TheBloke/Llama-2-70B-Chat-GPTQ",
    quantization="gptq",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.90,
)

# FP8 quantization (H100 or above recommended)
llm_fp8 = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    quantization="fp8",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.90,
)

# Performance comparison benchmark
import time

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
test_prompts = ["Explain the pros and cons of microservices architecture."] * 10

for name, model in [("AWQ", llm_awq), ("GPTQ", llm_gptq), ("FP8", llm_fp8)]:
    start = time.time()
    outputs = model.generate(test_prompts, sampling_params)
    elapsed = time.time() - start
    total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"{name}: {elapsed:.2f}s, {total_tokens/elapsed:.1f} tokens/s")
Quantization Selection Guide
Here are recommendations for choosing a quantization technique in practice.
- Start with FP8: If you have H100 or newer GPUs, FP8 can be applied without training data and has minimal quality loss.
- Use AWQ when memory is tight: Among INT4 quantization methods, AWQ has the least quality degradation (-0.2%) and fast quantization speed.
- Be aware of batch size effects: At small batch sizes, the workload is memory-bound so quantization is highly effective. At large batch sizes, it becomes compute-bound and INT4-to-FP16 dequantization overhead can diminish the benefits.
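The memory-bound case in the last point can be checked with roofline arithmetic: at batch size 1, every generated token must stream the full weight matrix from HBM, so bandwidth divided by weight footprint gives an upper bound on decode speed. The bandwidth figure below is NVIDIA's published H100 SXM number and is used purely for illustration:

```python
def decode_ceiling_tok_s(num_params, bytes_per_param, hbm_bytes_per_s):
    """Bandwidth-roofline upper bound on single-request decode speed."""
    return hbm_bytes_per_s / (num_params * bytes_per_param)

HBM = 3.35e12  # H100 SXM HBM3 peak bandwidth, bytes/s

for name, bpp in [("FP16", 2), ("FP8", 1), ("INT4", 0.5)]:
    weights_gb = 70e9 * bpp / 1e9
    print(f"{name}: {weights_gb:.0f} GB weights, "
          f"~{decode_ceiling_tok_s(70e9, bpp, HBM):.0f} tok/s ceiling")
```

Halving the weight precision roughly doubles this ceiling, which is why quantization pays off most at small batch sizes; at large batches, arithmetic throughput rather than bandwidth becomes the limit.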
Operational Considerations and Troubleshooting
GPU Memory Management
The most common issue in production is GPU OOM (Out of Memory). The following should be checked.
- Setting gpu-memory-utilization above 0.90 makes the system vulnerable to temporary memory spikes. A range of 0.85-0.90 is safe.
- Setting max-model-len larger than necessary causes excessive KV cache allocation. Adjust it to match actual usage patterns.
- With tensor parallelism, inter-GPU communication buffers also consume memory. Verify NVLink connectivity status.
Key Monitoring Metrics
# GPU monitoring - using nvidia-smi
watch -n 1 nvidia-smi
# Check vLLM metrics (Prometheus format)
curl http://localhost:8000/metrics | grep -E "vllm_(num_requests|gpu_cache|avg_generation)"
# Key monitoring targets:
# - vllm:num_requests_running: Number of requests currently being processed
# - vllm:num_requests_waiting: Number of requests in queue
# - vllm:gpu_cache_usage_perc: GPU KV cache usage percentage
# - vllm:avg_generation_throughput_toks_per_s: Average token generation throughput
CUDA Graph and Memory Trade-offs
vLLM uses CUDA Graphs to reduce kernel launch overhead. However, CUDA Graphs consume additional GPU memory. If memory is tight, you can disable them with the --enforce-eager option, but this will reduce throughput.
Request Timeouts and Queue Management
You need to prevent long-running requests from blocking the entire system. Use the --max-num-seqs option to limit the number of concurrent requests, and set timeouts at the proxy level.
Failure Scenarios and Recovery Procedures
Case 1: Request Rejection Due to KV Cache Memory Exhaustion
Symptoms: As gpu_cache_usage_perc approaches 100%, new requests wait indefinitely in the queue or are rejected.
Cause: KV cache space is exhausted due to a surge of long-input requests or a spike in concurrent request volume.
Recovery Procedure:
- Reduce max-num-seqs to limit concurrent requests.
- Reduce max-model-len to match actual usage patterns.
- Apply quantization if necessary to reduce the memory footprint of model weights.
- Enable Prefix Caching to share KV cache for common system prompts.
Case 2: TensorRT-LLM Engine Build Failure
Symptoms: OOM errors or compatibility errors during the trtllm-build process.
Cause: Significant GPU memory is required even during builds, and build parameters may not match hardware specifications.
Recovery Procedure:
- Reduce the --max_batch_size and --max_input_len values and retry the build.
- Reduce build parallelism with the --workers option.
- Verify compatibility of GPU driver, CUDA, and TensorRT versions.
- Use Docker images to eliminate environment dependency issues.
Case 3: Low Speculative Decoding Acceptance Rate
Symptoms: Latency actually increases after applying Speculative Decoding.
Cause: When the draft model's prediction accuracy is low and most tokens are rejected, the additional computation from the draft model becomes pure overhead.
Recovery Procedure:
- Reduce num-speculative-tokens (e.g., from 5 to 3).
- Replace the draft model with a smaller model from the same family as the target model (e.g., an 8B draft for a 70B target).
- Monitor the acceptance rate metric and consider changing the draft model or disabling Speculative Decoding if it falls below 60%.
- Consider switching to the EAGLE approach to eliminate draft model memory overhead.
Case 4: Load Balancing Imbalance
Symptoms: Load concentrates on specific instances among multiple vLLM instances.
Cause: Simple round-robin load balancing fails to account for differences in request lengths.
Recovery Procedure:
- Use a Least-Connection or Weighted Round Robin load balancer.
- Distribute requests based on each instance's num_requests_running metric.
- Consider separating instances dedicated to long-context requests from those handling short requests.
Conclusion
LLM inference optimization is not a single technique but a comprehensive engineering effort that combines optimizations across multiple layers. KV cache management (PagedAttention), computation optimization (FlashAttention), decoding acceleration (Speculative Decoding), precision optimization (quantization), and system optimization (Continuous Batching, Tensor Parallelism) must work together organically to achieve optimal performance.
Here are the key lessons from real-world experience.
- Start with vLLM: It is easy to install, has a large community, and provides sufficient performance for most scenarios.
- Consider TensorRT-LLM when TTFT is the top priority: Setup is more complex, but it achieves the best latency performance on NVIDIA GPUs.
- Start quantization with FP8: It has minimal quality loss and is simple to configure. Only go down to INT4 (AWQ) when memory is insufficient.
- Monitor acceptance rate with Speculative Decoding: If it drops below 60%, it is not effective. Draft model selection is the key factor.
- Optimization without monitoring is blind: Always track GPU cache usage, TTFT, throughput, and P99 latency.
LLM inference optimization technology is evolving rapidly. vLLM is preparing its v2 architecture, TensorRT-LLM is strengthening optimizations for next-generation Blackwell GPUs, and Speculative Decoding is advancing with heterogeneous vocabulary support and adaptive control systems. It is important to continuously follow these trends and find the optimal combination for your specific workload.
References
- vLLM Official Docs and GitHub - PagedAttention implementation and latest releases
- TensorRT-LLM GitHub - NVIDIA's optimized inference engine and model conversion guide
- NVIDIA Mastering LLM Techniques: Inference Optimization - Comprehensive guide to LLM inference optimization techniques
- FlashAttention GitHub (Dao-AILab) - FlashAttention implementation and benchmarks
- Google Research: Looking Back at Speculative Decoding - Principles and evolution of Speculative Decoding
- Hugging Face Text Generation Inference - TGI official documentation
- vLLM vs TensorRT-LLM vs TGI Benchmark Comparison (MarkTechPost) - Inference engine performance comparison
- BentoML LLM Inference Handbook: Speculative Decoding - Practical guide to Speculative Decoding