The Complete Guide to LLM Inference Optimization: vLLM, TensorRT-LLM, Speculative Decoding


Introduction

Optimizing inference is just as critical as training when it comes to Large Language Models (LLMs). No matter how capable a model is, it becomes impractical for production use if inference costs are excessive or response latency is too high. Serving models with 70B or more parameters in production requires a comprehensive approach combining GPU memory management, batching strategies, decoding acceleration, quantization, and other optimization techniques.

Since 2025, the field of LLM inference optimization has undergone rapid advancement. vLLM's PagedAttention has reduced KV cache memory waste to under 4%, TensorRT-LLM 1.0 has achieved peak performance on NVIDIA GPUs with a stabilized PyTorch-based architecture and FP8/NVFP4 quantization, and Speculative Decoding has delivered 2-3x speedups without any loss in output quality.

In this post, we analyze the core bottlenecks of LLM inference and provide a comparative analysis of key technologies — vLLM, TensorRT-LLM, Speculative Decoding, and KV Cache optimization — through practical code examples and benchmarks. We also cover operational best practices and troubleshooting scenarios for production environments, aiming to paint a complete picture of LLM inference optimization.

Understanding the LLM Inference Pipeline

Prefill and Decode Phases

LLM inference can be broadly divided into two phases.

Prefill Phase (Prompt Processing): All tokens from the input prompt are processed in parallel at once to generate the KV cache. This phase is a compute-bound operation, where GPU computational power is the key factor.

Decode Phase (Token Generation): Tokens are generated one at a time in an autoregressive manner. Since the entire KV cache must be read at each step, this is a memory-bound operation. It accounts for the majority of total inference time.

# Conceptual representation of the two phases of the LLM inference pipeline
import torch
import time

def llm_inference_pipeline(model, tokenizer, prompt, max_new_tokens=128):
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)

    # Phase 1: Prefill - Process the entire input prompt in parallel
    prefill_start = time.time()
    with torch.no_grad():
        outputs = model(input_ids, use_cache=True)
        past_key_values = outputs.past_key_values  # KV Cache generation
        next_token_logits = outputs.logits[:, -1, :]
    prefill_time = time.time() - prefill_start

    # Phase 2: Decode - Generate tokens one by one autoregressively
    decode_start = time.time()
    generated_tokens = []
    for step in range(max_new_tokens):
        next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)
        if next_token.item() == tokenizer.eos_token_id:
            break  # Stop before appending EOS or spending a forward pass on it
        generated_tokens.append(next_token.item())

        with torch.no_grad():
            outputs = model(
                next_token,
                past_key_values=past_key_values,  # Reuse cache
                use_cache=True,
            )
            past_key_values = outputs.past_key_values
            next_token_logits = outputs.logits[:, -1, :]

    decode_time = time.time() - decode_start
    tokens_per_sec = len(generated_tokens) / decode_time

    print(f"Prefill time: {prefill_time:.3f}s (input {input_ids.shape[1]} tokens)")
    print(f"Decode time: {decode_time:.3f}s ({len(generated_tokens)} tokens generated)")
    print(f"Decode speed: {tokens_per_sec:.1f} tokens/s")

    return tokenizer.decode(generated_tokens)

Key Performance Metrics

The following metrics are essential when evaluating LLM inference performance.

| Metric | Description | Influencing Factors |
| --- | --- | --- |
| TTFT (Time to First Token) | Latency until the first token | Prefill speed, queue wait time |
| TPOT (Time Per Output Token) | Interval between output tokens | Decode speed, batch size |
| Throughput (tokens/s) | Tokens processed per second | Batching, parallelism, quantization |
| GPU Memory Utilization | GPU memory usage efficiency | KV cache management, quantization |
| Latency P99 | 99th percentile latency | Overall system stability |
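TTFT, TPOT, and throughput are easy to derive once a streaming client records per-token arrival timestamps. A minimal sketch (the timestamps here are synthetic, standing in for chunks read off a streaming response):

```python
def derive_metrics(request_start, token_timestamps):
    """Compute TTFT, TPOT, and throughput from per-token arrival times."""
    ttft = token_timestamps[0] - request_start
    n = len(token_timestamps)
    # TPOT: average gap between consecutive output tokens
    tpot = (token_timestamps[-1] - token_timestamps[0]) / max(n - 1, 1)
    throughput = n / (token_timestamps[-1] - request_start)
    return ttft, tpot, throughput

# Synthetic example: first token after 200ms, then one token every 20ms
stamps = [0.2 + 0.02 * i for i in range(50)]
ttft, tpot, tps = derive_metrics(0.0, stamps)
print(f"TTFT={ttft * 1000:.0f}ms, TPOT={tpot * 1000:.0f}ms, {tps:.1f} tokens/s")
```

In production these timestamps come from the serving engine's metrics endpoint or a client-side harness; the arithmetic is the same either way.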

Static Batching vs Continuous Batching

Traditional static batching requires all requests to wait until the longest sequence in the batch completes. This leads to severe GPU resource waste.

Continuous Batching (also known as iteration-level batching) removes completed requests and adds new ones at each decoding step. This significantly improves GPU utilization. Modern serving engines such as vLLM, TGI, and TensorRT-LLM all support continuous batching by default.
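The difference is easy to see with a toy scheduler. The sketch below (illustrative only, not vLLM's actual scheduler) counts decode iterations when finished sequences free their slot immediately versus waiting for the whole batch:

```python
from collections import deque

def decode_steps(request_lengths, max_batch_size, continuous=True):
    """Count decode iterations for a toy batch scheduler."""
    waiting = deque(request_lengths)  # output tokens each request still needs
    running, steps = [], 0
    while waiting or running:
        # Iteration-level scheduling: admit new requests into any free slot.
        # Static batching only admits once the whole batch has drained.
        if continuous or not running:
            while waiting and len(running) < max_batch_size:
                running.append(waiting.popleft())
        running = [r - 1 for r in running if r > 1]  # one decode step each
        steps += 1
    return steps

# Four requests of very different lengths, batch size 2
lengths = [100, 10, 10, 10]
print("static:    ", decode_steps(lengths, 2, continuous=False))  # 110 steps
print("continuous:", decode_steps(lengths, 2, continuous=True))   # 100 steps
```

With static batching the three short requests queue behind the 100-token one; with continuous batching they slip into the slot it is not using, so total iterations drop to the long request's own length.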

KV Cache Optimization: PagedAttention and FlashAttention

The Memory Problem of KV Cache

The attention mechanism in Transformer models requires a KV cache that stores the Key and Value vectors of previous tokens. The size of this cache grows proportionally with sequence length, and in large models it can occupy a significant portion of GPU memory.

For example, with the Llama 3.1 70B model in FP16, the KV cache for a single request alone can consume several gigabytes of memory. In conventional approaches, memory is pre-allocated for the maximum sequence length for each request, resulting in 60-80% memory waste relative to actual usage.
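A back-of-envelope check makes the scale concrete. Per its published config, Llama 3.1 70B has 80 layers with grouped-query attention (8 KV heads, head_dim 128), so each token stores a Key and a Value vector per layer:

```python
def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """FP16 KV cache size for one sequence: K and V per token, per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

print(f"Per token: {kv_cache_bytes(1) / 1024:.0f} KiB")       # 320 KiB
print(f"8K context: {kv_cache_bytes(8192) / 2**30:.1f} GiB")  # 2.5 GiB
```

At 2.5 GiB per 8K-token request, a handful of concurrent long requests exhausts whatever memory the weights leave free, which is why allocation strategy matters so much.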

PagedAttention: An Innovation Inspired by Virtual Memory

PagedAttention, introduced by vLLM, applies the operating system's virtual memory paging technique to KV cache management. The key ideas are as follows.

  1. Block-level Management: The KV cache is divided into fixed-size blocks and stored in non-contiguous physical memory.
  2. Block Table: A block table manages the mapping between logical blocks and physical blocks.
  3. Dynamic Allocation: Blocks are allocated only when actually needed, reducing memory waste to under 4%.
  4. Memory Sharing: During beam search or parallel sampling, KV cache blocks from the same prompt can be shared using a copy-on-write mechanism.
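The bookkeeping behind ideas 1-3 fits in a few lines. This toy block table (illustrative; vLLM's real logic lives in its block manager, and 16 tokens per block is its long-standing default) allocates a physical block only when a sequence crosses a block boundary:

```python
BLOCK_SIZE = 16  # tokens per KV-cache block

class BlockTable:
    """Maps a sequence's logical blocks to physical blocks (toy model)."""

    def __init__(self, free_physical_blocks):
        self.free = list(free_physical_blocks)  # pool of physical block ids
        self.table = []  # logical block index -> physical block id

    def slot_for_token(self, token_idx):
        # Dynamic allocation: grab a physical block only at block boundaries
        if token_idx % BLOCK_SIZE == 0:
            self.table.append(self.free.pop())
        return self.table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

bt = BlockTable(free_physical_blocks=range(1024))
slots = [bt.slot_for_token(i) for i in range(40)]
print(f"40 tokens -> {len(bt.table)} physical blocks: {bt.table}")
```

Because the mapping is indirect, consecutive logical blocks can live anywhere in GPU memory, and two sequences sharing a prefix can point at the same physical blocks until a write forces a copy.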

FlashAttention: IO-Optimized Attention

FlashAttention optimizes attention computation by taking GPU memory hierarchy into account.

  • Tiling: The attention matrix is divided into small blocks and processed in SRAM
  • Kernel Fusion: Softmax and matrix multiplication are merged into a single CUDA kernel
  • Recomputation: Intermediate results are not stored but recomputed when needed, minimizing HBM access

FlashAttention-2 achieved approximately 2x performance improvement over the original version, and FlashAttention-3 added FP8 support and asynchronous execution on the Hopper architecture (H100).

# Memory usage comparison between FlashAttention and standard attention
import torch
from flash_attn import flash_attn_func

# Config: batch=4, heads=32, seq_len=4096, head_dim=128
batch_size, num_heads, seq_len, head_dim = 4, 32, 4096, 128

q = torch.randn(batch_size, seq_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(batch_size, seq_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(batch_size, seq_len, num_heads, head_dim, dtype=torch.float16, device="cuda")

# Standard attention: requires O(N^2) memory (stores entire attention matrix)
# 4096 * 4096 * 32 * 4 * 2 bytes = ~4 GB

# FlashAttention: requires only O(N) memory (processed in tiles)
output = flash_attn_func(q, k, v, causal=True)
# Memory usage grows linearly with sequence length
print(f"FlashAttention output shape: {output.shape}")

vLLM: High-Performance LLM Serving Engine

vLLM Overview and Architecture

vLLM is a high-performance LLM inference and serving engine developed at UC Berkeley. As of 2025, the latest version is v0.17.x, and it integrates various optimization techniques with PagedAttention at its core.

Key features include:

  • PagedAttention-based KV cache management
  • High GPU utilization through Continuous Batching
  • Tensor Parallelism and Pipeline Parallelism support
  • Built-in OpenAI-compatible API server
  • Support for various quantization formats including AWQ, GPTQ, and FP8
  • Prefix Caching for common prompt optimization

vLLM Installation and Basic Usage

# Install vLLM (requires CUDA 12.1 or higher)
pip install vllm

# Launch OpenAI-compatible API server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --enable-prefix-caching \
    --dtype auto \
    --port 8000

# Test request
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-70B-Instruct",
        "prompt": "The key to LLM inference optimization is",
        "max_tokens": 256,
        "temperature": 0.7
    }'

Using the vLLM Python API

from vllm import LLM, SamplingParams

# Load model (PagedAttention is automatically applied)
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.90,
    max_model_len=4096,
    enable_prefix_caching=True,
    quantization="awq",            # When using an AWQ quantized model
    # enforce_eager=True,           # Disable CUDA Graph (for debugging)
)

# Configure sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
    repetition_penalty=1.1,
)

# Batch inference - process multiple prompts simultaneously
prompts = [
    "Explain how to optimize GPU node scheduling in Kubernetes.",
    "What are the pros and cons of asynchronous programming in Python?",
    "Compare the advantages and disadvantages of microservices architecture.",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    tokens_count = len(output.outputs[0].token_ids)
    print(f"Prompt: {prompt[:50]}...")
    print(f"Generated tokens: {tokens_count}")
    print(f"Response: {generated[:200]}...\n")

vLLM Production Configuration Guide

# vllm-deployment.yaml - Kubernetes deployment example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-70b
  labels:
    app: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.17.1
          command:
            - python
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model
            - meta-llama/Llama-3.1-70B-Instruct
            - --tensor-parallel-size
            - '4'
            - --gpu-memory-utilization
            - '0.90'
            - --max-model-len
            - '8192'
            - --enable-prefix-caching
            - --max-num-seqs
            - '256'
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: '4'
            requests:
              nvidia.com/gpu: '4'
              memory: '64Gi'
              cpu: '16'
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
      nodeSelector:
        nvidia.com/gpu.product: A100-SXM4-80GB
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-server
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP

TensorRT-LLM: NVIDIA-Optimized Inference

Key Changes in TensorRT-LLM 1.0

TensorRT-LLM 1.0 brought two significant changes. First, the PyTorch-based architecture was stabilized and became the default experience. Second, the LLM API was stabilized. The complex engine build process required in previous versions has been greatly simplified.

Key optimization features include:

  • FP8 and NVFP4 quantization support
  • Disaggregated Serving
  • Wide Expert Parallelism (EP) and other parallelization techniques
  • EAGLE-3 and Multi-Token Prediction-based Speculative Decoding
  • DeepSeek V3/R1 model support

TensorRT-LLM Installation and Inference

# Install TensorRT-LLM (Docker recommended)
docker pull nvcr.io/nvidia/tensorrt-llm:latest

# Or install via pip
pip install tensorrt-llm

# Llama model checkpoint conversion and engine build
# Step 1: Convert HuggingFace model to TensorRT-LLM format
python convert_checkpoint.py \
    --model_dir /models/Llama-3.1-70B-Instruct \
    --output_dir /engines/llama-70b-ckpt \
    --dtype float16 \
    --tp_size 4

# Step 2: Build TensorRT engine
trtllm-build \
    --checkpoint_dir /engines/llama-70b-ckpt \
    --output_dir /engines/llama-70b-engine \
    --gemm_plugin float16 \
    --max_batch_size 64 \
    --max_input_len 4096 \
    --max_seq_len 8192 \
    --paged_kv_cache enable \
    --use_paged_context_fmha enable \
    --multiple_profiles enable

# Step 3: Serve with Triton Inference Server
docker run --gpus all -p 8000:8000 \
    -v /engines:/engines \
    nvcr.io/nvidia/tritonserver:latest \
    tritonserver --model-repository=/engines/model_repo

Using the TensorRT-LLM Python API

import tensorrt_llm
from tensorrt_llm import LLM, SamplingParams, KvCacheConfig

# KV Cache configuration
kv_cache_config = KvCacheConfig(
    free_gpu_memory_fraction=0.85,
    enable_block_reuse=True,
)

# Load model (can build directly from HuggingFace models)
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    kv_cache_config=kv_cache_config,
)

# Run inference
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

prompts = [
    "Explain the most important factors in LLM inference optimization.",
    "Compare the advantages and limitations of TensorRT-LLM.",
]

outputs = llm.generate(prompts, sampling_params=sampling_params)

for output in outputs:
    print(f"Generated result: {output.outputs[0].text[:200]}")

Speculative Decoding: Draft-and-Verify Acceleration

How It Works

Speculative Decoding is a technique proposed by Google researchers in 2022 (Leviathan et al., "Fast Inference from Transformers via Speculative Decoding") that achieves faster inference speeds without any loss in output quality.

The core idea is as follows.

  1. Draft Model: A small, fast model generates (speculates) K tokens in advance.
  2. Target Model: A large, accurate model verifies all K tokens in a single forward pass in parallel.
  3. Accept/Reject Decision: By comparing against the target model's probability distribution, matching tokens are accepted, and from the point of mismatch, the target model generates the correct token.

The key insight of this approach is that verification is faster than generation. Verifying K tokens requires only a single forward pass of the target model, so higher acceptance rates lead to greater speedups.
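The accept/reject loop is easiest to illustrate with greedy decoding, where verification reduces to prefix matching against the target's argmax. Real speculative sampling compares full probability distributions; the two "models" below are stand-in functions, and each call to `target_next_fn` stands in for reading one position of the target's single verification pass:

```python
def verify_draft(context, draft_tokens, target_next_fn):
    """Accept the longest draft prefix the target agrees with (greedy toy)."""
    accepted = []
    for t in draft_tokens:
        expected = target_next_fn(context + accepted)
        if t != expected:
            accepted.append(expected)  # first mismatch: target's token wins
            return accepted
        accepted.append(t)
    # All K drafts accepted: the same verification pass yields a bonus token
    accepted.append(target_next_fn(context + accepted))
    return accepted

# Stand-in target that deterministically continues 0, 1, 2, ...
target = lambda ctx: len(ctx)
print(verify_draft([], [0, 1, 7], target))  # mismatch at the third token
print(verify_draft([], [0, 1, 2], target))  # all accepted, plus bonus token
```

Note that even a full rejection still yields one correct token, so quality never degrades; only the speedup depends on how often the draft agrees with the target.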

Configuring Speculative Decoding in vLLM

# Enable Speculative Decoding in vLLM
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --speculative-model meta-llama/Llama-3.1-8B-Instruct \
    --num-speculative-tokens 5 \
    --speculative-max-model-len 4096 \
    --use-v2-block-manager \
    --port 8000
# Script to measure the effect of Speculative Decoding
import time
import requests
import json
import statistics

API_URL = "http://localhost:8000/v1/completions"

def measure_latency(prompt, max_tokens=256, n_requests=10):
    """Measure latency and throughput across multiple requests"""
    latencies = []
    total_tokens = []

    for i in range(n_requests):
        payload = {
            "model": "meta-llama/Llama-3.1-70B-Instruct",
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": 0.7,
        }

        start = time.time()
        response = requests.post(API_URL, json=payload)
        elapsed = time.time() - start

        result = response.json()
        completion_tokens = result["usage"]["completion_tokens"]

        latencies.append(elapsed)
        total_tokens.append(completion_tokens)

    avg_latency = statistics.mean(latencies)
    p99_latency = sorted(latencies)[int(0.99 * len(latencies))]
    avg_throughput = statistics.mean(total_tokens) / avg_latency

    print(f"Average latency: {avg_latency:.3f}s")
    print(f"P99 latency: {p99_latency:.3f}s")
    print(f"Average throughput: {avg_throughput:.1f} tokens/s")
    return avg_latency, p99_latency, avg_throughput

# Test prompt
test_prompt = "Explain the following topic in detail: microservices architecture"

print("=== Speculative Decoding Benchmark ===")
measure_latency(test_prompt)

Latest Speculative Decoding Techniques

Significant advances have been made in Speculative Decoding between 2025 and 2026.

  • EAGLE-3: An advanced speculation technique integrated into TensorRT-LLM that predicts tokens by leveraging the hidden states of the target model itself, without a separate draft model. This eliminates the need for additional draft model memory.
  • Multi-Token Prediction (MTP): A method that predicts multiple tokens simultaneously in a single step, adopted by DeepSeek V3.
  • TurboSpec: A closed-loop control system that dynamically adjusts speculation parameters at runtime, adapting to workload and hardware conditions.
  • Heterogeneous Vocabulary Support: Algorithms have been developed that allow draft and target models to operate without sharing the same vocabulary, broadening the range of draft model choices. Empirical results have reported speedups of up to 2.8x.

Inference Engine Comparison (vLLM vs TGI vs TensorRT-LLM)

Comprehensive Comparison Table

| Category | vLLM (v0.17.x) | TGI (v3.x) | TensorRT-LLM (v1.0) |
| --- | --- | --- | --- |
| Developer | UC Berkeley / vLLM Community | Hugging Face | NVIDIA |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 |
| Core Technology | PagedAttention | FlashAttention-2/3, FlashInfer | TensorRT engine optimization |
| Throughput (req/s) | 120-160 | 100-140 | 180-220 |
| TTFT | 50-80ms | 60-90ms | 35-50ms |
| Setup Difficulty | Low (pip install) | Low (Docker) | High (engine build required) |
| Continuous Batching | Supported | Supported | Supported |
| Quantization | AWQ, GPTQ, FP8 | AWQ, GPTQ, BitsAndBytes | FP8, NVFP4, INT8, INT4 |
| Speculative Decoding | Supported | Supported (limited) | EAGLE-3, MTP support |
| Tensor Parallelism | Supported | Supported | Supported |
| Pipeline Parallelism | Supported | Not supported | Supported |
| Prefix Caching | Supported (automatic) | Supported | Supported |
| Long Context | Average | Excellent (13x faster in TGI v3) | Excellent |
| Model Compatibility | Very broad | Broad | NVIDIA GPU only |
| API Compatibility | OpenAI-compatible | Custom API + OpenAI-compatible | Triton-based |
| Community Activity | Very high | High | High |

When to choose vLLM:

  • For rapid prototyping or testing in development environments
  • When supporting a wide variety of models (direct loading of HuggingFace models)
  • When stable latency is needed under high concurrent connections
  • When an OpenAI-compatible API is required

When to choose TGI:

  • When tight integration with the Hugging Face ecosystem is needed
  • When processing ultra-long contexts of 200K+ tokens (13x speedup in TGI v3)
  • When simple Docker-based deployment is preferred

When to choose TensorRT-LLM:

  • When absolute peak performance on NVIDIA GPUs is required
  • For real-time services where TTFT (Time to First Token) is critically important
  • In enterprise environments integrated with NVIDIA Triton Inference Server
  • When leveraging the latest quantization techniques such as FP8/NVFP4

Quantization and Inference Optimization (AWQ, GPTQ, FP8)

Quantization Technique Comparison

Quantization reduces memory usage and increases inference speed by lowering the precision of model weights. It is particularly effective in memory-bound environments (small batches).

| Technique | Bits | Memory Savings | Quality Loss | Speedup | Characteristics |
| --- | --- | --- | --- | --- | --- |
| FP16 (base) | 16-bit | - | - | - | Baseline |
| FP8 (W8A8) | 8-bit | ~50% | -2.7% (long context) | 1.5-2x | Native H100 support, no training required |
| AWQ (INT4) | 4-bit | ~75% | -0.2% (long context) | 2-3x | Activation-aware, fast quantization |
| GPTQ (INT4) | 4-bit | ~75% | -1.8% (long context) | 2-3x | Hessian-based optimization, data required |
| NVFP4 | 4-bit | ~75% | Low | 2-3x | TensorRT-LLM only, Blackwell optimized |

Practical Quantization Code

# Example of using AWQ quantized models with vLLM
from vllm import LLM, SamplingParams

# Load AWQ quantized model (use AWQ models directly from HuggingFace)
llm_awq = LLM(
    model="TheBloke/Llama-2-70B-Chat-AWQ",
    quantization="awq",
    tensor_parallel_size=2,        # 4-bit, so 2 GPUs are sufficient
    gpu_memory_utilization=0.90,
    max_model_len=4096,
)

# Load GPTQ quantized model
llm_gptq = LLM(
    model="TheBloke/Llama-2-70B-Chat-GPTQ",
    quantization="gptq",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.90,
)

# FP8 quantization (H100 or above recommended)
llm_fp8 = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    quantization="fp8",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.90,
)

# Performance comparison benchmark (illustrative: in practice load and test
# one model at a time; three 70B variants will not fit on one node together)
import time

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
test_prompts = ["Explain the pros and cons of microservices architecture."] * 10

for name, model in [("AWQ", llm_awq), ("GPTQ", llm_gptq), ("FP8", llm_fp8)]:
    start = time.time()
    outputs = model.generate(test_prompts, sampling_params)
    elapsed = time.time() - start
    total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"{name}: {elapsed:.2f}s, {total_tokens/elapsed:.1f} tokens/s")

Quantization Selection Guide

Here are recommendations for choosing a quantization technique in practice.

  • Start with FP8: If you have H100 or newer GPUs, FP8 can be applied without training data and has minimal quality loss.
  • Use AWQ when memory is tight: Among INT4 quantization methods, AWQ has the least quality degradation (-0.2%) and fast quantization speed.
  • Be aware of batch size effects: At small batch sizes, the workload is memory-bound so quantization is highly effective. At large batch sizes, it becomes compute-bound and INT4-to-FP16 dequantization overhead can diminish the benefits.
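The batch-size effect follows from a roofline-style estimate: at batch size 1, every generated token must stream all of the weights through HBM, so the ceiling on decode speed is roughly memory bandwidth divided by model size. The numbers below are rough assumptions (~3350 GB/s for H100 SXM HBM3, ~2 bytes per weight in FP16 for a 70B model):

```python
def batch1_decode_ceiling(bandwidth_gb_s, model_size_gb):
    """Rough upper bound on tokens/s at batch size 1 (memory-bound decode)."""
    return bandwidth_gb_s / model_size_gb

H100_BW = 3350  # GB/s, approximate HBM3 bandwidth of an H100 SXM
for fmt, size_gb in [("FP16", 140), ("FP8", 70), ("INT4", 35)]:
    ceiling = batch1_decode_ceiling(H100_BW, size_gb)
    print(f"{fmt} 70B: ~{ceiling:.0f} tokens/s ceiling at batch 1")
```

Going from FP16 to INT4 quadruples this ceiling, which is why quantization gains are largest at small batches; at large batches the bound shifts to FLOPs and dequantization overhead eats into the benefit.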

Operational Considerations and Troubleshooting

GPU Memory Management

The most common issue in production is GPU OOM (Out of Memory). The following should be checked.

  • Setting gpu-memory-utilization above 0.90 makes the system vulnerable to temporary memory spikes. A range of 0.85-0.90 is safe.
  • Setting max-model-len larger than necessary causes excessive KV cache allocation. Adjust it to match actual usage patterns.
  • With tensor parallelism, inter-GPU communication buffers also consume memory. Verify NVLink connectivity status.
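A quick budget sanity check helps before tuning these flags. The split below is a simplification (the fixed overhead reserve for activations and CUDA graphs is a made-up illustrative figure; real overhead depends on batch size and configuration):

```python
def kv_cache_budget_gib(gpu_gib, gpu_memory_utilization, weights_gib, overhead_gib=4.0):
    """GiB left for KV cache after weights and a rough activation/graph reserve."""
    return gpu_gib * gpu_memory_utilization - weights_gib - overhead_gib

# Llama 3.1 70B in FP16 (~140 GiB of weights) across 4x A100-80GB with TP=4
per_gpu_weights = 140 / 4
budget = kv_cache_budget_gib(80, 0.90, per_gpu_weights)
print(f"~{budget:.0f} GiB per GPU for KV cache")
```

If this number comes out near zero, the server will start but reject or queue most requests, so either quantize the weights, shorten max-model-len, or add GPUs.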

Key Monitoring Metrics

# GPU monitoring - using nvidia-smi
watch -n 1 nvidia-smi

# Check vLLM metrics (Prometheus format)
curl http://localhost:8000/metrics | grep -E "vllm:(num_requests|gpu_cache|avg_generation)"

# Key monitoring targets:
# - vllm:num_requests_running: Number of requests currently being processed
# - vllm:num_requests_waiting: Number of requests in queue
# - vllm:gpu_cache_usage_perc: GPU KV cache usage percentage
# - vllm:avg_generation_throughput_toks_per_s: Average token generation throughput

CUDA Graph and Memory Trade-offs

vLLM uses CUDA Graphs to reduce kernel launch overhead. However, CUDA Graphs consume additional GPU memory. If memory is tight, you can disable them with the --enforce-eager option, but this will reduce throughput.

Request Timeouts and Queue Management

You need to prevent long-running requests from blocking the entire system. Use the --max-num-seqs option to limit the number of concurrent requests, and set timeouts at the proxy level.

Failure Scenarios and Recovery Procedures

Case 1: Request Rejection Due to KV Cache Memory Exhaustion

Symptoms: As gpu_cache_usage_perc approaches 100%, new requests wait indefinitely in the queue or are rejected.

Cause: KV cache space is exhausted due to a surge of long-input requests or a spike in concurrent request volume.

Recovery Procedure:

  1. Reduce max-num-seqs to limit concurrent requests.
  2. Reduce max-model-len to match actual usage patterns.
  3. Apply quantization if necessary to reduce memory footprint of model weights.
  4. Enable Prefix Caching to share KV cache for common system prompts.

Case 2: TensorRT-LLM Engine Build Failure

Symptoms: OOM errors or compatibility errors during the trtllm-build process.

Cause: Significant GPU memory is required even during builds, and build parameters may not match hardware specifications.

Recovery Procedure:

  1. Reduce --max_batch_size and --max_input_len values and retry the build.
  2. Reduce build parallelism with the --workers option.
  3. Verify compatibility of GPU driver, CUDA, and TensorRT versions.
  4. Use Docker images to eliminate environment dependency issues.

Case 3: Low Speculative Decoding Acceptance Rate

Symptoms: Latency actually increases after applying Speculative Decoding.

Cause: When the draft model's prediction accuracy is low and most tokens are rejected, the additional computation from the draft model becomes pure overhead.

Recovery Procedure:

  1. Reduce num-speculative-tokens (from 5 or below to 3 or below).
  2. Replace the draft model with a smaller model from the same family as the target model (e.g., 8B draft for a 70B target).
  3. Monitor the acceptance rate metric and consider changing the draft model or disabling Speculative Decoding if it falls below 60%.
  4. Consider switching to the EAGLE approach to eliminate draft model memory overhead.
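The break-even math behind these steps comes from the expected number of tokens emitted per target forward pass: with per-token acceptance rate alpha and K draft tokens, Leviathan et al. give E = (1 - alpha^(K+1)) / (1 - alpha). The sketch below ignores the draft model's own latency, which pushes the real break-even point even higher:

```python
def expected_tokens_per_target_pass(alpha, k):
    """E[tokens emitted per target forward pass] at acceptance rate alpha."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.8, 0.6, 0.4):
    e = expected_tokens_per_target_pass(alpha, k=5)
    print(f"acceptance {alpha:.0%}: {e:.2f} tokens per target pass")
```

At 40% acceptance the target emits barely 1.7 tokens per pass, so once draft-model latency is added the net speedup can go negative, which is the reasoning behind the 60% threshold in step 3.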

Case 4: Load Balancing Imbalance

Symptoms: Load concentrates on specific instances among multiple vLLM instances.

Cause: Simple round-robin load balancing fails to account for differences in request lengths.

Recovery Procedure:

  1. Use a Least-Connection or Weighted Round Robin load balancer.
  2. Distribute requests based on each instance's num_requests_running metric.
  3. Consider separating instances dedicated to long-context requests from those handling short requests.
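Step 2 above becomes a one-liner once each replica's gauge is scraped (the replica URLs and load numbers here are illustrative):

```python
def pick_replica(running_by_replica):
    """Route to the replica with the fewest in-flight requests."""
    return min(running_by_replica, key=running_by_replica.get)

# Per-instance values of the vllm:num_requests_running gauge, scraped
# from each replica's /metrics endpoint
loads = {
    "http://vllm-0:8000": 17,
    "http://vllm-1:8000": 4,
    "http://vllm-2:8000": 11,
}
print(pick_replica(loads))  # least-loaded instance
```

In practice the same logic is usually expressed as a least-connection policy in the load balancer rather than application code, but the selection criterion is identical.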

Conclusion

LLM inference optimization is not a single technique but a comprehensive engineering effort that combines optimizations across multiple layers. KV cache management (PagedAttention), computation optimization (FlashAttention), decoding acceleration (Speculative Decoding), precision optimization (quantization), and system optimization (Continuous Batching, Tensor Parallelism) must work together organically to achieve optimal performance.

Here are the key lessons from real-world experience.

  1. Start with vLLM: It is easy to install, has a large community, and provides sufficient performance for most scenarios.
  2. Consider TensorRT-LLM when TTFT is the top priority: Setup is more complex, but it achieves the best latency performance on NVIDIA GPUs.
  3. Start quantization with FP8: It has minimal quality loss and is simple to configure. Only go down to INT4 (AWQ) when memory is insufficient.
  4. Monitor acceptance rate with Speculative Decoding: If it drops below 60%, it is not effective. Draft model selection is the key factor.
  5. Optimization without monitoring is blind: Always track GPU cache usage, TTFT, throughput, and P99 latency.

LLM inference optimization technology is evolving rapidly. vLLM is preparing its v2 architecture, TensorRT-LLM is strengthening optimizations for next-generation Blackwell GPUs, and Speculative Decoding is advancing with heterogeneous vocabulary support and adaptive control systems. It is important to continuously follow these trends and find the optimal combination for your specific workload.
