AI Inference Engines 2026 - vLLM · SGLang · llama.cpp · TGI · TensorRT-LLM · MLX · mistral.rs · DeepSpeed-MII · Aphrodite Deep Dive

Prologue — In 2026, Inference Costs More Than Training
1. Why Inference Is the 2026 Battleground
2. PagedAttention and Continuous Batching — The Origin Story
3. vLLM — The De Facto Standard
4. SGLang — Strong on Prefix and Structured Output
5. TensorRT-LLM — King on Blackwell and H100
6. Hugging Face TGI 3.x — Rewritten in Rust
7. llama.cpp — A Universe Built by One Person
8. MLX and MLX-LM — Apple Silicon's Answer
9. mistral.rs — A Multi-Model Engine in Rust
10. DeepSpeed-MII and DeepSpeed Inference — Microsoft's Answer
11. Aphrodite Engine — vLLM Fork for Quantization
12. CTranslate2, ExLlamaV3, OpenVINO — Special-Purpose Engines
13. Triton Inference Server — The Production Wrapper
14. The Quantization Zoo
15. KV Cache Management — The Second Memory War
16. Speculative Decoding — Decode 2-3x Faster
17. Disaggregated Inference — Split Prefill and Decode
18. NVIDIA NIM, Triton, Dynamo — The Enterprise Stack
19. Ollama, LM Studio, Jan — Desktop Inference
20. Alternative Hardware — Groq, Cerebras, SambaNova
21. Inference API Pricing — The Self-Host Decision Line
22. Self-Hosting ROI — H100 vs H200
23. Korean Inference Infrastructure
24. Japanese Inference Infrastructure
25. Engine Selection Guide by Workload
26. Conclusion — Engines Are Decided by Workloads
27. References

Prologue — In 2026, Inference Costs More Than Training

LLM engineering in 2023 was "which model do we use." LLM engineering in 2026 is "how do we serve that model."

The reason is simple. You train once, but inference happens on every request. Serve the same Llama 3.1 405B and your per-token cost can differ by 5-10x depending on engine choice. Between a team that uses an H100 well and one that uses it poorly, the throughput gap can reach 30x.

GPUs are expensive. Used wrong, they are more expensive. The inference engine decides GPU ROI.

This piece dissects the inference engine landscape as of May 2026. vLLM, SGLang, TensorRT-LLM, TGI, llama.cpp, MLX, mistral.rs, DeepSpeed-MII, Aphrodite, CTranslate2, ExLlamaV3, OpenVINO, AWS Neuron, Triton — and the technologies underneath: PagedAttention, Continuous Batching, Speculative Decoding, Disaggregated Inference, KV quantization. Plus Korean and Japanese inference infrastructure and self-hosting ROI.

1. Why Inference Is the 2026 Battleground

In 2024 OpenAI's inference spend was reportedly about 3x its training cost. In 2026 that gap has widened. Training is one-shot. Inference is forever.

The four determinants of inference cost:

Factor	Meaning	Impact
TTFT (Time To First Token)	Time until first token	UX, critical for chat
TPS (Tokens Per Second)	Output tokens per second for a single request	Felt generation speed
Throughput	Total tokens per second across all concurrent requests	Per-GPU capacity, unit cost
Latency P99	99th percentile response time	Tail latency, SLA

Engine choice is a trade-off across these four. Raise throughput and P99 breaks; lower TTFT and TPS suffers. An inference engine is essentially a choice of which trade-off to optimize for which workload.

   +--------------------------------------------+
   |  Throughput     ----> vLLM, SGLang          |
   |  Lowest TTFT    ----> TensorRT-LLM          |
   |  Local / CPU    ----> llama.cpp             |
   |  Apple Silicon  ----> MLX-LM                |
   |  Edge / quant   ----> Aphrodite             |
   |  Prod wrapper   ----> Triton, TGI           |
   |  Serverless API ----> Together, Fireworks   |
   +--------------------------------------------+
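To make the four metrics concrete, here is a minimal sketch of how they could be computed from raw timestamps. The function names and the synthetic latency list are illustrative assumptions, not part of any engine's API.

```python
import math

def request_metrics(sent_at: float, token_times: list[float]) -> dict[str, float]:
    """Per-request TTFT and decode TPS from wall-clock timestamps (seconds)."""
    ttft = token_times[0] - sent_at
    decode_time = token_times[-1] - token_times[0]
    # TPS is measured over the decode phase, so the first token is excluded.
    tps = (len(token_times) - 1) / decode_time if decode_time > 0 else float("nan")
    return {"ttft": ttft, "tps": tps, "latency": token_times[-1] - sent_at}

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=99 for P99."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def throughput(total_tokens: int, wall_seconds: float) -> float:
    """Aggregate tokens per second across all concurrent requests."""
    return total_tokens / wall_seconds

m = request_metrics(sent_at=0.0, token_times=[0.25, 0.5, 0.75, 1.0, 1.25])
latencies = [0.5 + 0.01 * i for i in range(100)]  # synthetic: 0.50 .. 1.49 s
print(m, percentile(latencies, 99))
```

TTFT and TPS are per-request numbers, while throughput and P99 only make sense over a whole population of requests, which is why an engine can look good on one pair and bad on the other.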

2. PagedAttention and Continuous Batching — The Origin Story

The inference world is split into before and after vLLM's 2023 SOSP paper (Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention").

Problem: KV cache length varies per sequence. Pre-allocating contiguous memory for the maximum possible length wastes 60-80 percent of KV memory to fragmentation and over-reservation. Mix short and long requests and it gets worse.

Solution: Slice KV cache into fixed-size pages like an OS virtual memory system and map via page tables. Fragmentation drops near zero.

Old way:                       PagedAttention:
[KKKK_____] (50% wasted)       [P1][P2][P3] (page table)
[KKK______] (60% wasted)       [P4][P5]    (allocate on demand)
[KKKKKKKK_] (10% wasted)
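The page-table idea can be sketched in a few lines. This is a toy allocator under stated assumptions (a page size of 16 tokens, a hypothetical class name), not vLLM's actual block manager.

```python
class PagedKVAllocator:
    """Toy block-table allocator: KV memory is a pool of fixed-size pages."""

    def __init__(self, num_pages: int, page_size: int = 16):
        self.page_size = page_size            # tokens per page (assumed value)
        self.free_pages = list(range(num_pages))
        self.block_tables: dict[str, list[int]] = {}  # sequence id -> physical pages
        self.lengths: dict[str, int] = {}             # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, taking a new page only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.page_size:  # last page is full (or no page yet)
            if not self.free_pages:
                raise MemoryError("out of KV pages; a real engine would preempt or swap")
            table.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return all pages of a finished sequence to the pool."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Allocated but unused token slots: at most page_size - 1 per sequence."""
        return sum(len(t) * self.page_size - self.lengths.get(s, 0)
                   for s, t in self.block_tables.items())

alloc = PagedKVAllocator(num_pages=8, page_size=16)
for _ in range(20):
    alloc.append_token("a")   # 20 tokens -> 2 pages
for _ in range(5):
    alloc.append_token("b")   # 5 tokens -> 1 page
print(alloc.block_tables, alloc.wasted_slots())
```

Waste is bounded by one partially filled page per sequence, regardless of how long the sequence could have grown.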

On top of this comes Continuous Batching (Orca paper, 2022). Static batching waits until every request in the batch finishes. Continuous batching slots in new requests at token granularity — the GPU never sits idle.

These two are table stakes in 2026. Without them, you do not have a serious inference engine.
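The difference between static and continuous batching shows up even in a toy step counter. The simulation below is a sketch that assumes every request decodes one token per step and ignores prefill; the function names are illustrative.

```python
def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """A finished request frees its slot for the next waiting one at every step."""
    waiting = list(lengths)          # remaining output tokens per queued request
    running: list[int] = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))       # admit at iteration granularity
        running = [r - 1 for r in running]       # one decode step for every slot
        running = [r for r in running if r > 0]  # retire finished requests
        steps += 1
    return steps

lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_batching_steps(lengths, 4), continuous_batching_steps(lengths, 4))
```

With two long requests mixed into short ones, static batching needs 200 steps and continuous batching 105, because the second long request starts as soon as a slot frees up instead of waiting for the first batch to drain.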

3. vLLM — The De Facto Standard

vLLM began in 2023 at UC Berkeley Sky Computing Lab. By 2026 it is effectively the default for open-source inference. The project joined the LF AI and Data Foundation in 2024.

vLLM V1 engine (0.7+): Released in 2025, V1 is 1.5-2x faster than V0, with async scheduling, chunked prefill, torch.compile integration, and built-in multimodal support.

Key features:

Feature	Description
Prefix Caching	Cache KV of shared prompts; TTFT can drop by up to about 90 percent when a long system prompt is reused
Speculative Decoding	Draft model or EAGLE-3 for 2-3x decode speedup
Chunked Prefill	Split long prompts into chunks, interleave with decode — stabilizes P99
Tensor Parallelism	Split one model across many GPUs (NVLink recommended)
Pipeline Parallelism	Split layers across GPUs — can cross nodes
Multi-LoRA	Serve many LoRA adapters concurrently
Structured Output	xgrammar / outlines integration
Guided Decoding	JSON schema, regex, context-free grammar
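Of the features above, chunked prefill is the easiest to misread, so here is a minimal scheduling sketch. It assumes a per-step token budget (the role a setting like vLLM's max_num_batched_tokens plays) and decode-first priority; the function and request names are hypothetical and this is not vLLM's scheduler code.

```python
def schedule_step(prefill_remaining: dict[str, int],
                  decoding: list[str],
                  token_budget: int) -> dict[str, int]:
    """Plan one engine step: decode tokens first, then fill the budget with prefill chunks."""
    plan: dict[str, int] = {}
    budget = token_budget
    for req in decoding:                 # every running request decodes one token
        if budget == 0:
            break
        plan[req] = 1
        budget -= 1
    for req, remaining in prefill_remaining.items():
        if budget == 0:
            break
        chunk = min(remaining, budget)   # take only a slice of a long prompt
        plan[req] = chunk
        budget -= chunk
    return plan

prefill = {"long-prompt": 5000}
decoding = ["chat-1", "chat-2", "chat-3"]
steps = 0
while prefill["long-prompt"] > 0:
    plan = schedule_step(prefill, decoding, token_budget=512)
    prefill["long-prompt"] -= plan["long-prompt"]
    steps += 1
print(steps, plan)
```

The 5,000-token prompt is absorbed over 10 steps, and the three chat requests keep receiving a token on every one of them. That is the mechanism behind the P99 stabilization: no decode ever waits behind a full-length prefill.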

Supported models: Llama 3.x and 4, Qwen 3, Mistral, DeepSeek V3 and R1, Gemma 3, Phi-4, Mixtral, Command-R, GPT-OSS, Granite — over 100 in total.

Typical throughput (H100 80GB, Llama 3.1 8B FP16):

vLLM V1: roughly 6,000-8,000 tok/s (batched, short sequences).
Single request decode: roughly 130 tok/s.

4. SGLang — Strong on Prefix and Structured Output

SGLang was announced by LMSYS Org in 2024 and by 2026, with version 0.4, competes head to head with vLLM V1.

Core features:

RadixAttention — KV cache organized as a tree, auto-sharing prefixes. Faster than vLLM Prefix Caching on workloads heavy in system prompts and few-shot.
Structured Output — regex, JSON Schema, EBNF, Choice, gen() DSL as first-class citizens. Near-zero cost via xgrammar integration.
OpenAI Compatible Server — /v1/chat/completions plus extended /generate.
DP Attention — data-parallel attention, optimized for DeepSeek V3 MLA.
Day-0 model support — Llama 4, DeepSeek R1, Qwen 3 supported on launch day as a habit.
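The tree-shaped cache behind RadixAttention can be approximated with a plain token trie. The sketch below is a simplification under stated assumptions: one token per edge, no eviction, and hypothetical class names. A real radix tree compresses runs of tokens into single edges and stores KV page references at the nodes.

```python
class PrefixTreeNode:
    def __init__(self):
        self.children: dict[int, "PrefixTreeNode"] = {}

class PrefixCache:
    """Token-level trie; a real radix tree would compress runs of tokens into one edge."""

    def __init__(self):
        self.root = PrefixTreeNode()

    def match_length(self, tokens: list[int]) -> int:
        """Number of leading tokens whose KV is already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched += 1
        return matched

    def insert(self, tokens: list[int]) -> None:
        """Record that KV for this token sequence now exists."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixTreeNode())

cache = PrefixCache()
system_prompt = list(range(1000))           # stand-in token ids for a shared system prompt
cache.insert(system_prompt + [7001, 7002])  # first request
hit = cache.match_length(system_prompt + [8001, 8002, 8003])
print(hit)
```

The second request reuses all 1,000 system-prompt tokens and only has to prefill its own three, which is why workloads with shared few-shot examples or agent scaffolding benefit most.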

SGLang shines on agentic and RAG workloads with heavy prefix overlap. It is sometimes claimed that Anthropic's Claude API uses a similar tree-shaped cache internally, but this is unconfirmed.

5. TensorRT-LLM — King on Blackwell and H100

TensorRT-LLM is NVIDIA's inference engine. The library itself is Apache 2.0 and Triton Inference Server is open source as well, but the supported production packaging, NIM, is a commercial product, which gives the stack a closed-source flavor.

Why it is fast:

Built on the TensorRT runtime, models compile to engine "plan" files.
CUDA kernels hand-tuned for H100, H200, B100, B200.
Native FP8 (Hopper) and FP4 (Blackwell) support.
In-flight batching, paged KV, speculative decoding (Medusa, EAGLE).

Throughput benchmark (Llama 3.1 70B, 8x H100, FP8):

TensorRT-LLM: about 13,000-15,000 tok/s.
vLLM: about 9,000-11,000 tok/s.

The gap varies by workload and model, plus or minus 30 percent. On Blackwell, FP4 widens it.

Downsides: Engine builds are heavy (recompile per model, seq length, batch), debugging is hard, model additions track NVIDIA's roadmap. The clean path is to package it as NIM (NVIDIA Inference Microservices) and ship as containers.

6. Hugging Face TGI 3.x — Rewritten in Rust

Text Generation Inference (TGI) is HF's official inference server. It is the backend for HF Inference Endpoints.

TGI 3.x (2025-2026) changes:

Router and launcher are Rust; model execution is PyTorch.
vLLM and TensorRT-LLM backends are selectable — TGI is evolving into a "frontend plus routing plus multi-backend orchestrator."
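3.0 long-context performance: HF claims TGI answers up to 13x faster than vLLM on very long prompts (200K+ tokens), largely because the earlier conversation stays in the prefix cache, and fits about 3x more tokens on the same GPU (HF blog, December 2024).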
gRPC, REST, and Messages API.
Prefix Caching, Flash Attention 2 and 3, paged KV are default.

The strength is HF ecosystem integration — clean handoff with transformers, datasets, AutoTrain. Absolute throughput is a little below vLLM and SGLang.

7. llama.cpp — A Universe Built by One Person

llama.cpp is Georgi Gerganov's pure C/C++ inference engine, started in 2023. No dependencies. Builds anywhere.

Why overwhelmingly popular:

4-bit quantized models run on CPU alone. M2/M3/M4 Macs, Raspberry Pi, Android, iOS, even WASM.
GGUF format — packs metadata, weights, tokenizer, and chat template into one file. The 2026 de facto standard.
Backends: CPU (AVX2, AVX-512, NEON), CUDA, Metal (Apple), Vulkan, SYCL (Intel GPU), HIP (AMD), Kompute.
Quantization: Q2_K through Q8_0, importance-aware IQ1_S to IQ4_XS, K-quants.
ggml — the backend tensor library, the heart of llama.cpp.
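Block-wise quantization is what makes 4-bit CPU inference workable. The sketch below shows the general idea only: symmetric 4-bit values with one scale per block of 32 weights. It is an assumption-laden simplification, not the exact GGUF bit layout or llama.cpp's rounding rules.

```python
def quantize_block(weights: list[float]) -> tuple[float, list[int]]:
    """Symmetric 4-bit quantization of one block: one float scale + small integers."""
    amax = max(abs(w) for w in weights)
    scale = amax / 7 if amax > 0 else 1.0      # map [-amax, amax] onto [-7, 7]
    quants = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, quants

def dequantize_block(scale: float, quants: list[int]) -> list[float]:
    """Recover approximate weights from the stored scale and 4-bit integers."""
    return [q * scale for q in quants]

BLOCK = 32  # weights per block (assumed, typical for block-wise formats)
weights = [((i * 37) % 64 - 32) / 32 for i in range(BLOCK)]  # deterministic toy data
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
bits_per_weight = (BLOCK * 4 + 16) / BLOCK   # 4-bit values plus one FP16 scale
print(max_error, bits_per_weight)
```

The per-block scale is why a "4-bit" format costs about 4.5 bits per weight in practice, and why the rounding error is bounded by half a quantization step within each block.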

Limits: Single-GPU prefill is slower than vLLM, and continuous batching was a late addition (the -cb flag). Still, for local inference llama.cpp remains the default choice.

Ollama, LM Studio, Jan, GPT4All, Kobold.cpp, text-generation-webui — most of them are llama.cpp underneath.

8. MLX and MLX-LM — Apple Silicon's Answer

MLX is Apple's first-party PyTorch alternative. It treats Unified Memory as a first-class citizen — CPU and GPU see the same memory.

MLX differentiators:

Lazy evaluation, dynamic graph.
Runs on the CPU and the GPU (via Metal) on M-series chips; MLX does not target the Neural Engine.
NumPy-compatible API, PyTorch-like nn.Module.

MLX-LM: mlx-lm provides a Hugging Face Transformers-style interface for LLMs and supports inference, training, and fine-tuning.

Why it matters: An M3 Ultra Mac Studio with 512GB can run Llama 3.1 405B at 4-bit. One box. A model that needs 8x H100 runs on a desktop. Throughput is one-fifth to one-tenth of an H100, but for local inference, Apple is the only credible competitor to NVIDIA.

Comparison: llama.cpp Metal backend vs MLX-LM

llama.cpp Metal: stable, broad quantization, GGUF.
MLX-LM: faster prefill, trainable, MLX quantization (4-bit and 8-bit).

9. mistral.rs — A Multi-Model Engine in Rust

mistral.rs started as Eric Buehler's one-person project and by 2026 is a serious option.

Features:

Pure Rust, built on Hugging Face's candle framework.
Quantization: GGUF, GGML, ISQ (in-situ quantization at load time).
Multi-model concurrent serving (multi-adapter).
Vision model support (LLaVA, Phi Vision, Llama 3.2 Vision, Pixtral, Qwen2-VL).
OpenAI API compatible.
CUDA, Metal, Accelerate, MKL backends.

Why Rust: memory safety, no GC, small container images (tens of MB), fast cold start. A great fit for serverless inference.

It sits between llama.cpp and vLLM — not as featherweight as llama.cpp, but efficient on GPU for quantized models, and not as heavy as vLLM.

10. DeepSpeed-MII and DeepSpeed Inference — Microsoft's Answer

DeepSpeed-MII is the inference library from Microsoft's DeepSpeed team.

Key features:

Blocked KV Cache (analogous to vLLM's PagedAttention).
Continuous Batching plus Dynamic SplitFuse — splits long prompts at token granularity and interleaves with decode.
Tensor Parallelism — the core strength of DeepSpeed Inference.
ZeRO-Inference — offload KV and weights to CPU or NVMe. Squeeze big models into small GPUs.

Why interesting: DeepSpeed started as a training framework, MII is its inference specialization. Parts of Microsoft's own inference (Bing, Copilot) are reported to use MII.

The downside: narrower model support and slower development pace than vLLM and SGLang.

11. Aphrodite Engine — vLLM Fork for Quantization

Aphrodite Engine is PygmalionAI's quantization-friendly fork of vLLM.

Aphrodite vs vLLM:

All quantization formats: GPTQ, AWQ, EXL2, GGUF, SqueezeLLM, Marlin, FP8, FP6, FP5, FP4, AQLM, HQQ.
Better support on older GPUs (back to Turing and Volta).
LoRA on top of quantized models.
Extra decoding samplers (DRY, Mirostat, XTC, Smoothing) — character-chat specialization.

Who uses it: the local hosting community, character AI sites, and teams that take quantized serving seriously. vLLM is mainstream; Aphrodite is the quantization specialty shop.

12. CTranslate2, ExLlamaV3, OpenVINO — Special-Purpose Engines

CTranslate2 (ctranslate2) — OpenNMT team's C++ engine. Built for NMT (translation) and Transformer models. Very fast on CPU, strong INT8/INT16 quantization. faster-whisper (a fast Whisper variant) sits on top of CTranslate2.

ExLlamaV3 (exllamav3) — turboderp's NVIDIA Turing-plus optimized engine. The EXL3 quantization format succeeds EXL2. Best at squeezing 4-bit models into minimal memory. One of the default backends in text-generation-webui.

OpenVINO (openvino) plus Intel Neural Compressor — Intel CPU, GPU, and NPU inference. Uses Intel Arc and Core Ultra NPUs. ONNX conversion, INT8 and INT4 quantization. Good for running LLMs on datacenter Xeons or Lunar Lake laptops.

AWS Inferentia and Trainium plus Neuron SDK: AWS's own accelerators. Inf2 (Inferentia2) is inference-dedicated; Trn2 (Trainium2) targets training but also serves inference. The Neuron SDK compiles PyTorch models to run on both. Serve big models like Llama 3.1 405B on a Trn2 UltraServer. SageMaker integration. In AWS environments, cost and power-per-token can beat GPUs.

13. Triton Inference Server — The Production Wrapper

Triton is NVIDIA's general-purpose inference server. Not LLM-only — it serves every model type (CNN, BERT, XGBoost, TensorRT-LLM, vLLM) behind one interface.

Triton's role:

Backend abstraction — TensorRT-LLM, vLLM, ONNX, Python, PyTorch under one Triton server.
Model ensembles — tokenizer to model to post-processing as a pipeline.
Dynamic batching — per-model batching policy.
Metrics (Prometheus, OpenTelemetry).
A/B testing, canary, model versioning.

NVIDIA NIM is essentially "Triton plus TensorRT-LLM plus a container plus a Helm chart."

14. The Quantization Zoo

Engine choice is inseparable from quantization format compatibility.

Format	Bits	Main engines	Notes
FP16 / BF16	16	All	Baseline, near-zero loss
FP8 (E4M3 / E5M2)	8	TensorRT-LLM, vLLM	Native on H100 and H200
FP4 (E2M1)	4	TensorRT-LLM (Blackwell)	Native on B100 and B200
MXFP4	4	TensorRT-LLM	Microscaling FP4 (small blocks of values share one scale)
GGUF	2-8	llama.cpp, mistral.rs, Aphrodite	Local standard
GPTQ	4	vLLM, TGI, Aphrodite	Calibration-based, fast
AWQ	4	vLLM, TGI, Aphrodite	Activation-aware, accurate
EXL2 / EXL3	1.5-8	ExLlama	Variable bits, NVIDIA only
EETQ	8	TGI	Per-channel INT8, weight-only
BitNet 1.58	1.58	bitnet.cpp	Ternary (-1, 0, 1)
HQQ	2-8	Aphrodite, transformers	Calibration-free
AQLM	2	Aphrodite	Extreme 2-bit compression

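Rule of thumb: weight footprint is roughly parameters times bits divided by 8. Quantize Llama 3 70B from FP16 to 4-bit and the weights drop from about 140GB to about 35GB. That still does not fit a single RTX 3090 (24GB); it needs two 24GB cards, a 48GB card, or a more aggressive quantization of around 2.5 bits per weight. You pick quantization knowing your VRAM ceiling.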
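A quick way to check a VRAM budget is to compute weight bytes as parameters times bits per weight divided by 8. The helper below is a rough sketch: it ignores KV cache and activation memory, and the 2GB headroom is an assumed figure.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight footprint in GB (10^9 bytes), ignoring KV cache and runtime overhead."""
    return params_billion * bits_per_weight / 8

def fits(params_billion: float, bits_per_weight: float, vram_gb: float,
         headroom_gb: float = 2.0) -> bool:
    """True if the weights fit with some headroom left (headroom is an assumed figure)."""
    return weight_gb(params_billion, bits_per_weight) + headroom_gb <= vram_gb

for bits in (16, 8, 4, 2.5):
    print(bits, weight_gb(70, bits), fits(70, bits, vram_gb=24))
```

For a 70B model this gives 140GB at 16 bits, 70GB at 8 bits, 35GB at 4 bits, and about 22GB at 2.5 bits, so only the last one fits a 24GB card.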

15. KV Cache Management — The Second Memory War

After weights, KV cache is the next memory hog. Llama 3.1 70B at 128K context consumes about 40GB of FP16 KV for a single sequence.
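The 40GB figure follows from the model shape. The sketch below assumes a Llama-3-70B-style configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128); the function name is illustrative.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: float) -> float:
    """KV bytes for one sequence: keys and values for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * context_tokens * bytes_per_value

# Assumed Llama-3-70B-style shape: 80 layers, 8 KV heads (GQA), head dim 128.
SHAPE = dict(num_layers=80, num_kv_heads=8, head_dim=128)
GIB = 1024 ** 3

fp16 = kv_cache_bytes(**SHAPE, context_tokens=128 * 1024, bytes_per_value=2)
fp8 = kv_cache_bytes(**SHAPE, context_tokens=128 * 1024, bytes_per_value=1)
print(fp16 / GIB, fp8 / GIB)
```

Under these assumptions one 128K-token sequence needs 40 GiB of KV in FP16 and 20 GiB in FP8, which is why KV precision matters as much as weight precision at long context.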

KV quantization techniques:

Technique	Description
FP8 KV	Store KV in FP8; supported in vLLM and TensorRT-LLM (opt-in in vLLM)
INT8 KV	More aggressive compression, mild loss
KIVI	2-bit asymmetric KV — academic
Quantized KV (vLLM)	User flag `kv_cache_dtype=fp8_e4m3`
Paged KV	PagedAttention removes fragmentation (Section 2)
Prefix Cache	Reuse KV across shared prefixes
Sliding Window	Drop KV outside the window (Mistral and others)
Compressed	H2O, SnapKV select important tokens

To really cut inference cost in 2026, FP8 KV plus Prefix Cache is effectively mandatory, plus Sliding Window on models whose architecture supports it.

16. Speculative Decoding — Decode 2-3x Faster

Decode is memory-bound — compute on the GPU sits idle. Speculative decoding uses that slack.

Idea: A small "draft" model proposes N tokens quickly. A large "target" model verifies all N in one pass. Matching prefixes are accepted; mismatches are dropped.
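The propose-then-verify loop can be sketched for the greedy case. The two "models" here are toy functions invented for illustration, and the verification loop emulates what a real engine computes in a single batched forward pass of the target model.

```python
from typing import Callable

NextToken = Callable[[list[int]], int]  # greedy next-token function over a token prefix

def speculative_step(prefix: list[int], draft: NextToken, target: NextToken,
                     num_draft: int) -> list[int]:
    """One greedy speculative step; returns the tokens actually appended."""
    # 1. The draft model proposes num_draft tokens autoregressively.
    proposed: list[int] = []
    for _ in range(num_draft):
        proposed.append(draft(prefix + proposed))
    # 2. The target checks every position. A real engine scores all positions
    #    in one batched forward pass; this loop only emulates that result.
    accepted: list[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)   # replace the first mismatch with the target's token
            return accepted
        accepted.append(tok)
    accepted.append(target(prefix + accepted))  # all matched: one bonus token
    return accepted

# Toy models: the target counts upward; the draft agrees except after multiples of 5.
target = lambda seq: seq[-1] + 1
draft = lambda seq: seq[-1] + (2 if seq[-1] % 5 == 0 else 1)

out = speculative_step([1], draft, target, num_draft=4)
print(out)
```

Every step emits at least one token chosen by the target, so the output is identical to plain greedy decoding with the target alone; the draft only changes how many tokens each target pass yields.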

Main variants:

Method	Description	Speedup
Draft plus Target	Separate small model (Llama 3 8B plus Llama 3 70B)	2-3x
Medusa	Add multiple LM heads to the model	2x
EAGLE / EAGLE-2 / EAGLE-3	Feature-level draft, higher acceptance	3-4x
Lookahead	n-gram self-speculation	1.5-2x
Prompt Lookup	Copy tokens from input	Strong on code and summaries
ReDrafter	Apple's RNN draft head, supported in TensorRT-LLM	2-3x

vLLM, SGLang, and TensorRT-LLM all support EAGLE-3 as of 2025-2026. With acceptance above 80 percent, decode runs close to 3x faster.
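The link between acceptance rate and speedup can be estimated with the standard geometric-series formula. The sketch assumes each draft token is accepted independently with the same probability and treats the draft model as free, so it gives an upper bound rather than a measured speedup.

```python
def expected_tokens_per_pass(acceptance: float, num_draft: int) -> float:
    """Expected tokens emitted per target pass, assuming each draft token is
    accepted independently with the same probability (a simplifying assumption)."""
    if acceptance >= 1.0:
        return num_draft + 1
    return (1 - acceptance ** (num_draft + 1)) / (1 - acceptance)

for a in (0.6, 0.8, 0.9):
    print(a, round(expected_tokens_per_pass(a, num_draft=4), 2))
```

At 80 percent acceptance with four draft tokens, each target pass yields about 3.36 tokens on average, which is where the "close to 3x" figure comes from once draft overhead is subtracted.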

17. Disaggregated Inference — Split Prefill and Decode

The biggest architectural shift of 2024-2026.

Problem: Prefill is compute-bound, decode is memory-bound. Mix them on the same GPU and neither is efficient. Continuous batching helps, but P99 still wobbles.

Solution: Physically split a Prefill cluster from a Decode cluster. Prefill GPUs produce KV and ship it over a fast network (RDMA, NVLink, InfiniBand) to Decode GPUs. Decode GPUs only decode.

Notable systems:

Mooncake (Moonshot AI, Kimi) — KV cache separated into a distributed cache.
Splitwise (Microsoft Research) — Hopper and Ampere mixed by role.
vLLM Disaggregated Prefill — experimental in V1.
NVIDIA Dynamo (2025) — NVIDIA's disaggregated serving framework.

Why it pays: Prefill GPUs can be H100 while decode GPUs are cheaper L40S or A10. You can mix GPU classes per phase. Reports at scale show 30-50 percent cost savings.
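The cost of disaggregation is the KV transfer between the two pools, and a back-of-envelope estimate shows why the network matters. The figures below are assumptions: 320 KiB of FP16 KV per token (a Llama-3-70B-style shape) and a 400 Gb/s link treated as about 50 GB/s of usable bandwidth.

```python
def kv_transfer_ms(prompt_tokens: int, kv_bytes_per_token: int,
                   link_gbytes_per_s: float) -> float:
    """Time to ship one request's KV cache from a prefill GPU to a decode GPU."""
    total_bytes = prompt_tokens * kv_bytes_per_token
    return total_bytes / (link_gbytes_per_s * 1e9) * 1e3

# Assumed figures: 320 KiB of FP16 KV per token (a Llama-3-70B-style model with
# 80 layers, 8 KV heads, head dim 128) and a 400 Gb/s link, about 50 GB/s.
KV_PER_TOKEN = 2 * 80 * 8 * 128 * 2
for prompt in (1_000, 8_000, 32_000):
    print(prompt, round(kv_transfer_ms(prompt, KV_PER_TOKEN, 50.0), 1))
```

Under these assumptions a 1,000-token prompt ships in about 6.6 ms, while a 32,000-token prompt takes over 200 ms, so the split only pays off when the interconnect is fast relative to prompt length.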

18. NVIDIA NIM, Triton, Dynamo — The Enterprise Stack

NIM (NVIDIA Inference Microservices): essentially "TensorRT-LLM plus Triton plus model weights" packaged into one container. One docker run and you have an OpenAI-compatible server. Requires NVIDIA AI Enterprise. Listed on Azure, AWS, and GCP marketplaces.

NVIDIA Dynamo (2025): Triton's successor. Disaggregated inference, KV cache routing, GPU pool management as first-class. Open source (Apache 2.0).

Why it matters: When an enterprise asks "deploy Llama 4 on our H100 cluster," the answer is typically NIM. Many of them do not want to operate raw vLLM.

19. Ollama, LM Studio, Jan — Desktop Inference

Ollama (ollama) — a Go-based llama.cpp wrapper. ollama run llama4 downloads and runs in one line. The most popular desktop runner on macOS, Linux, and Windows. Model registry at ollama.com.

LM Studio — Electron GUI on llama.cpp and MLX backends. General-user oriented, with a built-in OpenAI-compatible server.

Jan — fully open-source ChatGPT alternative, llama.cpp, Tabby, or remote API backends. Privacy-first.

GPT4All, Kobold.cpp, text-generation-webui — each carves a category (personal, character, research).

Common thread: under the hood, nearly all of them run llama.cpp or MLX. The differentiation is GUI, model management, and chat interface.

20. Alternative Hardware — Groq, Cerebras, SambaNova

Attempts to break NVIDIA's monopoly. In 2026, the ones putting up real throughput:

Groq LPU (groq.com): Language Processing Unit. Each chip carries about 230MB of on-chip SRAM and no HBM, so a large model is spread across many chips. Result: about 500 tok/s single-stream decode on Llama 3 70B, roughly 5-10x NVIDIA. Limits include context length and narrower model coverage.

Cerebras CS-3 (cerebras.ai): one wafer is one chip, with about 900K AI cores. About 450 tok/s on Llama 3 70B. Accessed mainly via API.

SambaNova SN40L (sambanova.ai) — Reconfigurable Dataflow Unit. Can serve Llama 3.1 405B on a single node. Available via API and on-prem appliance.

These beat NVIDIA on workloads where raw single-stream speed matters above all (real-time voice, code autocomplete, multi-step agents). On cost per unit throughput, NVIDIA still wins.

21. Inference API Pricing — The Self-Host Decision Line

At the fork between self-host and API, price is the deciding variable. Rough rates in May 2026 (USD per million tokens, output basis):

Provider	Model	Price tier
Together.ai	Llama 3.1 70B	about 0.88/M
Fireworks	Llama 3.1 70B	about 0.90/M
DeepInfra	Llama 3.1 70B	about 0.60/M
Replicate	Llama 3.1 70B	about 2.75/M
Anyscale	Llama 3.1 70B	about 1.00/M
Groq	Llama 3 70B	about 0.79/M (speed premium)
SambaNova	Llama 3.1 405B	about 5/M
AWS Bedrock	Claude 3.5 Sonnet	about 15/M (reference)

(Prices are in US dollars per million output tokens; "about 0.88/M" means roughly 0.88 USD per million tokens.)

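As a rough rule, self-hosting starts to win when average load exceeds about 30 percent of a GPU's capacity and you run 24/7. The exact line is the ratio of your self-hosted cost per token at full load to the API price, so cheaper APIs push it higher. Below that line, the API is almost always cheaper.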

22. Self-Hosting ROI — H100 vs H200

Rough North American figures, May 2026:

H100 80GB one-year lease: about 25,000-32,000 per year (pre-volume discount).
H200 141GB one-year lease: about 35,000-45,000 per year.
B200: about 60,000-80,000 per year (limited supply).

Assume Llama 3.1 70B FP8 on 4x H100 with vLLM:

Throughput: about 9,000 tok/s sustained.
Daily tokens: about 780M.
Monthly tokens: about 23B (23,000M).
Monthly cost (4 GPUs x 730 h x about 3.5 per GPU-hour): about 10,000.
Cost per million tokens: about 0.43.

The same workload via API at about 0.88/M: 23,000M times about 0.88 equals about 20,000 per month. Self-hosting is half the price.

Caveat: this assumes 24/7 full utilization. At 50 percent utilization, the cost edge disappears.
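The utilization caveat can be made explicit with the same figures used above (4 GPUs, about 3.5 per GPU-hour, 9,000 tok/s, API at about 0.88 per million tokens). The function names are illustrative, and the model ignores engineering time, networking, and storage.

```python
def self_host_cost_per_million(gpus: int, dollars_per_gpu_hour: float,
                               tokens_per_second: float, utilization: float) -> float:
    """USD per million tokens when the cluster is busy `utilization` of the time."""
    hourly_cost = gpus * dollars_per_gpu_hour
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_cost / tokens_per_hour * 1e6

def break_even_utilization(gpus: int, dollars_per_gpu_hour: float,
                           tokens_per_second: float, api_price_per_million: float) -> float:
    """Utilization at which self-hosting costs the same as the API."""
    full_load = self_host_cost_per_million(gpus, dollars_per_gpu_hour, tokens_per_second, 1.0)
    return full_load / api_price_per_million

full = self_host_cost_per_million(4, 3.5, 9000, utilization=1.0)
half = self_host_cost_per_million(4, 3.5, 9000, utilization=0.5)
print(round(full, 2), round(half, 2), round(break_even_utilization(4, 3.5, 9000, 0.88), 2))
```

At full load the cluster costs about 0.43 per million tokens; at half load that doubles to about 0.86, and the break-even against a 0.88 API sits at roughly 49 percent utilization.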

Effect of H200 and B200: HBM grows from 80GB to 141GB to 192GB, letting bigger models fit on fewer GPUs. More KV fits too, and throughput rises 1.3-1.8x. Unit cost drops further.

23. Korean Inference Infrastructure

Naver Cloud HyperCLOVA X Inference — internal and external inference for the HyperCLOVA X model family. Runs in self-operated datacenters (Chuncheon, Sejong) with H100 and H200 clusters. Public API opened in 2025 (Naver Cloud Platform).

Kakao i Cloud — Kakao's LLM infrastructure. Korean-specialized LLM inference and multimodal (KoCLIP, KaLM).

Upstage (upstage.ai) — Solar model series, Predibase partnership for fine-tuning and serving, AWS Foundry-certified Korean company. Strong in Document AI workflows.

Lablup Backend.AI (lablup.com) — GPU cluster management plus inference serving platform. Operate vLLM, TGI, and Triton via GUI. Widely deployed in government and large-enterprise on-prem GPU clusters.

KT Cloud GPU Farm — integrated 5G plus AI, NVIDIA H100 cluster.

SK Telecom AI Pyramid — inference for the in-house 'A.X' LLM, partnerships with Rakuten in Japan and others.

42dot — Hyundai Motor Group autonomous driving plus LLM, with self-operated GPU infrastructure.

24. Japanese Inference Infrastructure

Sakana AI (sakana.ai) — Tokyo-based AI startup. Evolutionary Model Merging combines many small models. Operates its own inference services.

Preferred Networks (PFN) (preferred.jp) — in-house MN-Core accelerator, the PLaMo Japanese LLM family, inference in self-operated datacenters.

SoftBank — large NVIDIA GPU investment in the Stargate Japan datacenter, inference workloads for Cristal Intelligence (OpenAI partnership).

Rakuten — Rakuten AI 2.0 (Mixtral-based), Mistral partnership, AWS Inferentia and Trainium adoption.

LINE Yahoo — Japan's largest messenger, runs in-house LLM inference (Llama-based plus self-trained) for Yahoo Search and AI.

NTT — tsuzumi models (1B-10B, Japanese-efficient), inference via NTT's own cloud.

25. Engine Selection Guide by Workload

Workload	Recommended engine	Why
General chatbot (open-source 70B)	vLLM	Standard, just works
Agentic and RAG (long prefix)	SGLang	RadixAttention tree cache
Lowest TTFT (voice, realtime)	TensorRT-LLM or Groq	Compiled and specialized
Dev on Mac	MLX-LM or llama.cpp	Unified Memory
Local server (single RTX 3090)	llama.cpp or Aphrodite	4-bit quantization
Big model with few GPUs	DeepSpeed-MII plus ZeRO Inference	NVMe offload
HF ecosystem integration	TGI	Matches Inference Endpoints
Multi-model plus A/B testing	Triton plus vLLM	Model ensembles and versioning
Apple Silicon (personal)	Ollama (llama.cpp)	Easiest
Quantization variety	Aphrodite	Every format
Cloud serverless	Together.ai or Fireworks	API
AWS environments	Bedrock or Inferentia	Neuron SDK
Absolute speed	Groq or Cerebras	Specialized hardware
Japanese	NTT tsuzumi or PLaMo	Tokenizer
Korean	HyperCLOVA X or Solar	Tokenizer

26. Conclusion — Engines Are Decided by Workloads

The conclusion in May 2026 is sharp:

There is no single right engine. The workload picks the engine.
vLLM is a safe default. Works anywhere, biggest community. SGLang when prefix overlap is heavy. TensorRT-LLM when absolute throughput is required.
Local equals llama.cpp or MLX. No other choice.
Quantization is non-optional. The era of serving 70B in FP16 is over.
KV management, speculative decoding, and disaggregation decide cost — more than model choice.
Self-hosting ROI is decided by utilization. Below 30 percent, API wins.
Non-NVIDIA alternatives (Groq, Cerebras, AWS Inferentia, Apple Silicon) finally hold meaningful share.

Next steps: measure your workload, benchmark three candidate engines, fit P99 and cost into the SLA, then ship. Measure, do not guess.

27. References

vLLM: https://github.com/vllm-project/vllm
vLLM paper (PagedAttention, SOSP 2023): https://arxiv.org/abs/2309.06180
SGLang: https://github.com/sgl-project/sglang
SGLang blog (RadixAttention): https://lmsys.org/blog/2024-01-17-sglang/
TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM
Hugging Face TGI: https://github.com/huggingface/text-generation-inference
TGI 3.0 long context blog: https://huggingface.co/blog/tgi-v3-overview
llama.cpp: https://github.com/ggml-org/llama.cpp
ggml: https://github.com/ggml-org/ggml
MLX: https://github.com/ml-explore/mlx
MLX-LM: https://github.com/ml-explore/mlx-lm
mistral.rs: https://github.com/EricLBuehler/mistral.rs
DeepSpeed-MII: https://github.com/deepspeedai/DeepSpeed-MII
Aphrodite Engine: https://github.com/aphrodite-engine/aphrodite-engine
CTranslate2: https://github.com/OpenNMT/CTranslate2
ExLlamaV3: https://github.com/turboderp-org/exllamav3
OpenVINO: https://github.com/openvinotoolkit/openvino
AWS Neuron SDK: https://github.com/aws-neuron/aws-neuron-sdk
Triton Inference Server: https://github.com/triton-inference-server/server
NVIDIA Dynamo: https://github.com/ai-dynamo/dynamo
Mooncake paper: https://arxiv.org/abs/2407.00079
Splitwise paper: https://arxiv.org/abs/2311.18677
EAGLE-3 paper: https://arxiv.org/abs/2503.01840
BitNet b1.58 paper: https://arxiv.org/abs/2402.17764
Orca (continuous batching): https://www.usenix.org/conference/osdi22/presentation/yu
Ollama: https://github.com/ollama/ollama
LM Studio: https://lmstudio.ai
Jan: https://github.com/janhq/jan
Groq: https://groq.com
Cerebras: https://cerebras.ai
SambaNova: https://sambanova.ai
Sakana AI: https://sakana.ai
Preferred Networks: https://www.preferred.jp
Upstage: https://www.upstage.ai
Lablup Backend.AI: https://www.lablup.com