AI Inference Engines 2026 - vLLM · SGLang · llama.cpp · TGI · TensorRT-LLM · MLX · mistral.rs · DeepSpeed-MII · Aphrodite Deep Dive
Author: Youngju Kim (@fjvbn20031)
- Prologue — In 2026, Inference Costs More Than Training
- 1. Why Inference Is the 2026 Battleground
- 2. PagedAttention and Continuous Batching — The Origin Story
- 3. vLLM — The De Facto Standard
- 4. SGLang — Strong on Prefix and Structured Output
- 5. TensorRT-LLM — King on Blackwell and H100
- 6. Hugging Face TGI 3.x — Rewritten in Rust
- 7. llama.cpp — A Universe Built by One Person
- 8. MLX and MLX-LM — Apple Silicon's Answer
- 9. mistral.rs — A Multi-Model Engine in Rust
- 10. DeepSpeed-MII and DeepSpeed Inference — Microsoft's Answer
- 11. Aphrodite Engine — vLLM Fork for Quantization
- 12. CTranslate2, ExLlamaV3, OpenVINO — Special-Purpose Engines
- 13. Triton Inference Server — The Production Wrapper
- 14. The Quantization Zoo
- 15. KV Cache Management — The Second Memory War
- 16. Speculative Decoding — Decode 2-3x Faster
- 17. Disaggregated Inference — Split Prefill and Decode
- 18. NVIDIA NIM, Triton, Dynamo — The Enterprise Stack
- 19. Ollama, LM Studio, Jan — Desktop Inference
- 20. Alternative Hardware — Groq, Cerebras, SambaNova
- 21. Inference API Pricing — The Self-Host Decision Line
- 22. Self-Hosting ROI — H100 vs H200
- 23. Korean Inference Infrastructure
- 24. Japanese Inference Infrastructure
- 25. Engine Selection Guide by Workload
- 26. Conclusion — Engines Are Decided by Workloads
- 27. References
Prologue — In 2026, Inference Costs More Than Training
LLM engineering in 2023 was "which model do we use." LLM engineering in 2026 is "how do we serve that model."
The reason is simple. You train once, but inference happens on every request. Run the same Llama 3.1 405B and your per-token cost differs by 5-10x depending on engine choice. Between a team that uses one H100 well and one that uses it poorly, the throughput gap can reach 30x.
GPUs are expensive. Used wrong, they are more expensive. The inference engine decides GPU ROI.
This piece dissects the inference engine landscape as of May 2026. vLLM, SGLang, TensorRT-LLM, TGI, llama.cpp, MLX, mistral.rs, DeepSpeed-MII, Aphrodite, CTranslate2, ExLlamaV3, OpenVINO, AWS Neuron, Triton — and the technologies underneath: PagedAttention, Continuous Batching, Speculative Decoding, Disaggregated Inference, KV quantization. Plus Korean and Japanese inference infrastructure and self-hosting ROI.
1. Why Inference Is the 2026 Battleground
In 2024 OpenAI's inference spend was reportedly about 3x its training cost. In 2026 that gap has widened. Training is one-shot. Inference is forever.
The four determinants of inference cost:
| Factor | Meaning | Impact |
|---|---|---|
| TTFT (Time To First Token) | Time until first token | UX, critical for chat |
| TPS (Tokens Per Second) | Output tokens per second for a single request | Felt generation speed |
| Throughput | Total tokens per second across all concurrent requests | Per-GPU capacity, unit cost |
| Latency P99 | 99th percentile response time | Tail latency, SLA |
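All four metrics in the table can be derived from per-request timestamps. The sketch below assumes a hypothetical `RequestTrace` record and a nearest-rank percentile; the field names are illustrative and do not match any particular engine's log format.

```python
import math
from dataclasses import dataclass

@dataclass
class RequestTrace:
    # Hypothetical fields; all timestamps are in seconds.
    arrival: float
    first_token: float
    finish: float
    output_tokens: int

def ttft(r: RequestTrace) -> float:
    """Time to first token for one request."""
    return r.first_token - r.arrival

def tps(r: RequestTrace) -> float:
    """Per-request decode speed: tokens after the first, per second."""
    return (r.output_tokens - 1) / (r.finish - r.first_token)

def throughput(traces: list[RequestTrace]) -> float:
    """Aggregate output tokens per second over the whole run."""
    start = min(r.arrival for r in traces)
    end = max(r.finish for r in traces)
    return sum(r.output_tokens for r in traces) / (end - start)

def p99_latency(traces: list[RequestTrace]) -> float:
    """Nearest-rank 99th percentile of end-to-end latency."""
    latencies = sorted(r.finish - r.arrival for r in traces)
    rank = math.ceil(0.99 * len(latencies))
    return latencies[rank - 1]

traces = [
    RequestTrace(arrival=0.0, first_token=0.2, finish=2.2, output_tokens=101),
    RequestTrace(arrival=0.5, first_token=0.9, finish=4.9, output_tokens=101),
]
print(ttft(traces[0]), tps(traces[0]), throughput(traces), p99_latency(traces))
```

With only two requests the P99 is simply the slowest one; the percentile becomes meaningful once a benchmark run has hundreds of requests.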
Engine choice is a trade-off across these four. Raise throughput and P99 breaks; lower TTFT and TPS suffers. An inference engine is essentially a choice of which trade-off to optimize for which workload.
+--------------------------------------------+
| Throughput ----> vLLM, SGLang |
| Lowest TTFT ----> TensorRT-LLM |
| Local / CPU ----> llama.cpp |
| Apple Silicon ----> MLX-LM |
| Edge / quant ----> Aphrodite |
| Prod wrapper ----> Triton, TGI |
| Serverless API ----> Together, Fireworks |
+--------------------------------------------+
2. PagedAttention and Continuous Batching — The Origin Story
The inference world is split into before and after vLLM's 2023 SOSP paper (Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention").
Problem: KV cache length varies per sequence. Pre-allocating contiguous memory wastes 30-80 percent (fragmentation). Mix short and long requests and it gets worse.
Solution: Slice KV cache into fixed-size pages like an OS virtual memory system and map via page tables. Fragmentation drops near zero.
Old way:                    PagedAttention:
[KKKK_____] (50% wasted)    [P1][P2][P3] (page table)
[KKK______] (60% wasted)    [P4][P5]     (allocate on demand)
[KKKKKKKK_] (10% wasted)
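The page-table idea can be sketched in a few lines. This is a toy allocator, not vLLM's implementation: the class name, method names, and block size are all assumptions made for illustration. It shows the two properties that matter here, which are that blocks are allocated on demand and that waste is limited to the unused tail of each sequence's last block.

```python
class PagedKVAllocator:
    """Toy block allocator: maps each sequence to a list of physical block ids."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))    # physical block pool
        self.block_tables: dict[str, list[int]] = {}  # sequence id -> page table
        self.seq_lens: dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        """Reserve KV space for one more token, allocating a block on demand."""
        length = self.seq_lens.get(seq_id, 0)
        needs_block = length % self.block_size == 0   # last block full, or none yet
        if needs_block and not self.free_blocks:
            raise MemoryError("no free KV blocks; caller must preempt or queue")
        table = self.block_tables.setdefault(seq_id, [])
        if needs_block:
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def physical_slot(self, seq_id: str, pos: int) -> tuple[int, int]:
        """Translate a logical token position to (block id, offset in block)."""
        return self.block_tables[seq_id][pos // self.block_size], pos % self.block_size

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Internal fragmentation: unused slots in each sequence's last block."""
        return sum(len(table) * self.block_size - self.seq_lens[seq]
                   for seq, table in self.block_tables.items())

alloc = PagedKVAllocator(num_blocks=8, block_size=4)
for _ in range(5):
    alloc.append_token("a")
for _ in range(3):
    alloc.append_token("b")
print(alloc.block_tables, alloc.wasted_slots())
```

Waste per sequence is bounded by `block_size - 1` slots regardless of how long the sequence grows, which is why fragmentation drops to near zero.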
On top of this comes Continuous Batching (Orca paper, 2022). Static batching waits until every request in the batch finishes. Continuous batching slots in new requests at token granularity — the GPU never sits idle.
These two are table stakes in 2026. Without them, you do not have a serious inference engine.
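To see why continuous batching matters, it helps to count decode steps. The simulation below is a deliberately simplified model (one token per request per step, no prefill cost, hypothetical function names) comparing the two scheduling policies on a mix of short and long outputs.

```python
from collections import deque

def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """After every decode step, finished requests leave and queued ones join."""
    waiting = deque(lengths)
    running: list[int] = []           # remaining tokens per active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1                    # one forward pass emits one token per request
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [100, 10, 10, 10, 100, 10, 10, 10]   # output lengths in tokens
print(static_batching_steps(lengths, 4), continuous_batching_steps(lengths, 4))
```

On this input the static policy needs 200 steps and the continuous policy 110, because short requests no longer hold their batch slots hostage until the longest request finishes.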
3. vLLM — The De Facto Standard
vLLM began in 2023 at UC Berkeley Sky Computing Lab. By 2026 it is effectively the default for open-source inference. Governance moved to the LF AI and Data Foundation in 2025.
vLLM V1 engine (0.7+): Released in 2025, V1 is 1.5-2x faster than V0, with async scheduling, chunked prefill, torch.compile integration, and built-in multimodal support.
Key features:
| Feature | Description |
|---|---|
| Prefix Caching | Cache KV of shared prompts — 90 percent TTFT reduction when reusing system prompts |
| Speculative Decoding | Draft model or EAGLE-3 for 2-3x decode speedup |
| Chunked Prefill | Split long prompts into chunks, interleave with decode — stabilizes P99 |
| Tensor Parallelism | Split one model across many GPUs (NVLink recommended) |
| Pipeline Parallelism | Split layers across GPUs — can cross nodes |
| Multi-LoRA | Serve many LoRA adapters concurrently |
| Structured Output | xgrammar / outlines integration |
| Guided Decoding | JSON schema, regex, context-free grammar |
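Chunked prefill is easiest to understand as a per-step token budget. The sketch below is a simplified model, not vLLM's scheduler: running decodes each take one token of the budget first, and whatever is left goes to the next chunk of a waiting prompt. The `token_budget` parameter is a hypothetical name (vLLM exposes a comparable knob as `max_num_batched_tokens`).

```python
def schedule_step(decoding: int, prefill_remaining: list[int], token_budget: int) -> list[int]:
    """Return how many prompt tokens of each waiting prefill run this step.

    decoding: number of running sequences, each consuming one token of budget.
    prefill_remaining: prompt tokens still to process, in queue order.
    """
    budget = max(token_budget - decoding, 0)    # decodes are scheduled first
    chunks = []
    for remaining in prefill_remaining:
        take = min(remaining, budget)
        chunks.append(take)
        budget -= take
    return chunks

def steps_to_prefill(prompt_len: int, decoding: int, token_budget: int) -> int:
    """Steps needed to finish one long prompt while decodes keep running."""
    steps, remaining = 0, prompt_len
    while remaining > 0:
        (chunk,) = schedule_step(decoding, [remaining], token_budget)
        if chunk == 0:
            raise ValueError("token budget leaves no room for prefill")
        remaining -= chunk
        steps += 1
    return steps

print(schedule_step(48, [8000, 500], token_budget=2048))
print(steps_to_prefill(8000, decoding=48, token_budget=2048))
```

An 8,000-token prompt is spread over several steps instead of stalling the 48 running decodes for one very long step, which is what keeps P99 stable.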
Supported models: Llama 3.x and 4, Qwen 3, Mistral, DeepSeek V3 and R1, Gemma 3, Phi-4, Mixtral, Command-R, GPT-OSS, Granite — over 100 in total.
Typical throughput (H100 80GB, Llama 3.1 8B FP16):
- vLLM V1: roughly 6,000-8,000 tok/s (batched, short sequences).
- Single request decode: roughly 130 tok/s.
4. SGLang — Strong on Prefix and Structured Output
SGLang was announced by LMSYS Org in 2024 and by 2026, with version 0.4, competes head to head with vLLM V1.
Core features:
- RadixAttention: KV cache organized as a tree, auto-sharing prefixes. Faster than vLLM Prefix Caching on workloads heavy in system prompts and few-shot.
- Structured Output: regex, JSON Schema, EBNF, Choice, and the gen() DSL as first-class citizens. Near-zero cost via xgrammar integration.
- OpenAI Compatible Server: /v1/chat/completions plus an extended /generate.
- DP Attention: data-parallel attention, optimized for DeepSeek V3 MLA.
- Day-0 model support: Llama 4, DeepSeek R1, Qwen 3 supported on launch day as a habit.
SGLang shines on agentic and RAG workloads with heavy prefix overlap. Anthropic's Claude API reportedly uses a tree-cache internally that resembles SGLang's design (unconfirmed but structurally similar).
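The tree-cache idea can be illustrated with a plain token trie. This is a simplification: an actual radix tree compresses runs of tokens into single edges and needs an eviction policy, both omitted here, and the class and token ids are made up for the example.

```python
class PrefixTrie:
    """Token-level trie; each node stands for the cached KV of one token."""

    def __init__(self):
        self.children: dict[int, "PrefixTrie"] = {}

    def insert(self, tokens: list[int]) -> None:
        """Record that KV for this token sequence is now cached."""
        node = self
        for t in tokens:
            node = node.children.setdefault(t, PrefixTrie())

    def match_len(self, tokens: list[int]) -> int:
        """Length of the longest cached prefix of `tokens`."""
        node, n = self, 0
        for t in tokens:
            node = node.children.get(t)
            if node is None:
                break
            n += 1
        return n

cache = PrefixTrie()
system = [1, 2, 3, 4, 5, 6]             # shared system prompt (made-up token ids)
cache.insert(system + [10, 11])         # first request
request = system + [20, 21, 22]         # second request, same system prompt
hit = cache.match_len(request)
print(f"reuse {hit} tokens, prefill {len(request) - hit}")
```

The second request only has to prefill the tokens after the shared system prompt, which is where the TTFT savings on agentic and RAG workloads come from.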
5. TensorRT-LLM — King on Blackwell and H100
TensorRT-LLM is NVIDIA's inference engine. The library itself is open source under Apache 2.0, but it builds on the proprietary TensorRT runtime, and the NIM packaging around it is commercial.
Why it is fast:
- Built on the TensorRT runtime, models compile to engine "plan" files.
- CUDA kernels hand-tuned for H100, H200, B100, B200.
- Native FP8 (Hopper) and FP4 (Blackwell) support.
- In-flight batching, paged KV, speculative decoding (Medusa, EAGLE).
Throughput benchmark (Llama 3.1 70B, 8x H100, FP8):
- TensorRT-LLM: about 13,000-15,000 tok/s.
- vLLM: about 9,000-11,000 tok/s.
The gap varies by workload and model, plus or minus 30 percent. On Blackwell, FP4 widens it.
Downsides: Engine builds are heavy (recompile per model, seq length, batch), debugging is hard, model additions track NVIDIA's roadmap. The clean path is to package it as NIM (NVIDIA Inference Microservices) and ship as containers.
6. Hugging Face TGI 3.x — Rewritten in Rust
Text Generation Inference (TGI) is HF's official inference server. It is the backend for HF Inference Endpoints.
TGI 3.x (2025-2026) changes:
- Router and launcher are Rust; model execution is PyTorch.
- vLLM and TensorRT-LLM backends are selectable — TGI is evolving into a "frontend plus routing plus multi-backend orchestrator."
- 3.0 long-context: TGI claims up to 13x faster responses than vLLM on very long prompts, a gain that comes largely from reusing the cached conversation prefix (HF blog, December 2024).
- gRPC, REST, and Messages API.
- Prefix Caching, Flash Attention 2 and 3, paged KV are default.
The strength is HF ecosystem integration — clean handoff with transformers, datasets, AutoTrain. Absolute throughput is a little below vLLM and SGLang.
7. llama.cpp — A Universe Built by One Person
llama.cpp is Georgi Gerganov's pure C/C++ inference engine, started in 2023. No dependencies. Builds anywhere.
Why overwhelmingly popular:
- 4-bit quantized models run on CPU alone. M2/M3/M4 Macs, Raspberry Pi, Android, iOS, even WASM.
- GGUF format — packs metadata, weights, tokenizer, and chat template into one file. The 2026 de facto standard.
- Backends: CPU (AVX2, AVX-512, NEON), CUDA, Metal (Apple), Vulkan, SYCL (Intel GPU), HIP (AMD), Kompute.
- Quantization: Q2_K through Q8_0, importance-aware IQ1_S to IQ4_XS, K-quants.
- ggml — the backend tensor library, the heart of llama.cpp.
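The single-file design of GGUF starts with a small fixed header. As a hedged sketch, the parser below reads that header for GGUF version 2 or later, assuming a little-endian file; the function name is hypothetical and the input is a synthetic header built in memory rather than a real model file.

```python
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_header(data: bytes) -> dict[str, int]:
    """Parse the fixed-size header at the start of a GGUF (v2 or later) file."""
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count.
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {"version": version, "tensor_count": tensor_count, "metadata_kv_count": kv_count}

# Synthetic header for illustration; with a real file, pass its first 24 bytes.
fake = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 25)
print(read_gguf_header(fake))
```

The metadata key-value pairs (tokenizer, chat template, architecture parameters) and the tensor descriptors follow this header, which is what lets a runner load a model from one file with no side-car configuration.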
Limits: Single-GPU prefill is slower than vLLM, and Continuous Batching was a late addition (-cb). Still, "local inference means llama.cpp" remains an immovable equation.
Ollama, LM Studio, Jan, GPT4All, Kobold.cpp, text-generation-webui — most of them are llama.cpp underneath.
8. MLX and MLX-LM — Apple Silicon's Answer
MLX is Apple's first-party PyTorch alternative. It treats Unified Memory as a first-class citizen — CPU and GPU see the same memory.
MLX differentiators:
- Lazy evaluation, dynamic graph.
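- Runs on the CPU and the GPU (Metal) of M-series chips; because memory is unified, an operation can target either device without copying arrays. MLX does not use the Neural Engine.
- NumPy-compatible API, PyTorch-like nn.Module.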
MLX-LM: mlx-lm provides a Hugging Face Transformers-style interface for LLMs and supports inference, training, and fine-tuning.
Why it matters: An M3 Ultra Mac Studio with 512GB can run Llama 3.1 405B at 4-bit. One box. A model that needs 8x H100 runs on a desktop. Throughput is one-fifth to one-tenth of an H100, but for local inference, Apple is the only credible competitor to NVIDIA.
Comparison: llama.cpp Metal backend vs MLX-LM
- llama.cpp Metal: stable, broad quantization, GGUF.
- MLX-LM: faster prefill, trainable, MLX quantization (4-bit and 8-bit).
9. mistral.rs — A Multi-Model Engine in Rust
mistral.rs started as Eric Buehler's one-person project and by 2026 is a serious option.
Features:
- Pure Rust, built on candle.
- Quantization: GGUF, GGML, ISQ (in-situ quantization at load time).
- Multi-model concurrent serving (multi-adapter).
- Vision model support (LLaVA, Phi Vision, Llama 3.2 Vision, Pixtral, Qwen2-VL).
- OpenAI API compatible.
- CUDA, Metal, Accelerate, MKL backends.
Why Rust: memory safety, no GC, small container images (tens of MB), fast cold start. A great fit for serverless inference.
It sits between llama.cpp and vLLM — not as featherweight as llama.cpp, but efficient on GPU for quantized models, and not as heavy as vLLM.
10. DeepSpeed-MII and DeepSpeed Inference — Microsoft's Answer
DeepSpeed-MII is the inference library from Microsoft's DeepSpeed team.
Key features:
- Blocked KV Cache (analogous to vLLM's PagedAttention).
- Continuous Batching plus Dynamic SplitFuse — splits long prompts at token granularity and interleaves with decode.
- Tensor Parallelism — the core strength of DeepSpeed Inference.
- ZeRO-Inference — offload KV and weights to CPU or NVMe. Squeeze big models into small GPUs.
Why interesting: DeepSpeed started as a training framework; MII is its inference specialization. Parts of Microsoft's own inference (Bing, Copilot) are reported to use MII.
The downside: narrower model support and slower development pace than vLLM and SGLang.
11. Aphrodite Engine — vLLM Fork for Quantization
Aphrodite Engine is PygmalionAI's quantization-friendly fork of vLLM.
Aphrodite vs vLLM:
- All quantization formats: GPTQ, AWQ, EXL2, GGUF, SqueezeLLM, Marlin, FP8, FP6, FP5, FP4, AQLM, HQQ.
- Better support on older GPUs (back to Turing and Volta).
- LoRA on top of quantized models.
- Extra decoding samplers (DRY, Mirostat, XTC, Smoothing) — character-chat specialization.
Who uses it: the local hosting community, character AI sites, and teams that take quantized serving seriously. vLLM is mainstream; Aphrodite is the quantization specialty shop.
12. CTranslate2, ExLlamaV3, OpenVINO — Special-Purpose Engines
CTranslate2 (ctranslate2) — OpenNMT team's C++ engine. Built for NMT (translation) and Transformer models. Very fast on CPU, strong INT8/INT16 quantization. faster-whisper (a fast Whisper variant) sits on top of CTranslate2.
ExLlamaV3 (exllamav3) — turboderp's NVIDIA Turing-plus optimized engine. The EXL3 quantization format succeeds EXL2. Best at squeezing 4-bit models into minimal memory. One of the default backends in text-generation-webui.
OpenVINO (openvino) plus Intel Neural Compressor — Intel CPU, GPU, and NPU inference. Uses Intel Arc and Core Ultra NPUs. ONNX conversion, INT8 and INT4 quantization. Good for running LLMs on datacenter Xeons or Lunar Lake laptops.
AWS Inferentia and Trainium plus Neuron SDK: AWS's custom accelerators, Inferentia2 (Inf2) for inference and Trainium2 (Trn2) for training and inference. The Neuron SDK compiles PyTorch models to run on them. Serve big models like Llama 3.1 405B on a Trn2 UltraServer. SageMaker integration. In AWS environments, cost and power-per-token can beat GPUs.
13. Triton Inference Server — The Production Wrapper
Triton is NVIDIA's general-purpose inference server. Not LLM-only — it serves every model type (CNN, BERT, XGBoost, TensorRT-LLM, vLLM) behind one interface.
Triton's role:
- Backend abstraction — TensorRT-LLM, vLLM, ONNX, Python, PyTorch under one Triton server.
- Model ensembles — tokenizer to model to post-processing as a pipeline.
- Dynamic batching — per-model batching policy.
- Metrics (Prometheus, OpenTelemetry).
- A/B testing, canary, model versioning.
NVIDIA NIM is essentially "Triton plus TensorRT-LLM plus a container plus a Helm chart."
14. The Quantization Zoo
Engine choice is inseparable from quantization format compatibility.
| Format | Bits | Main engines | Notes |
|---|---|---|---|
| FP16 / BF16 | 16 | All | Baseline, near-zero loss |
| FP8 (E4M3 / E5M2) | 8 | TensorRT-LLM, vLLM | Native on H100 and H200 |
| FP4 (E2M1) | 4 | TensorRT-LLM (Blackwell) | Native on B100 and B200 |
| MXFP4 | 4 | TensorRT-LLM | Microblock FP4 |
| GGUF | 2-8 | llama.cpp, mistral.rs, Aphrodite | Local standard |
| GPTQ | 4 | vLLM, TGI, Aphrodite | Calibration-based, fast |
| AWQ | 4 | vLLM, TGI, Aphrodite | Activation-aware, accurate |
| EXL2 / EXL3 | 1.5-8 | ExLlama | Variable bits, NVIDIA only |
| EETQ | 8 | TGI | Weight-only INT8, calibration-free |
| BitNet 1.58 | 1.58 | bitnet.cpp | Ternary (-1, 0, 1) |
| HQQ | 2-8 | Aphrodite, transformers | Calibration-free |
| AQLM | 2 | Aphrodite | Extreme 2-bit compression |
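Rule of thumb: weight footprint is roughly parameter count times bits per weight divided by 8. Llama 3 70B drops from about 140GB at FP16 to about 35GB at 4-bit. That still does not fit a single RTX 3090 (24GB); it takes two cards (48GB), and at 5-bit the roughly 44GB of weights leaves almost no room for KV cache. You pick quantization knowing your VRAM ceiling.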
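A back-of-the-envelope check for whether a quantized model fits a given card can be written as parameters times bits divided by 8. The sketch below counts weights only; the 2GB headroom for KV cache and activations is an arbitrary assumption, and the 32B model on a 24GB card is a hypothetical example.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only size in GB (10^9 bytes); ignores KV cache and format overhead."""
    return params_billion * bits_per_weight / 8

def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if the weights plus an assumed fixed headroom fit in VRAM."""
    return weight_footprint_gb(params_billion, bits_per_weight) + headroom_gb <= vram_gb

# Hypothetical 32B-parameter model on a 24GB card.
for bits in (16, 8, 5, 4):
    print(bits, weight_footprint_gb(32, bits), fits(32, bits, vram_gb=24))
```

Real quantized files run somewhat larger than this estimate because formats such as GGUF store per-group scales and often keep a few tensors at higher precision.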
15. KV Cache Management — The Second Memory War
After weights, KV cache is the next memory hog. Llama 3.1 70B at 128K context consumes about 40GB of FP16 KV cache for a single sequence (80 layers, 8 KV heads, head dimension 128).
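The 40GB figure can be checked directly from the model shape. The calculation below assumes the published Llama 3 70B configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and a single 128K-token sequence; the function name is illustrative.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size for one sequence: keys and values for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Llama 3 70B shape: 80 layers, 8 KV heads, head dim 128.
fp16 = kv_cache_bytes(80, 8, 128, seq_len=128 * 1024)
fp8 = kv_cache_bytes(80, 8, 128, seq_len=128 * 1024, bytes_per_elem=1)
print(f"FP16: {fp16 / 2**30:.0f} GiB, FP8: {fp8 / 2**30:.0f} GiB")
```

Each token costs 320 KiB at FP16, so KV memory scales linearly with both context length and the number of concurrent sequences; storing KV in FP8 halves it.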
KV cache techniques (quantization and beyond):
| Technique | Description |
|---|---|
| FP8 KV | Store KV in FP8, halving KV memory versus FP16; supported in vLLM and TensorRT-LLM |
| INT8 KV | More aggressive compression, mild loss |
| KIVI | 2-bit asymmetric KV — academic |
| Quantized KV (vLLM) | User flag kv_cache_dtype=fp8_e4m3 |
| Paged KV | PagedAttention removes fragmentation (Section 2) |
| Prefix Cache | Reuse KV across shared prefixes |
| Sliding Window | Drop KV outside the window (Mistral and others) |
| Compressed | H2O, SnapKV select important tokens |
To really cut inference cost in 2026, FP8 KV plus Prefix Cache plus Sliding Window is effectively mandatory.
16. Speculative Decoding — Decode 2-3x Faster
Decode is memory-bound — compute on the GPU sits idle. Speculative decoding uses that slack.
Idea: A small "draft" model proposes N tokens quickly. A large "target" model verifies all N in one pass. Matching prefixes are accepted; mismatches are dropped.
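The propose-then-verify loop is short enough to write out. This sketch covers the greedy case only, with toy stand-in models: `target` and `draft` are hypothetical next-token functions, and the verification loop stands in for what a real engine does in one batched forward pass.

```python
from typing import Callable

NextToken = Callable[[list[int]], int]    # greedy next-token function: context -> token id

def speculative_step(target: NextToken, draft: NextToken,
                     context: list[int], k: int) -> list[int]:
    """One round of greedy speculative decoding; returns the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposal, ctx = [], list(context)
    for _ in range(k):
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)
    # 2. The target model checks every position. A real engine does this in one
    #    batched forward pass; it is a loop here for clarity.
    accepted, ctx = [], list(context)
    for t in proposal:
        expected = target(ctx)
        if expected != t:
            accepted.append(expected)     # replace the first mismatch, drop the rest
            return accepted
        accepted.append(t)
        ctx.append(t)
    accepted.append(target(ctx))          # all k accepted: the target adds a bonus token
    return accepted

# Toy models: the target counts up by 1; the draft agrees except after multiples of 5.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] % 5 else ctx[-1] + 2

print(speculative_step(target, draft, [1], k=4))
print(speculative_step(target, draft, [5], k=4))
```

Every round emits at least one token that the target model itself would have produced, so the output is identical to plain greedy decoding with the target; only the number of target passes changes.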
Main variants:
| Method | Description | Speedup |
|---|---|---|
| Draft plus Target | Separate small model (Llama 3 8B plus Llama 3 70B) | 2-3x |
| Medusa | Add multiple LM heads to the model | 2x |
| EAGLE / EAGLE-2 / EAGLE-3 | Feature-level draft, higher acceptance | 3-4x |
| Lookahead | n-gram self-speculation | 1.5-2x |
| Prompt Lookup | Copy tokens from input | Strong on code and summaries |
| ReDrafter | Apple's RNN draft head, supported in TensorRT-LLM | 2-3x |
vLLM, SGLang, and TensorRT-LLM all support EAGLE-3 as of 2025-2026. With acceptance above 80 percent, decode runs close to 3x faster.
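The link between acceptance rate and speedup follows from a standard simplification: if each drafted token is accepted independently with probability alpha, the expected number of tokens per target pass is a geometric sum. The sketch below uses that model; the draft length of 4 and the relative draft cost of 0.05 are assumed values, not measurements.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass with k drafted tokens.

    Assumes each drafted token is accepted independently with probability alpha.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Speedup over plain decoding; draft_cost is draft time / target time per pass."""
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost + 1)

print(round(expected_tokens_per_pass(0.8, 4), 2), round(speedup(0.8, 4, 0.05), 2))
```

At 80 percent acceptance with four drafted tokens this gives about 3.4 tokens per target pass and, after paying for the draft, a speedup a little under 3x, consistent with the figure above.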
17. Disaggregated Inference — Split Prefill and Decode
The biggest architectural shift of 2024-2026.
Problem: Prefill is compute-bound, decode is memory-bound. Mix them on the same GPU and neither is efficient. Continuous batching helps, but P99 still wobbles.
Solution: Physically split a Prefill cluster from a Decode cluster. Prefill GPUs produce KV and ship it over a fast network (RDMA, NVLink, InfiniBand) to Decode GPUs. Decode GPUs only decode.
Notable systems:
- Mooncake (Moonshot AI, Kimi) — KV cache separated into a distributed cache.
- Splitwise (Microsoft Research) — Hopper and Ampere mixed by role.
- vLLM Disaggregated Prefill — experimental in V1.
- NVIDIA Dynamo (2025) — NVIDIA's disaggregated serving framework.
Why it pays: Prefill GPUs can be H100 while decode GPUs are cheaper L40S or A10. You can mix GPU classes per phase. Reports at scale show 30-50 percent cost savings.
18. NVIDIA NIM, Triton, Dynamo — The Enterprise Stack
NIM (NVIDIA Inference Microservices): essentially "TensorRT-LLM plus Triton plus model weights" packaged into one container. One docker run and you have an OpenAI-compatible server. Requires NVIDIA AI Enterprise. Listed on Azure, AWS, and GCP marketplaces.
NVIDIA Dynamo (2025): Triton's successor. Disaggregated inference, KV cache routing, GPU pool management as first-class. Open source (Apache 2.0).
Why it matters: When an enterprise asks "deploy Llama 4 on our H100 cluster," the answer is typically NIM. Many of them do not want to operate raw vLLM.
19. Ollama, LM Studio, Jan — Desktop Inference
Ollama (ollama) — a Go-based llama.cpp wrapper. ollama run llama4 downloads and runs in one line. The most popular desktop runner on macOS, Linux, and Windows. Model registry at ollama.com.
LM Studio — Electron GUI on llama.cpp and MLX backends. General-user oriented, with a built-in OpenAI-compatible server.
Jan — fully open-source ChatGPT alternative, llama.cpp, Tabby, or remote API backends. Privacy-first.
GPT4All, Kobold.cpp, text-generation-webui — each carves a category (personal, character, research).
Common thread: under the hood, nearly all of them run llama.cpp or MLX. The differentiation is GUI, model management, and chat interface.
20. Alternative Hardware — Groq, Cerebras, SambaNova
Attempts to break NVIDIA's monopoly. In 2026, the ones putting up real throughput:
Groq LPU (groq.com): Language Processing Unit. About 230MB of SRAM sits on each chip and there is no HBM, so a large model is spread across many chips. Result: about 500 tok/s single-stream decode on Llama 3 70B. Roughly 5-10x NVIDIA. Limits include context length and narrower model coverage.
Cerebras CS-3 (cerebras.ai): one wafer is one chip. 900K AI cores. About 450 tok/s on Llama 3 70B. API access only.
SambaNova SN40L (sambanova.ai): Reconfigurable Dataflow Unit. Can serve Llama 3.1 405B on a single node. Available via API and on-prem appliance.
These beat NVIDIA on workloads where speed is the absolute value (real-time voice, code autocomplete, multi-step agents). On cost per unit throughput, NVIDIA still wins.
21. Inference API Pricing — The Self-Host Decision Line
At the fork between self-host and API, price is the deciding variable. Rough rates in May 2026 (USD per million tokens, output basis):
| Provider | Model | Price tier |
|---|---|---|
| Together.ai | Llama 3.1 70B | about 0.88/M |
| Fireworks | Llama 3.1 70B | about 0.90/M |
| DeepInfra | Llama 3.1 70B | about 0.60/M |
| Replicate | Llama 3.1 70B | about 2.75/M |
| Anyscale | Llama 3.1 70B | about 1.00/M |
| Groq | Llama 3 70B | about 0.79/M (speed premium) |
| SambaNova | Llama 3.1 405B | about 5/M |
| AWS Bedrock | Claude 3.5 Sonnet | about 15/M (reference) |
(Prices are in USD per million output tokens; the dollar sign is omitted, so "about 0.88/M" means roughly 0.88 USD per million tokens.)
Self-hosting wins when: daily average load exceeds 30 percent of a GPU's capacity and you run 24/7. Below that line, the API is almost always cheaper.
22. Self-Hosting ROI — H100 vs H200
Rough North American figures, May 2026:
- H100 80GB one-year lease: about 25,000-32,000 per year (pre-volume discount).
- H200 141GB one-year lease: about 35,000-45,000 per year.
- B200: about 60,000-80,000 per year (limited supply).
Assume Llama 3.1 70B FP8 on 4x H100 with vLLM:
- Throughput: about 9,000 tok/s sustained.
- Daily tokens: about 780M.
- Monthly tokens: about 23B (23,000M).
- Monthly cost (4 x H100 x 730 h x about 3.5/h): about 10,000.
- Cost per token: about 0.43/M.
The same workload via API at about 0.88/M: 23,000M times about 0.88 equals about 20,000 per month. Self-hosting is half the price.
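Caveat: this assumes 24/7 full utilization. At 50 percent utilization the same monthly bill buys half the tokens, the self-hosted cost per token doubles to roughly the API price, and the cost edge disappears.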
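The arithmetic above, including the break-even point, fits in two small functions. The inputs are the figures from this section (9,000 tok/s, 4 GPUs, about 3.5 per GPU-hour, API at about 0.88/M); the function names are illustrative.

```python
def self_host_cost_per_m(tok_per_s: float, gpus: int, usd_per_gpu_hour: float,
                         utilization: float = 1.0) -> float:
    """USD per million output tokens for a self-hosted cluster."""
    tokens_per_hour = tok_per_s * 3600 * utilization
    return gpus * usd_per_gpu_hour / tokens_per_hour * 1e6

def break_even_utilization(tok_per_s: float, gpus: int, usd_per_gpu_hour: float,
                           api_usd_per_m: float) -> float:
    """Utilization at which self-hosting costs the same per token as the API."""
    return self_host_cost_per_m(tok_per_s, gpus, usd_per_gpu_hour) / api_usd_per_m

full = self_host_cost_per_m(9_000, gpus=4, usd_per_gpu_hour=3.5)
half = self_host_cost_per_m(9_000, gpus=4, usd_per_gpu_hour=3.5, utilization=0.5)
edge = break_even_utilization(9_000, 4, 3.5, api_usd_per_m=0.88)
print(f"{full:.2f}/M at 100%, {half:.2f}/M at 50%, break-even at {edge:.0%}")
```

The break-even utilization depends heavily on which API price you compare against, so it is worth rerunning the numbers with your own quotes and measured throughput.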
Effect of H200 and B200: HBM grows from 80GB to 141GB to 192GB, letting bigger models fit on fewer nodes. More KV fits too, and throughput rises 1.3-1.8x. Unit cost drops further.
23. Korean Inference Infrastructure
Naver Cloud HyperCLOVA X Inference — internal and external inference for the HyperCLOVA X model family. Runs in self-operated datacenters (Chuncheon, Sejong) with H100 and H200 clusters. Public API opened in 2025 (Naver Cloud Platform).
Kakao i Cloud — Kakao's LLM infrastructure. Korean-specialized LLM inference and multimodal (KoCLIP, KaLM).
Upstage (upstage.ai) — Solar model series, Predibase partnership for fine-tuning and serving, AWS Foundry-certified Korean company. Strong in Document AI workflows.
Lablup Backend.AI (lablup.com) — GPU cluster management plus inference serving platform. Operate vLLM, TGI, and Triton via GUI. Widely deployed in government and large-enterprise on-prem GPU clusters.
KT Cloud GPU Farm — integrated 5G plus AI, NVIDIA H100 cluster.
SK Telecom AI Pyramid — inference for the in-house 'A.X' LLM, partnerships with Rakuten in Japan and others.
42dot — Hyundai Motor Group autonomous driving plus LLM, with self-operated GPU infrastructure.
24. Japanese Inference Infrastructure
Sakana AI (sakana.ai) — Tokyo-based AI startup. Evolutionary Model Merging combines many small models. Operates its own inference services.
Preferred Networks (PFN) (preferred.jp) — in-house MN-Core accelerator, the PLaMo Japanese LLM family, inference in self-operated datacenters.
SoftBank — large NVIDIA GPU investment in the Stargate Japan datacenter, inference workloads for Cristal Intelligence (OpenAI partnership).
Rakuten — Rakuten AI 2.0 (Mixtral-based), Mistral partnership, AWS Inferentia and Trainium adoption.
LINE Yahoo — Japan's largest messenger, runs in-house LLM inference (Llama-based plus self-trained) for Yahoo Search and AI.
NTT — tsuzumi models (1B-10B, Japanese-efficient), inference via NTT's own cloud.
25. Engine Selection Guide by Workload
| Workload | Recommended engine | Why |
|---|---|---|
| General chatbot (open-source 70B) | vLLM | Standard, just works |
| Agentic and RAG (long prefix) | SGLang | RadixAttention tree cache |
| Lowest TTFT (voice, realtime) | TensorRT-LLM or Groq | Compiled and specialized |
| Dev on Mac | MLX-LM or llama.cpp | Unified Memory |
| Local server (single RTX 3090) | llama.cpp or Aphrodite | 4-bit quantization |
| Big model with few GPUs | DeepSpeed-MII plus ZeRO Inference | NVMe offload |
| HF ecosystem integration | TGI | Matches Inference Endpoints |
| Multi-model plus A/B testing | Triton plus vLLM | Model ensembles and versioning |
| Apple Silicon (personal) | Ollama (llama.cpp) | Easiest |
| Quantization variety | Aphrodite | Every format |
| Cloud serverless | Together.ai or Fireworks | API |
| AWS environments | Bedrock or Inferentia | Neuron SDK |
| Absolute speed | Groq or Cerebras | Specialized hardware |
| Japanese | NTT tsuzumi or PLaMo | Tokenizer |
| Korean | HyperCLOVA X or Solar | Tokenizer |
26. Conclusion — Engines Are Decided by Workloads
The conclusion in May 2026 is sharp:
- There is no single right engine. The workload picks the engine.
- vLLM is a safe default. Works anywhere, biggest community. SGLang when prefix overlap is heavy. TensorRT-LLM when absolute throughput is required.
- Local equals llama.cpp or MLX. No other choice.
- Quantization is non-optional. The era of serving 70B in FP16 is over.
- KV management, speculative decoding, and disaggregation decide cost — more than model choice.
- Self-hosting ROI is decided by utilization. Below 30 percent, API wins.
- Non-NVIDIA alternatives (Groq, Cerebras, AWS Inferentia, Apple Silicon) finally hold meaningful share.
Next steps: measure your workload, benchmark three candidate engines, fit P99 and cost into the SLA, then ship. Measure, do not guess.
27. References
- vLLM: https://github.com/vllm-project/vllm
- vLLM paper (PagedAttention, SOSP 2023): https://arxiv.org/abs/2309.06180
- SGLang: https://github.com/sgl-project/sglang
- SGLang blog (RadixAttention): https://lmsys.org/blog/2024-01-17-sglang/
- TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM
- Hugging Face TGI: https://github.com/huggingface/text-generation-inference
- TGI 3.0 long context blog: https://huggingface.co/blog/tgi-v3-overview
- llama.cpp: https://github.com/ggml-org/llama.cpp
- ggml: https://github.com/ggml-org/ggml
- MLX: https://github.com/ml-explore/mlx
- MLX-LM: https://github.com/ml-explore/mlx-lm
- mistral.rs: https://github.com/EricLBuehler/mistral.rs
- DeepSpeed-MII: https://github.com/deepspeedai/DeepSpeed-MII
- Aphrodite Engine: https://github.com/aphrodite-engine/aphrodite-engine
- CTranslate2: https://github.com/OpenNMT/CTranslate2
- ExLlamaV3: https://github.com/turboderp-org/exllamav3
- OpenVINO: https://github.com/openvinotoolkit/openvino
- AWS Neuron SDK: https://github.com/aws-neuron/aws-neuron-sdk
- Triton Inference Server: https://github.com/triton-inference-server/server
- NVIDIA Dynamo: https://github.com/ai-dynamo/dynamo
- Mooncake paper: https://arxiv.org/abs/2407.00079
- Splitwise paper: https://arxiv.org/abs/2311.18677
- EAGLE-3 paper: https://arxiv.org/abs/2503.01840
- BitNet b1.58 paper: https://arxiv.org/abs/2402.17764
- Orca (continuous batching): https://www.usenix.org/conference/osdi22/presentation/yu
- Ollama: https://github.com/ollama/ollama
- LM Studio: https://lmstudio.ai
- Jan: https://github.com/janhq/jan
- Groq: https://groq.com
- Cerebras: https://cerebras.ai
- SambaNova: https://sambanova.ai
- Sakana AI: https://sakana.ai
- Preferred Networks: https://www.preferred.jp
- Upstage: https://www.upstage.ai
- Lablup Backend.AI: https://www.lablup.com