
AI Inference Engines 2026 - vLLM · SGLang · llama.cpp · TGI · TensorRT-LLM · MLX · mistral.rs · DeepSpeed-MII · Aphrodite Deep Dive


Prologue — In 2026, Inference Costs More Than Training

LLM engineering in 2023 was "which model do we use." LLM engineering in 2026 is **"how do we serve that model."**

The reason is simple. You train once, but **inference happens on every request.** Serve the same Llama 3.1 405B and your per-token cost can differ by 5-10x depending on engine choice. Between a team that uses one H100 well and one that uses it poorly, the throughput gap can reach 30x.

> **GPUs are expensive. Used wrong, they are more expensive.** The inference engine decides GPU ROI.

This piece dissects the inference engine landscape as of May 2026. vLLM, SGLang, TensorRT-LLM, TGI, llama.cpp, MLX, mistral.rs, DeepSpeed-MII, Aphrodite, CTranslate2, ExLlamaV3, OpenVINO, AWS Neuron, Triton — and the technologies underneath: PagedAttention, Continuous Batching, Speculative Decoding, Disaggregated Inference, KV quantization. Plus Korean and Japanese inference infrastructure and self-hosting ROI.

1. Why Inference Is the 2026 Battleground

In 2024 OpenAI's inference spend was reportedly about 3x its training cost. In 2026 that gap has widened. Training is one-shot. Inference is forever.

**The four determinants of inference cost:**

| Factor | Meaning | Impact |
| --- | --- | --- |
| TTFT (Time To First Token) | Time from request arrival to the first output token | UX, critical for chat |
| TPS (Tokens Per Second) | Output tokens per second for a single request | Perceived generation speed |
| Throughput | Aggregate output tokens per second across all concurrent requests | Per-GPU capacity, unit cost |
| Latency P99 | 99th-percentile response time | Tail latency, SLA |

Engine choice is a **trade-off** across these four. Raise throughput and P99 breaks; lower TTFT and TPS suffers. **An inference engine is essentially a choice of which trade-off to optimize for which workload.**
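
The four metrics can be computed from nothing more than request start times and per-token arrival times. The sketch below is a minimal illustration, not any engine's benchmark harness; the function names and the nearest-rank percentile definition are assumptions made for this example.

```python
import math

def latency_metrics(request_start, token_times):
    """Return (ttft, tps) for one request.

    request_start: time the request was sent (seconds).
    token_times:   arrival time of each output token (seconds, ascending).
    """
    ttft = token_times[0] - request_start
    # Decode speed is measured after the first token so TTFT does not skew it.
    decode_span = token_times[-1] - token_times[0]
    tps = (len(token_times) - 1) / decode_span if decode_span > 0 else float("nan")
    return ttft, tps

def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def throughput(total_output_tokens, wall_seconds):
    """Aggregate output tokens per second across all requests in the window."""
    return total_output_tokens / wall_seconds
```

Collect `ttft` and end-to-end latency per request, then call `percentile(latencies, 99)` to get the P99 that the SLA is written against.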

| Goal | Typical choice |
| --- | --- |
| Throughput | vLLM, SGLang |
| Lowest TTFT | TensorRT-LLM |
| Local / CPU | llama.cpp |
| Apple Silicon | MLX-LM |
| Edge / quant | Aphrodite |
| Prod wrapper | Triton, TGI |
| Serverless API | Together, Fireworks |

2. PagedAttention and Continuous Batching — The Origin Story

The inference world is split into before and after vLLM's 2023 SOSP paper (Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention").

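**Problem:** KV cache length varies per sequence and is not known in advance. Pre-allocating contiguous memory for the maximum length wastes a large share of it: the vLLM authors report 60-80 percent lost to fragmentation and over-reservation in earlier systems. Mix short and long requests and it gets worse.

**Solution:** Slice KV cache into fixed-size pages like an OS virtual memory system and map via page tables. Waste shrinks to the last partially filled page of each sequence (under 4 percent in the vLLM authors' measurements).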

    Old way (contiguous):        PagedAttention:
    [KKKK_____] (50% wasted)     [P1][P2][P3]  (page table)
    [KKK______] (60% wasted)     [P4][P5]      (allocate on demand)
    [KKKKKKKK_] (10% wasted)
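
The page-table idea can be sketched as a toy allocator. This is an illustration of the bookkeeping only, under simplifying assumptions (no sharing, no copy-on-write, no eviction); the class and method names are hypothetical and do not mirror vLLM's internals.

```python
class PagedKVAllocator:
    """Toy block-table allocator: fixed-size pages handed out on demand."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_table = {}   # seq_id -> list of physical page ids
        self.seq_len = {}      # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve KV space for one more token; allocate a page only when needed."""
        n = self.seq_len.get(seq_id, 0)
        if n % self.page_size == 0:          # current page is full (or none yet)
            if not self.free_pages:
                raise MemoryError("out of KV pages")
            self.page_table.setdefault(seq_id, []).append(self.free_pages.pop())
        self.seq_len[seq_id] = n + 1

    def free(self, seq_id):
        """Return a finished sequence's pages to the pool."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.seq_len.pop(seq_id, None)

    def wasted_slots(self):
        """Unused token slots inside allocated pages (internal fragmentation)."""
        allocated = sum(len(p) for p in self.page_table.values()) * self.page_size
        return allocated - sum(self.seq_len.values())
```

Because a page is allocated only when the previous one fills up, each sequence wastes less than one page, regardless of how long it eventually grows.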

On top of this comes **Continuous Batching** (Orca paper, 2022). Static batching waits until every request in the batch finishes. Continuous batching slots in new requests at token granularity — the GPU never sits idle.

**These two are table stakes in 2026.** Without them, you do not have a serious inference engine.
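
To see why continuous batching matters, a small step-count simulation helps. The sketch below assumes every decode iteration costs the same regardless of batch occupancy and ignores prefill entirely; both functions are hypothetical helpers written for this comparison.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot for a waiting one at the next step."""
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1                                   # one decode iteration
        running = [r - 1 for r in running if r > 1]  # drop finished requests
    return steps
```

With output lengths `[2, 8, 2, 8]` and a batch size of 2, the static scheduler needs 16 iterations while the continuous one needs 12, because short requests no longer hold a slot hostage until the long one in their batch completes.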

3. vLLM — The De Facto Standard

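[vLLM](https://github.com/vllm-project/vllm) began in 2023 at UC Berkeley's Sky Computing Lab. By 2026 it is effectively the default for open-source inference. Governance moved under the Linux Foundation umbrella: the project joined the LF AI and Data Foundation in 2024 and became a PyTorch Foundation hosted project in 2025.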

**vLLM V1 engine (0.7+):** Released in 2025, V1 is 1.5-2x faster than V0, with async scheduling, chunked prefill, torch.compile integration, and built-in multimodal support.

Key features:

| Feature | Description |
| --- | --- |
| Prefix Caching | Cache KV of shared prompts; 90 percent TTFT reduction when reusing system prompts |
| Speculative Decoding | Draft model or EAGLE-3 for 2-3x decode speedup |
| Chunked Prefill | Split long prompts into chunks and interleave them with decode; stabilizes P99 |
| Tensor Parallelism | Split one model across many GPUs (NVLink recommended) |
| Pipeline Parallelism | Split layers across GPUs; can cross nodes |
| Multi-LoRA | Serve many LoRA adapters concurrently |
| Structured Output | xgrammar / outlines integration |
| Guided Decoding | JSON schema, regex, context-free grammar |
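
Chunked prefill is easiest to understand as a per-iteration token budget. The following is a minimal sketch of that idea, assuming decode tokens are scheduled first and one waiting prompt takes whatever budget remains; the function names and the budget parameter are hypothetical and not vLLM's scheduler code.

```python
def schedule_step(decoding, prefill_remaining, token_budget):
    """Plan one iteration: decode tokens first, then a prefill chunk with what is left.

    decoding:          number of sequences currently decoding (1 token each).
    prefill_remaining: prompt tokens still to be prefilled for the waiting request.
    Returns (decode_tokens, prefill_chunk).
    """
    decode_tokens = min(decoding, token_budget)
    prefill_chunk = min(prefill_remaining, token_budget - decode_tokens)
    return decode_tokens, prefill_chunk

def steps_to_prefill(prompt_len, decoding, token_budget):
    """Iterations needed to finish one prompt while decodes keep running."""
    steps = 0
    while prompt_len > 0:
        _, chunk = schedule_step(decoding, prompt_len, token_budget)
        if chunk == 0:
            raise ValueError("budget too small to make prefill progress")
        prompt_len -= chunk
        steps += 1
    return steps
```

With 32 active decodes and a 512-token budget, an 8,000-token prompt is spread over 17 iterations of 480 prompt tokens or fewer, so the running decodes never wait behind one monolithic prefill.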

Supported models: Llama 3.x and 4, Qwen 3, Mistral, DeepSeek V3 and R1, Gemma 3, Phi-4, Mixtral, Command-R, GPT-OSS, Granite — over 100 in total.

Typical throughput (H100 80GB, Llama 3.1 8B FP16):

- vLLM V1: roughly 6,000-8,000 tok/s (batched, short sequences).

- Single request decode: roughly 130 tok/s.

4. SGLang — Strong on Prefix and Structured Output

[SGLang](https://github.com/sgl-project/sglang) was announced by LMSYS Org in 2024 and by 2026, with version 0.4, competes head to head with vLLM V1.

**Core features:**

- **RadixAttention** — KV cache organized as a tree, auto-sharing prefixes. Faster than vLLM Prefix Caching on workloads heavy in system prompts and few-shot.

- **Structured Output** — `regex`, `JSON Schema`, `EBNF`, `Choice`, `gen()` DSL as first-class citizens. Near-zero cost via xgrammar integration.

- **OpenAI Compatible Server** — `/v1/chat/completions` plus extended `/generate`.

- **DP Attention** — data-parallel attention, optimized for DeepSeek V3 MLA.

- **Day-0 model support** — Llama 4, DeepSeek R1, Qwen 3 supported on launch day as a habit.

SGLang shines on **agentic and RAG workloads with heavy prefix overlap**. Anthropic's Claude API reportedly uses a tree-cache internally that resembles SGLang's design (unconfirmed but structurally similar).
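
The payoff of tree-structured prefix reuse can be illustrated with a plain token trie. This is a simplified stand-in for the idea, not SGLang's RadixAttention: a real radix tree compresses edges, stores KV page references at nodes, and evicts under memory pressure. All names here are hypothetical.

```python
class PrefixCache:
    """Toy token trie: records which prompt prefixes already have KV computed."""

    def __init__(self):
        self.root = {}

    def match(self, tokens):
        """Return how many leading tokens are already cached."""
        node, hit = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node = node[t]
            hit += 1
        return hit

    def insert(self, tokens):
        """Mark every prefix of `tokens` as cached."""
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

def prefill_cost(cache, prompts):
    """Total tokens that still need prefill when prompts arrive in order."""
    cost = 0
    for p in prompts:
        cost += len(p) - cache.match(p)
        cache.insert(p)
    return cost
```

Three prompts that share a 100-token system prompt cost 104 prefill tokens with the cache instead of 305 without it, which is the effect that dominates agentic and RAG traffic.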

5. TensorRT-LLM — King on Blackwell and H100

[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) is NVIDIA's own inference engine. The library is open source under Apache 2.0, but it runs only on NVIDIA GPUs, and the packaged NIM distribution is a commercial product.

**Why it is fast:**

- Built on the TensorRT runtime, models compile to engine "plan" files.

- CUDA kernels hand-tuned for H100, H200, B100, B200.

- Native FP8 (Hopper) and FP4 (Blackwell) support.

- In-flight batching, paged KV, speculative decoding (Medusa, EAGLE).

**Throughput benchmark (Llama 3.1 70B, 8x H100, FP8):**

- TensorRT-LLM: about 13,000-15,000 tok/s.

- vLLM: about 9,000-11,000 tok/s.

The gap varies by workload and model, plus or minus 30 percent. On Blackwell, FP4 widens it.

**Downsides:** Engine builds are heavy (recompile per model, seq length, batch), debugging is hard, model additions track NVIDIA's roadmap. The clean path is to package it as **NIM (NVIDIA Inference Microservices)** and ship as containers.

6. Hugging Face TGI 3.x — Rewritten in Rust

[Text Generation Inference](https://github.com/huggingface/text-generation-inference) (TGI) is HF's official inference server. It is the backend for HF Inference Endpoints.

**TGI 3.x (2025-2026) changes:**

- Router and launcher are Rust; model execution is PyTorch.

- vLLM and TensorRT-LLM backends are selectable — TGI is evolving into a "frontend plus routing plus multi-backend orchestrator."

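- **3.0 long-context**: Hugging Face reports replies up to 13x faster than vLLM on very long prompts (200K+ tokens), largely because the conversation prefix stays cached, and roughly 3x more tokens in the same GPU memory (TGI v3 overview, December 2024).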

- gRPC, REST, and Messages API.

- Prefix Caching, Flash Attention 2 and 3, paged KV are default.

The strength is **HF ecosystem integration** — clean handoff with `transformers`, `datasets`, AutoTrain. Absolute throughput is a little below vLLM and SGLang.

7. llama.cpp — A Universe Built by One Person

[llama.cpp](https://github.com/ggml-org/llama.cpp) is Georgi Gerganov's pure C/C++ inference engine, started in 2023. No dependencies. Builds anywhere.

**Why overwhelmingly popular:**

- 4-bit quantized models run on CPU alone. M2/M3/M4 Macs, Raspberry Pi, Android, iOS, even WASM.

- **GGUF format** — packs metadata, weights, tokenizer, and chat template into one file. The 2026 de facto standard.

- Backends: CPU (AVX2, AVX-512, NEON), CUDA, Metal (Apple), Vulkan, SYCL (Intel GPU), HIP (AMD), Kompute.

- Quantization: Q2_K through Q8_0, importance-aware IQ1_S to IQ4_XS, K-quants.

- **ggml** — the backend tensor library, the heart of llama.cpp.

**Limits:** Single-GPU prefill is slower than vLLM, and Continuous Batching was a late addition (`-cb`). Still, **local inference means llama.cpp**, and nothing has displaced it.

`Ollama`, `LM Studio`, `Jan`, `GPT4All`, `Kobold.cpp`, `text-generation-webui` — most of them are llama.cpp underneath.

8. MLX and MLX-LM — Apple Silicon's Answer

[MLX](https://github.com/ml-explore/mlx) is Apple's first-party PyTorch alternative. It treats **Unified Memory** as a first-class citizen — CPU and GPU see the same memory.

**MLX differentiators:**

- Lazy evaluation, dynamic graph.

- Runs on the CPU and the GPU (via Metal) of M-series chips; it does not target the Neural Engine.

- NumPy-compatible API, PyTorch-like `nn.Module`.

**MLX-LM:** [mlx-lm](https://github.com/ml-explore/mlx-lm) provides a Hugging Face Transformers-style interface for LLMs and supports inference, training, and fine-tuning.

**Why it matters:** An M3 Ultra Mac Studio with 512GB can run Llama 3.1 405B at 4-bit. One box. A model that needs 8x H100 runs on a desktop. Throughput is one-fifth to one-tenth of an H100, but **for local inference, Apple is the only credible competitor to NVIDIA.**

Comparison: llama.cpp Metal backend vs MLX-LM

- llama.cpp Metal: stable, broad quantization, GGUF.

- MLX-LM: faster prefill, trainable, MLX quantization (4-bit and 8-bit).

9. mistral.rs — A Multi-Model Engine in Rust

[mistral.rs](https://github.com/EricLBuehler/mistral.rs) started as Eric Buehler's one-person project and by 2026 is a serious option.

**Features:**

- Pure Rust, built on Hugging Face's candle.

- Quantization: GGUF, GGML, ISQ (in-situ quantization at load time).

- Multi-model concurrent serving (multi-adapter).

- Vision model support (LLaVA, Phi Vision, Llama 3.2 Vision, Pixtral, Qwen2-VL).

- OpenAI API compatible.

- CUDA, Metal, Accelerate, MKL backends.

**Why Rust:** memory safety, no GC, small container images (tens of MB), fast cold start. A great fit for **serverless inference**.

It sits between llama.cpp and vLLM — not as featherweight as llama.cpp, but efficient on GPU for quantized models, and not as heavy as vLLM.

10. DeepSpeed-MII and DeepSpeed Inference — Microsoft's Answer

[DeepSpeed-MII](https://github.com/deepspeedai/DeepSpeed-MII) is the inference library from Microsoft's DeepSpeed team.

**Key features:**

- **Blocked KV Cache** (analogous to vLLM's PagedAttention).

- **Continuous Batching** plus Dynamic SplitFuse — splits long prompts at token granularity and interleaves with decode.

- **Tensor Parallelism** — the core strength of DeepSpeed Inference.

- **ZeRO-Inference** — offload KV and weights to CPU or NVMe. Squeeze big models into small GPUs.

**Why interesting:** DeepSpeed started as a training framework, MII is its inference specialization. Parts of Microsoft's own inference (Bing, Copilot) are reported to use MII.

The downside: narrower model support and slower development pace than vLLM and SGLang.

11. Aphrodite Engine — vLLM Fork for Quantization

[Aphrodite Engine](https://github.com/aphrodite-engine/aphrodite-engine) is PygmalionAI's quantization-friendly fork of vLLM.

**Aphrodite vs vLLM:**

- **All quantization formats**: GPTQ, AWQ, EXL2, GGUF, SqueezeLLM, Marlin, FP8, FP6, FP5, FP4, AQLM, HQQ.

- Better support on older GPUs (back to Turing and Volta).

- LoRA on top of quantized models.

- Extra decoding samplers (DRY, Mirostat, XTC, Smoothing) — character-chat specialization.

**Who uses it:** the local hosting community, character AI sites, and teams that take quantized serving seriously. vLLM is mainstream; Aphrodite is the quantization specialty shop.

12. CTranslate2, ExLlamaV3, OpenVINO — Special-Purpose Engines

**CTranslate2** ([ctranslate2](https://github.com/OpenNMT/CTranslate2)) — OpenNMT team's C++ engine. Built for NMT (translation) and Transformer models. Very fast on CPU, strong INT8/INT16 quantization. `faster-whisper` (a fast Whisper variant) sits on top of CTranslate2.

**ExLlamaV3** ([exllamav3](https://github.com/turboderp-org/exllamav3)) — turboderp's NVIDIA Turing-plus optimized engine. The EXL3 quantization format succeeds EXL2. Best at squeezing 4-bit models into minimal memory. One of the default backends in text-generation-webui.

**OpenVINO** ([openvino](https://github.com/openvinotoolkit/openvino)) plus **Intel Neural Compressor** — Intel CPU, GPU, and NPU inference. Uses Intel Arc and Core Ultra NPUs. ONNX conversion, INT8 and INT4 quantization. Good for running LLMs on datacenter Xeons or Lunar Lake laptops.

**AWS Inferentia and Trainium plus Neuron SDK**: AWS's in-house accelerators, Inferentia2 (Inf2) for inference and Trainium2 (Trn2) for training and large-model inference. The **Neuron SDK** compiles PyTorch models to run on both. Serve big models like Llama 3 405B on a Trn2 UltraServer. SageMaker integration. In AWS environments, cost and power-per-token can beat GPUs.

13. Triton Inference Server — The Production Wrapper

[Triton](https://github.com/triton-inference-server/server) is NVIDIA's general-purpose inference server. Not LLM-only — it serves **every model type (CNN, BERT, XGBoost, TensorRT-LLM, vLLM)** behind one interface.

**Triton's role:**

- **Backend abstraction** — TensorRT-LLM, vLLM, ONNX, Python, PyTorch under one Triton server.

- **Model ensembles** — tokenizer to model to post-processing as a pipeline.

- **Dynamic batching** — per-model batching policy.

- **Metrics** (Prometheus, OpenTelemetry).

- **A/B testing, canary, model versioning**.

NVIDIA NIM is essentially "Triton plus TensorRT-LLM plus a container plus a Helm chart."

14. The Quantization Zoo

Engine choice is inseparable from **quantization format compatibility**.

| Format | Bits | Main engines | Notes |
| --- | --- | --- | --- |
| FP16 / BF16 | 16 | All | Baseline, near-zero loss |
| FP8 (E4M3 / E5M2) | 8 | TensorRT-LLM, vLLM | Native on H100 and H200 |
| FP4 (E2M1) | 4 | TensorRT-LLM (Blackwell) | Native on B100 and B200 |
| MXFP4 | 4 | TensorRT-LLM | Microscaling FP4 (one shared scale per small block) |
| GGUF | 2-8 | llama.cpp, mistral.rs, Aphrodite | Local standard |
| GPTQ | 4 | vLLM, TGI, Aphrodite | Calibration-based, fast |
| AWQ | 4 | vLLM, TGI, Aphrodite | Activation-aware, accurate |
| EXL2 / EXL3 | 1.5-8 | ExLlama | Variable bits, NVIDIA only |
| EETQ | 8 | TGI | Weight-only INT8 |
| BitNet 1.58 | 1.58 | bitnet.cpp | Ternary (-1, 0, 1) |
| HQQ | 2-8 | Aphrodite, transformers | Calibration-free |
| AQLM | 2 | Aphrodite | Extreme 2-bit compression |

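**Rule of thumb:** weights take about (parameters x bits / 8) bytes. Quantize Llama 3 70B to 4-bit and the footprint drops from about 140GB (FP16) to about 35GB. That still overflows a single RTX 3090 (24GB); it takes two such cards, or a format near 2.5 bits per weight, to fit. An 8B model at 4-bit is about 4GB and fits easily. **You pick quantization knowing your VRAM ceiling.**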
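
The VRAM check is plain arithmetic and worth scripting. The sketch below ignores quantization metadata (scales, zero points) and treats 1 GB as 10^9 bytes; the helper names and the 2 GB default reserve for KV cache and activations are assumptions for illustration.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight memory in GB (10^9 bytes), ignoring quantization metadata."""
    return params_billion * bits_per_weight / 8

def max_bits_that_fit(params_billion, vram_gb, reserve_gb=2.0):
    """Largest average bits per weight that leaves reserve_gb for KV cache and activations."""
    return (vram_gb - reserve_gb) * 8 / params_billion

fp16_70b = weight_gb(70, 16)          # 140.0 GB
int4_70b = weight_gb(70, 4)           # 35.0 GB
int4_8b = weight_gb(8, 4)             # 4.0 GB
limit_3090 = max_bits_that_fit(70, 24)  # about 2.5 bits per weight
```

Running the numbers before downloading a checkpoint avoids the common surprise of a "4-bit 70B" file that is larger than the card it was meant for.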

15. KV Cache Management — The Second Memory War

After weights, KV cache is the next memory hog. Llama 3 70B at 128K context can consume about 40GB just for KV.
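
The 40GB figure follows directly from the model shape. The sketch below assumes the published Llama 3 70B configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and FP16 storage; the function name is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes of KV cache for one sequence: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Llama 3 70B shape: 80 layers, 8 KV heads (GQA), head_dim 128.
per_token = kv_cache_bytes(80, 8, 128, seq_len=1)                         # 327,680 bytes (320 KiB)
full_ctx = kv_cache_bytes(80, 8, 128, seq_len=128 * 1024)                 # 40 GiB at FP16
full_ctx_fp8 = kv_cache_bytes(80, 8, 128, 128 * 1024, bytes_per_elem=1)   # 20 GiB at FP8
```

Halving `bytes_per_elem` halves the cache, which is why 8-bit KV is usually the first knob to turn on long-context workloads.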

**KV quantization techniques:**

| Technique | Description |
| --- | --- |
| FP8 KV | Store KV in FP8; supported in vLLM and TensorRT-LLM, usually as an opt-in setting |
| INT8 KV | 8-bit integer KV with stored scales, mild loss |
| KIVI | 2-bit asymmetric KV quantization; research method |
| Quantized KV (vLLM) | User flag `kv_cache_dtype=fp8_e4m3` |
| Paged KV | PagedAttention removes fragmentation (Section 2) |
| Prefix Cache | Reuse KV across shared prefixes |
| Sliding Window | Drop KV outside the window (Mistral and others) |
| Compressed | H2O and SnapKV keep only the important tokens |

To really cut inference cost in 2026, **FP8 KV plus Prefix Cache** is effectively mandatory, with **Sliding Window** added on models that were trained with it.

16. Speculative Decoding — Decode 2-3x Faster

Decode is memory-bound — compute on the GPU sits idle. **Speculative decoding** uses that slack.
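
"Memory-bound" has a simple quantitative reading: each generated token must stream every weight from HBM once, so single-stream speed is capped by bandwidth divided by model size. The sketch below uses assumed figures (an 8B model in FP16 at 16e9 bytes, and about 3.35e12 bytes/s for an H100 SXM); the function name is hypothetical, and KV reads are ignored.

```python
def decode_tok_per_s_upper_bound(weight_bytes, mem_bandwidth_bytes_per_s):
    """Single-stream bound: every generated token must stream all weights once."""
    return mem_bandwidth_bytes_per_s / weight_bytes

# Assumed figures: 8B parameters x 2 bytes, H100 SXM HBM3 bandwidth.
bound = decode_tok_per_s_upper_bound(16e9, 3.35e12)   # about 209 tok/s
```

The roughly 130 tok/s single-request figure quoted in Section 3 sits under this ceiling, and the arithmetic units are mostly idle while the weights stream in. That idle compute is what batching and speculative decoding spend.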

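**Idea:** A small "draft" model proposes N tokens quickly. A large "target" model verifies all N in one pass. The longest matching prefix is accepted; everything from the first mismatch onward is discarded and replaced by the target's own token, so the output is identical to what the target alone would have produced.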
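
One round of the draft-then-verify loop can be sketched with two toy "models" that map a token list to the next token. This is the greedy variant only; sampling-based speculative decoding uses a probabilistic accept/reject rule that is not shown here. The function and parameter names are hypothetical.

```python
def speculative_step(target_next, draft_next, context, n_draft):
    """One round of greedy speculative decoding.

    target_next / draft_next: functions mapping a token list to the next token.
    Returns the tokens accepted this round (always at least one).
    """
    # 1. Draft model proposes n_draft tokens autoregressively.
    proposal, ctx = [], list(context)
    for _ in range(n_draft):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. Target checks each position. A real engine scores all positions in one
    #    batched forward pass; the loop here only mimics that result.
    accepted, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)   # replace the first mismatch with the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. All drafts matched: the same target pass also yields one bonus token.
    accepted.append(target_next(ctx))
    return accepted
```

Every round emits at least one target-approved token, so a poor draft model costs time but never changes the output.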

**Main variants:**

| Method | Description | Speedup |
| --- | --- | --- |
| Draft plus Target | Separate small model (Llama 3 8B drafting for Llama 3 70B) | 2-3x |
| Medusa | Add multiple LM heads to the model | 2x |
| EAGLE / EAGLE-2 / EAGLE-3 | Feature-level draft, higher acceptance | 3-4x |
| Lookahead | n-gram self-speculation | 1.5-2x |
| Prompt Lookup | Copy tokens from input | Strong on code and summaries |
| ReDrafter | Apple's recurrent (RNN) draft head, also supported in TensorRT-LLM | 2-3x |

vLLM, SGLang, and TensorRT-LLM all support EAGLE-3 as of 2025-2026. With acceptance above 80 percent, decode runs close to 3x faster.
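
The link between acceptance rate and speedup can be estimated with the standard expected-length formula from the speculative decoding literature. The sketch assumes each draft token is accepted independently with the same probability, which real workloads only approximate; the draft length of 4 and the 5 percent draft cost ratio are illustrative assumptions.

```python
def expected_tokens_per_round(acceptance, n_draft):
    """Expected tokens produced per target pass, assuming each draft token is
    accepted independently with probability `acceptance`."""
    if acceptance >= 1.0:
        return n_draft + 1
    return (1 - acceptance ** (n_draft + 1)) / (1 - acceptance)

def expected_speedup(acceptance, n_draft, draft_cost_ratio):
    """Speedup over plain decoding; draft_cost_ratio = draft step time / target step time."""
    return expected_tokens_per_round(acceptance, n_draft) / (n_draft * draft_cost_ratio + 1)

tokens = expected_tokens_per_round(0.8, 4)   # about 3.36 tokens per target pass
speedup = expected_speedup(0.8, 4, 0.05)     # about 2.8x
```

At 80 percent acceptance with four drafted tokens the model yields about 3.4 tokens per target pass, and after paying for the draft steps the net gain is about 2.8x.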

17. Disaggregated Inference — Split Prefill and Decode

The biggest architectural shift of 2024-2026.

**Problem:** Prefill is compute-bound, decode is memory-bound. Mix them on the same GPU and neither is efficient. Continuous batching helps, but P99 still wobbles.

**Solution:** Physically split a **Prefill cluster** from a **Decode cluster**. Prefill GPUs produce KV and ship it over a fast network (RDMA, NVLink, InfiniBand) to Decode GPUs. Decode GPUs only decode.

**Notable systems:**

- **Mooncake** (Moonshot AI, Kimi) — KV cache separated into a distributed cache.

- **Splitwise** (Microsoft Research) — Hopper and Ampere mixed by role.

- **vLLM Disaggregated Prefill** — experimental in V1.

- **NVIDIA Dynamo** (2025) — NVIDIA's disaggregated serving framework.

**Why it pays:** Prefill GPUs can be H100 while decode GPUs are cheaper L40S or A10. **You can mix GPU classes per phase.** Reports at scale show 30-50 percent cost savings.
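
Disaggregation only pays if shipping the KV cache is cheap relative to prefill itself, and that is easy to estimate. The sketch below assumes about 320 KiB of FP16 KV per token (a Llama 3 70B-shaped model) and a link that sustains 50 GB/s, roughly the peak of a 400 Gb/s fabric; real transfers see protocol overhead and contention. The function name is hypothetical.

```python
def kv_transfer_ms(prompt_tokens, kv_bytes_per_token, link_gbytes_per_s):
    """Time to ship one request's KV cache from a prefill GPU to a decode GPU."""
    return prompt_tokens * kv_bytes_per_token / (link_gbytes_per_s * 1e9) * 1e3

# Assumed: 8,192-token prompt, 327,680 bytes of KV per token, 50 GB/s link.
ms = kv_transfer_ms(8192, 327_680, 50)   # about 54 ms
```

An added hop of a few tens of milliseconds is tolerable for long prompts, but the same calculation on a slow Ethernet link shows why these systems insist on RDMA-class interconnects.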

18. NVIDIA NIM, Triton, Dynamo — The Enterprise Stack

**NIM (NVIDIA Inference Microservices)**: essentially "TensorRT-LLM plus Triton plus model weights" packaged into one container. One `docker run` and you have an OpenAI-compatible server. Requires NVIDIA AI Enterprise. Listed on Azure, AWS, and GCP marketplaces.

**NVIDIA Dynamo (2025)**: Triton's successor. Disaggregated inference, KV cache routing, GPU pool management as first-class. Open source (Apache 2.0).

**Why it matters:** When an enterprise asks "deploy Llama 4 on our H100 cluster," the answer is typically NIM. Many of them do not want to operate raw vLLM.

19. Ollama, LM Studio, Jan — Desktop Inference

**Ollama** ([ollama](https://github.com/ollama/ollama)) — a Go-based llama.cpp wrapper. `ollama run llama4` downloads and runs in one line. The most popular desktop runner on macOS, Linux, and Windows. Model registry at `ollama.com`.

**LM Studio** — Electron GUI on llama.cpp and MLX backends. General-user oriented, with a built-in OpenAI-compatible server.

**Jan** — fully open-source ChatGPT alternative, llama.cpp, Tabby, or remote API backends. Privacy-first.

**GPT4All, Kobold.cpp, text-generation-webui** — each carves a category (personal, character, research).

**Common thread:** under the hood, nearly all of them run llama.cpp or MLX. The differentiation is GUI, model management, and chat interface.

20. Alternative Hardware — Groq, Cerebras, SambaNova

Attempts to break NVIDIA's monopoly. In 2026, the ones putting up real throughput:

**Groq LPU** ([groq.com](https://groq.com)): Language Processing Unit. Each chip carries about 230MB of on-chip SRAM and no HBM, so a model is sharded across many chips. Result: **about 500 tok/s single-stream decode on Llama 3 70B**, roughly 5-10x a typical NVIDIA GPU deployment. Limits include context length and narrower model coverage.

**Cerebras CS-3** ([cerebras.ai](https://cerebras.ai)): one wafer is one chip. The WSE-3 inside it has 900K AI cores. About 450 tok/s on Llama 3 70B. API access only.

**SambaNova SN40L** ([sambanova.ai](https://sambanova.ai)) — Reconfigurable Dataflow Unit. Can serve Llama 3.1 405B on a single node. Available via API and on-prem appliance.

These beat NVIDIA on workloads where **speed is the absolute value** (real-time voice, code autocomplete, multi-step agents). On cost per unit throughput, NVIDIA still wins.

21. Inference API Pricing — The Self-Host Decision Line

At the fork between self-host and API, price is the deciding variable. Rough rates in May 2026 (USD per million tokens, output basis):

| Provider | Model | Price (USD per million output tokens) |
| --- | --- | --- |
| Together.ai | Llama 3.1 70B | about 0.88/M |
| Fireworks | Llama 3.1 70B | about 0.90/M |
| DeepInfra | Llama 3.1 70B | about 0.60/M |
| Replicate | Llama 3.1 70B | about 2.75/M |
| Anyscale | Llama 3.1 70B | about 1.00/M |
| Groq | Llama 3 70B | about 0.79/M (speed premium) |
| SambaNova | Llama 3.1 405B | about 5/M |
| AWS Bedrock | Claude 3.5 Sonnet | about 15/M (reference) |

(Prices written as "about 0.88/M" instead of using the dollar sign next to digits, to avoid MDX LaTeX interpretation issues.)

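**Self-hosting wins when:** sustained load stays above roughly 30 to 50 percent of a GPU's capacity and you run 24/7. The exact break-even depends on the API price you compare against; the worked example in Section 22 breaks even near 50 percent. Below 30 percent, the API is almost always cheaper.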

22. Self-Hosting ROI — H100 vs H200

Rough North American figures, May 2026:

- H100 80GB one-year lease: about 25,000-32,000 per year (pre-volume discount).

- H200 141GB one-year lease: about 35,000-45,000 per year.

- B200: about 60,000-80,000 per year (limited supply).

**Assume Llama 3.1 70B FP8 on 4x H100 with vLLM:**

- Throughput: about 9,000 tok/s sustained.

- Daily tokens: about 780M.

- Monthly tokens: about 23B (23,000M).

- Monthly cost (4 x H100 x 730 h x about 3.5/h): about 10,000.

- Cost per token: about 0.43/M.

The same workload via API at about 0.88/M: 23,000M times about 0.88 equals about 20,000 per month. **Self-hosting is half the price.**

Caveat: this assumes 24/7 full utilization. At 50 percent utilization, the cost edge disappears.
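
The worked example above reduces to two small functions, which makes it easy to rerun with your own numbers. The inputs mirror the figures in this section (4 GPUs, about 3.5 per GPU-hour, 9,000 tok/s, API at about 0.88/M); the function names are hypothetical and the model ignores engineering time, storage, and networking.

```python
def self_host_cost_per_million(gpus, usd_per_gpu_hour, peak_tok_per_s, utilization,
                               hours_per_month=730):
    """USD per million output tokens for a self-hosted deployment."""
    monthly_cost = gpus * usd_per_gpu_hour * hours_per_month
    monthly_tokens_m = peak_tok_per_s * utilization * hours_per_month * 3600 / 1e6
    return monthly_cost / monthly_tokens_m

def break_even_utilization(gpus, usd_per_gpu_hour, peak_tok_per_s, api_usd_per_million):
    """Utilization at which self-hosting matches the API price."""
    full = self_host_cost_per_million(gpus, usd_per_gpu_hour, peak_tok_per_s, 1.0)
    return full / api_usd_per_million

full_util = self_host_cost_per_million(4, 3.5, 9000, 1.0)   # about 0.43 per million
half_util = self_host_cost_per_million(4, 3.5, 9000, 0.5)   # about 0.86 per million
threshold = break_even_utilization(4, 3.5, 9000, 0.88)      # about 0.49
```

Under these assumptions the break-even sits just under 50 percent utilization, and it moves quickly: against a cheaper API price the threshold rises, and against a pricier one it falls.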

**Effect of H200 and B200:** HBM grows from 80GB (H100) to 141GB (H200) to 192GB (B200), letting bigger models fit on fewer nodes. More KV fits too, and throughput rises 1.3-1.8x. Unit cost drops further.

23. Korean Inference Infrastructure

**Naver Cloud HyperCLOVA X Inference** — internal and external inference for the HyperCLOVA X model family. Runs in self-operated datacenters (Chuncheon, Sejong) with H100 and H200 clusters. Public API opened in 2025 (Naver Cloud Platform).

**Kakao i Cloud** — Kakao's LLM infrastructure. Korean-specialized LLM inference and multimodal (KoCLIP, KaLM).

**Upstage** ([upstage.ai](https://www.upstage.ai)) — Solar model series, **Predibase** partnership for fine-tuning and serving, **AWS Foundry**-certified Korean company. Strong in Document AI workflows.

**Lablup Backend.AI** ([lablup.com](https://www.lablup.com)) — GPU cluster management plus inference serving platform. Operate vLLM, TGI, and Triton via GUI. Widely deployed in government and large-enterprise on-prem GPU clusters.

**KT Cloud GPU Farm** — integrated 5G plus AI, NVIDIA H100 cluster.

**SK Telecom AI Pyramid** — inference for the in-house 'A.X' LLM, partnerships with Rakuten in Japan and others.

**42dot** — Hyundai Motor Group autonomous driving plus LLM, with self-operated GPU infrastructure.

24. Japanese Inference Infrastructure

**Sakana AI** ([sakana.ai](https://sakana.ai)) — Tokyo-based AI startup. Evolutionary Model Merging combines many small models. Operates its own inference services.

**Preferred Networks (PFN)** ([preferred.jp](https://www.preferred.jp)) — in-house MN-Core accelerator, the PLaMo Japanese LLM family, inference in self-operated datacenters.

**SoftBank** — large NVIDIA GPU investment in the Stargate Japan datacenter, inference workloads for Cristal Intelligence (OpenAI partnership).

**Rakuten** — Rakuten AI 2.0 (Mixtral-based), Mistral partnership, AWS Inferentia and Trainium adoption.

**LINE Yahoo** — Japan's largest messenger, runs in-house LLM inference (Llama-based plus self-trained) for Yahoo Search and AI.

**NTT** — tsuzumi models (1B-10B, Japanese-efficient), inference via NTT's own cloud.

25. Engine Selection Guide by Workload

| Workload | Recommended engine | Why |
| --- | --- | --- |
| General chatbot (open-source 70B) | vLLM | Standard, just works |
| Agentic and RAG (long prefix) | SGLang | RadixAttention tree cache |
| Lowest TTFT (voice, realtime) | TensorRT-LLM or Groq | Compiled and specialized |
| Dev on Mac | MLX-LM or llama.cpp | Unified Memory |
| Local server (single RTX 3090) | llama.cpp or Aphrodite | 4-bit quantization |
| Big model with few GPUs | DeepSpeed-MII plus ZeRO Inference | NVMe offload |
| HF ecosystem integration | TGI | Matches Inference Endpoints |
| Multi-model plus A/B testing | Triton plus vLLM | Model ensembles and versioning |
| Apple Silicon (personal) | Ollama (llama.cpp) | Easiest |
| Quantization variety | Aphrodite | Every format |
| Cloud serverless | Together.ai or Fireworks | API |
| AWS environments | Bedrock or Inferentia | Neuron SDK |
| Absolute speed | Groq or Cerebras | Specialized hardware |
| Japanese | NTT tsuzumi or PLaMo | Tokenizer |
| Korean | HyperCLOVA X or Solar | Tokenizer |

26. Conclusion — Engines Are Decided by Workloads

The conclusion in May 2026 is sharp:

1. **There is no single right engine.** The workload picks the engine.

2. **vLLM is a safe default.** Works anywhere, biggest community. SGLang when prefix overlap is heavy. TensorRT-LLM when absolute throughput is required.

3. **Local equals llama.cpp or MLX.** No other choice.

4. **Quantization is non-optional.** The era of serving 70B in FP16 is over.

5. **KV management, speculative decoding, and disaggregation** decide cost — more than model choice.

6. **Self-hosting ROI is decided by utilization.** Below 30 percent, API wins.

7. **Non-NVIDIA alternatives** (Groq, Cerebras, AWS Inferentia, Apple Silicon) finally hold meaningful share.

Next steps: measure your workload, benchmark three candidate engines, fit P99 and cost into the SLA, then ship. Measure, do not guess.

27. References

- vLLM: https://github.com/vllm-project/vllm

- vLLM paper (PagedAttention, SOSP 2023): https://arxiv.org/abs/2309.06180

- SGLang: https://github.com/sgl-project/sglang

- SGLang blog (RadixAttention): https://lmsys.org/blog/2024-01-17-sglang/

- TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM

- Hugging Face TGI: https://github.com/huggingface/text-generation-inference

- TGI 3.0 long context blog: https://huggingface.co/blog/tgi-v3-overview

- llama.cpp: https://github.com/ggml-org/llama.cpp

- ggml: https://github.com/ggml-org/ggml

- MLX: https://github.com/ml-explore/mlx

- MLX-LM: https://github.com/ml-explore/mlx-lm

- mistral.rs: https://github.com/EricLBuehler/mistral.rs

- DeepSpeed-MII: https://github.com/deepspeedai/DeepSpeed-MII

- Aphrodite Engine: https://github.com/aphrodite-engine/aphrodite-engine

- CTranslate2: https://github.com/OpenNMT/CTranslate2

- ExLlamaV3: https://github.com/turboderp-org/exllamav3

- OpenVINO: https://github.com/openvinotoolkit/openvino

- AWS Neuron SDK: https://github.com/aws-neuron/aws-neuron-sdk

- Triton Inference Server: https://github.com/triton-inference-server/server

- NVIDIA Dynamo: https://github.com/ai-dynamo/dynamo

- Mooncake paper: https://arxiv.org/abs/2407.00079

- Splitwise paper: https://arxiv.org/abs/2311.18677

- EAGLE-3 paper: https://arxiv.org/abs/2503.01840

- BitNet b1.58 paper: https://arxiv.org/abs/2402.17764

- Orca (continuous batching): https://www.usenix.org/conference/osdi22/presentation/yu

- Ollama: https://github.com/ollama/ollama

- LM Studio: https://lmstudio.ai

- Jan: https://github.com/janhq/jan

- Groq: https://groq.com

- Cerebras: https://cerebras.ai

- SambaNova: https://sambanova.ai

- Sakana AI: https://sakana.ai

- Preferred Networks: https://www.preferred.jp

- Upstage: https://www.upstage.ai

- Lablup Backend.AI: https://www.lablup.com
