LLM Serving & Local Inference in 2026 — vLLM / llama.cpp / MLX / Ollama / LM Studio / SGLang / TGI Deep Dive
Author: Youngju Kim (@fjvbn20031)
Prologue — "Where you run the model" matters again
In 2023, everyone used the OpenAI API. In 2024, Anthropic, Google, and Mistral joined in, and the question was "which API." From 2025 onward, the question shifted back to "where do you run it." Model sizes shrank to usable tiers (8B, 30B, 70B). Quantization got good enough that Q4_K_M is genuinely usable. Macs started shipping with 128GB or more of unified memory (M4 Max laptops at 128GB, Ultra-class desktops at 192GB and up). Cloud token prices fluctuate by the week. As a result, whether you run the same model on an API, a dedicated server, or locally now determines your cost, latency, and governance posture.
The problem is there are too many choices. vLLM, SGLang, TGI, llama.cpp, MLX, llamafile, Ollama, LM Studio, GPT4All, KTransformers, MLC LLM, Triton, TensorRT-LLM, Modular MAX — each has a different texture, suits different workloads, and overlaps with others. On the cloud side, Together, Fireworks, Groq, Cerebras, SambaNova, and Lepton (acquired by NVIDIA in 2025) all pitch "just call the API and we'll run it well."
This post draws the map as of May 2026. Who does what well, which camp suits which workload, and how this all meets the Korean and Japanese model ecosystems.
Chapter 1 · The 2026 LLM serving map — three camps
Big picture first. The 2026 LLM serving and inference market splits into roughly three camps.
Camp A · Datacenter / high-throughput serving
Production serving for enterprises, SaaS, and research labs. Tens to thousands of RPS, multiple users, GPU clusters.
- vLLM — the de facto standard for PagedAttention
- SGLang — structured generation with RadixAttention
- TGI (Text Generation Inference) — Hugging Face's ops-friendly server
- NVIDIA Triton + TensorRT-LLM — the NVIDIA full-stack
- Modular MAX — the Mojo team's new entry
- KTransformers — long-context and MoE optimized
Camp B · Local / single-user
Laptops, workstations, home servers. Sub-1 RPS, single user, privacy-first.
- llama.cpp — C++ core, GGUF format, CPU/GPU dual-wield
- MLX (Apple) — Apple Silicon only
- llamafile (Mozilla) — single executable binary
- Ollama — the easiest local LLM, wrapping llama.cpp
- LM Studio — GUI-first local
- GPT4All — community-driven desktop chat
- MLC LLM — mobile, web GPU, diverse hardware
Camp C · Cloud serving SaaS
"Pick a model, call an API" — no GPU ops burden.
- Together AI — the big sibling of OSS model hosting
- Fireworks AI — fast throughput, strong custom fine-tune support
- Groq — LPU (Language Processing Unit) custom silicon, ultra-low latency
- Cerebras — wafer-scale engine, giant single chip
- SambaNova — RDU custom silicon
- Lepton AI — acquired by NVIDIA in 2025, now integrated into NVIDIA's cloud stack
Each camp was built under different assumptions. Datacenter: "GPUs are expensive, squeeze utilization." Local: "make it run on my laptop." Cloud SaaS: "forget GPU ops, just watch the token unit price." So even when they look like they're solving the same problem, who solves it well differs.
Chapter 2 · vLLM — the PagedAttention standard
vLLM started in 2023 at UC Berkeley (Sky Computing Lab) as an OSS inference server. It became the vLLM Project as an independent OSS organization in 2024, and in 2025 joined the PyTorch Foundation under the Linux Foundation. By 2026 it is the de facto standard for OSS LLM serving.
The core invention is PagedAttention. By managing the KV cache in fixed-size blocks, much like an OS manages memory pages through a page table, vLLM nearly eliminates memory fragmentation and lets many concurrent requests' KV caches share the same GPU efficiently. The vLLM paper reports 2–4x higher throughput than earlier serving systems at comparable latency.
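To make the page-table analogy concrete, here is a minimal bookkeeping sketch. It is not vLLM's code: the class names, the block size of 16 tokens, and the pool size are all assumptions chosen for illustration, and no real tensors are involved.

```python
# Toy illustration of paged KV-cache bookkeeping (not vLLM's actual code).
# BLOCK_SIZE and every class/method name here are hypothetical.
BLOCK_SIZE = 16  # tokens per KV block (assumed value)


class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared pool."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("KV cache pool exhausted")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


class Sequence:
    """Maps a request's logical token positions to physical blocks."""

    def __init__(self, allocator: BlockAllocator) -> None:
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def free(self) -> None:
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table.clear()
        self.num_tokens = 0


pool = BlockAllocator(num_blocks=8)
a, b = Sequence(pool), Sequence(pool)
for _ in range(20):  # 20 tokens need 2 blocks
    a.append_token()
for _ in range(5):   # 5 tokens need 1 block
    b.append_token()
print(a.block_table, b.block_table, len(pool.free))
```

Because each request only holds as many blocks as its current length requires, at most one partially filled block per request is wasted, which is the property that lets many requests share one pool.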
Where vLLM stands in 2026:
- continuous batching — requests at different token positions ride in the same batch.
- prefix caching — repeated system prompts and few-shots reuse KV.
- speculative decoding — a draft model predicts tokens, the main model verifies.
- chunked prefill — long prefills split into chunks and overlap with decode.
- disaggregated serving — prefill and decode run on separate GPUs (P/D split).
- multi-modal — vision-language models are first-class citizens.
- tool use and structured output — function calling and JSON schema enforcement.
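The first item in this list, continuous batching, can be illustrated with a toy step counter. This is a sketch under simplifying assumptions (every request needs exactly one decode step per token, prefill is ignored, and the function names are invented here), not a model of vLLM's real scheduler.

```python
from collections import deque


def continuous_batching_steps(lengths, max_batch):
    """Count decode steps when freed slots are refilled before every step."""
    waiting = deque(lengths)  # tokens still to generate, per request
    running = []
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before this step.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step: every running request emits one token,
        # and requests that are finished drop out of the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


def static_batching_steps(lengths, max_batch):
    """Count decode steps when each batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])
    return steps


lengths = [8, 2, 2, 2]  # hypothetical output lengths in tokens
print(static_batching_steps(lengths, max_batch=2))
print(continuous_batching_steps(lengths, max_batch=2))
```

In this example the static schedule takes 10 steps while the continuous one takes 8, because the short requests reuse the slot next to the long one instead of waiting for it to finish.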
A typical vLLM serve:
pip install vllm
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--enable-prefix-caching \
--enable-chunked-prefill
This launches an OpenAI-compatible HTTP server on port 8000, and clients use the OpenAI SDK as-is.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
model="meta-llama/Llama-3.3-70B-Instruct",
messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
When vLLM: multi-user production serving of OSS models on NVIDIA GPUs (or AMD ROCm, Intel Gaudi, or TPU accelerators), with throughput and cost efficiency as priorities. For a single-user laptop, it is overkill.
Chapter 3 · llama.cpp — gguf, CPU/GPU dual-wield
llama.cpp is the C/C++ inference engine Georgi Gerganov started in 2023. Built on the ambition of "run an LLM with one binary and no external dependencies," by 2026 it is the base layer of local and embedded LLM inference. Almost every user-friendly local tool — Ollama, LM Studio, llamafile, GPT4All — uses llama.cpp internally.
Core texture:
- CPU inference is genuinely usable — AVX/AVX2/AVX-512 and ARM NEON optimization, plus Metal and Accelerate on M-series.
- Rich GPU backends — CUDA, Metal, Vulkan, SYCL, OpenCL, ROCm, Kompute. Vulkan support means it runs on practically any modern GPU.
- GGUF format — many quantization tiers like Q4_K_M, Q5_K_M, Q8_0, IQ3_XXS. One file holds weights, metadata, and tokenizer.
- Memory-mapped loading — models load via mmap for fast cold start, and the same model can be shared across processes.
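As a small illustration of the "one file holds everything" point, the sketch below parses the fixed-size header at the start of a GGUF file. The layout used here (4-byte magic, then a little-endian uint32 version, uint64 tensor count, and uint64 metadata key-value count) is my understanding of GGUF v2 and v3 and should be checked against the format spec; the function name and the sample values are made up.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header (layout assumed for GGUF v2 and v3)."""
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count.
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {"version": version, "tensors": tensor_count, "metadata_kv": kv_count}


# Synthetic header for demonstration; for a real file you would pass
# the first 24 bytes, e.g. open(path, "rb").read(24).
fake = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
print(read_gguf_header(fake))
```

The metadata key-value section and tensor descriptors follow this header in the same file, which is why a single GGUF download is enough to run a model.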
You can spin up an OpenAI-compatible server with llama-server.
./llama-server \
--model models/Llama-3.3-70B-Q4_K_M.gguf \
--n-gpu-layers 99 \
--ctx-size 8192 \
--host 0.0.0.0 --port 8080
Strengths and weaknesses in 2026:
- Strengths: runs anywhere, the GGUF quantization library is enormous, near-zero dependencies (just C++), works on mobile and embedded, the community ships quantization variants and perf patches fast.
- Weaknesses: multi-user throughput lags vLLM and SGLang (request batching arrived later and is simpler), and MoE and long-context workloads tend to run faster on KTransformers and vLLM.
When llama.cpp: personal or small-user local, embedded/edge, leveraging GGUF quantization, "I want it to end in one binary."
Chapter 4 · MLX (Apple) — Apple Silicon inference
MLX is the OSS array and ML framework Apple ML Research published in December 2023. It has an API similar to NumPy plus PyTorch, and treats Apple Silicon's unified memory as a first-class citizen. On M1/M2/M3/M4 chips, CPU and GPU share the same memory, and MLX flows computation across it without data copies.
MLX in 2026:
- mlx-lm — pull a Hugging Face Hub model and run inference or fine-tune in one line.
- MoE and long-context acceleration — Mistral, Qwen, and Llama-family MoE models run fast.
- Quantization — 4-bit, 8-bit, 6-bit, grouped quantization built in.
- Distributed — chain several Macs to infer big models (MLX distributed). Pair two Ultra-class machines with 192GB each and a 4-bit 405B model fits across them.
- mlx-vlm — vision-language model support.
Typical MLX use:
pip install mlx mlx-lm
mlx_lm.generate \
--model mlx-community/Llama-3.3-70B-Instruct-4bit \
--prompt "Why run LLMs on Apple Silicon" \
--max-tokens 200
Directly from Python:
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Llama-3.3-70B-Instruct-4bit")
text = generate(model, tokenizer, prompt="Hello", max_tokens=200)
print(text)
Why MLX matters more and more:
- Macs with 128GB to 192GB or more of unified memory turned laptops and desktops into machines that comfortably fit quantized 70B-class models, with the largest configurations (or a pair of machines) reaching toward 405B. Models that don't fit in a single H100 80GB do fit in unified memory.
- Prototype on a developer laptop, move to a server — building with MLX and then moving the same model (as GGUF or safetensors) to a vLLM or llama.cpp server has become a common workflow.
- Fine-tuning works — LoRA/QLoRA runs on Apple Silicon. Real fine-tuning without CUDA, actually working.
Weakness: Apple Silicon only. Multi-user serving is not as polished as vLLM.
When MLX: local LLM development, fine-tuning, prototyping on Mac, workloads tied to M-series laptops or desktops.
Chapter 5 · llamafile (Mozilla) — single binary
llamafile is the project Mozilla (Mozilla Innovation Group) published in November 2023. The core idea: bundle model weights + inference engine + tokenizer into a single executable that runs with a double-click across the major desktop OSes on both x86-64 and ARM64.
The technical trick is Actually Portable Executable (APE), a format authored by Justine Tunney (same person behind Cosmopolitan libc). One file runs as ELF on Linux, Mach-O on macOS, and PE on Windows simultaneously — and FreeBSD/OpenBSD/NetBSD on top. The llama.cpp inference engine and GGUF weights are bundled in.
Typical use:
curl -L -o llava-v1.5.llamafile \
https://huggingface.co/Mozilla/llava-v1.5-7B-llamafile/resolve/main/llava-v1.5-7b-q4.llamafile
chmod +x llava-v1.5.llamafile
./llava-v1.5.llamafile
This boots an OpenAI-compatible HTTP server, and the same file can be copied to Mac/Linux/Windows and run as-is. In 2026, Mozilla continues publishing llamafile variants of major OSS models (LLaMA, Mistral, Phi, Gemma, Qwen) on Hugging Face.
Why llamafile matters:
- Deployment friction goes to near zero — genuinely good for non-technical users, classrooms, offline environments, and "download once, disconnect, use forever."
- Archival meaning — the same file will likely run five years from now (APE format plus self-contained weights).
- CPU inference is also optimized — Justine Tunney hand-wrote SIMD matmul kernels, and the same model often runs faster on CPU than vanilla llama.cpp.
Weakness: single-file size limits (Windows refuses executables over 4GB, so larger models ship their weights as a separate GGUF file that the llamafile binary loads); production serving isn't its texture.
When llamafile: handing an LLM to non-technical users, offline distribution, classrooms and workshops, archival.
Chapter 6 · Ollama — the easiest local LLM
Ollama is a local LLM runtime that started in June 2023. Internally it uses llama.cpp, but layers on Docker-clean UX — pull, run, push for models, and a Modelfile for customization — and in 2026 has become the default entry point for local LLMs.
# Install (Mac/Linux/Windows)
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model and chat immediately
ollama run llama3.3
# Backend server mode
ollama serve
# Call the API from another terminal
curl http://localhost:11434/api/generate -d '{
"model": "llama3.3",
"prompt": "Hello",
"stream": false
}'
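The same call can be made from Python with only the standard library. This is a sketch rather than an official client: the helper names are invented here, it assumes the default local address shown in the curl example, and it assumes the non-streaming reply carries the generated text in a "response" field.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local address


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming /api/generate request."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the request; requires `ollama serve` to be running locally."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]


request = build_generate_request("llama3.3", "Hello")
print(request.get_method(), request.full_url)
```

Calling `generate("llama3.3", "Hello")` only works while the Ollama server is up and the model has been pulled; building the request separately keeps the payload easy to inspect and test offline.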
Where Ollama stands in 2026:
- OpenAI-compatible endpoint — /v1/chat/completions is also exposed so existing client code drops in.
- Modelfile — a Dockerfile-like format to bundle system prompt, temperature, and LoRA into "my model."
- Direct GGUF import — bring any GGUF from Hugging Face.
- Multi-modal — vision models like LLaVA, Moondream, and Llama 3.2 Vision are supported.
- Auto GPU detection — Metal, CUDA, ROCm picked up automatically.
- Ollama Cloud — started in 2025, hosted side for when local needs to scale up.
The draw is simplicity. "Install on Mac and chat with a quantized 70B in five minutes" genuinely works. As a result, almost every OSS LLM tool — LangChain, LlamaIndex, n8n, VS Code extensions — supports Ollama out of the box.
Weakness: multi-user throughput and advanced batching trail vLLM/SGLang. The Modelfile DSL is thin. The ecosystem gravitates toward Ollama's own model library, though arbitrary GGUF import is possible.
When Ollama: personal laptop, home server, "start in five minutes," instant connection to OSS tools.
Chapter 7 · LM Studio / GPT4All — GUI options
For users who prefer GUI to CLI.
LM Studio
A desktop app (Mac/Windows/Linux) that started in 2023. Model search, download, chat, and server mode all in one .app. You can search Hugging Face Hub for GGUF models, filter and download them, and either chat in the UI or launch an OpenAI-compatible local server.
LM Studio in 2026:
- llama.cpp plus MLX backends — MLX on Apple Silicon, llama.cpp elsewhere.
- Compatibility indicators — model cards show whether the model fits in your RAM/VRAM.
- Local server — one toggle to OpenAI-compatible server.
- Multi-model loading — load several models at once.
- Structured output and tool use — JSON schema and function calling.
After its license change made commercial use free (personal and corporate alike), it's often used for in-house demos and POCs.
GPT4All
An OSS project Nomic AI started in early 2023. Desktop app + Python SDK + model library. With "an LLM that runs on any laptop" as its motto, it's strong on CPU inference and small models (3B, 7B, 13B). Features like LocalDocs (local document retrieval) put RAG directly into the desktop app.
GPT4All in 2026:
- The desktop chat UX is clean, equally polished across Windows, Mac, and Linux.
- A paid GPT4All Enterprise edition bundles in-company deployment tooling.
- Tied to the Nomic Atlas embedding and visualization stack.
LM Studio vs GPT4All in one line: LM Studio is "GUI strong on exploring Hugging Face models," GPT4All is "GUI strong on desktop chat plus LocalDocs RAG."
When GUI: non-technical users, demos, education, internal tools, "I want to finish everything with a mouse."
Chapter 8 · SGLang — structured generation
SGLang is an inference system from a joint LMSYS, UCB, and Stanford collaboration, published in early 2024. It belongs to the same "high-throughput serving" camp as vLLM, but enters with two distinguishing cards — structured generation and RadixAttention.
RadixAttention
If PagedAttention solved "memory fragmentation," RadixAttention shares prefixes across requests in a radix tree. With 1000 requests sharing the same system prompt, the KV cache for that prefix is computed once and shared automatically. The effect is large on agentic workloads (repeated tool specs and few-shots).
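The prefix-sharing effect can be sketched with a plain token trie. This is an illustration only: a real radix tree compresses single-child chains and SGLang also handles eviction, neither of which is modeled, and the class and function names are hypothetical.

```python
class PrefixCache:
    """Token-level trie; a real radix tree would also compress single-child chains."""

    def __init__(self) -> None:
        self.root = {}

    def match(self, tokens) -> int:
        """Return how many leading tokens already have cached KV entries."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens) -> None:
        """Record that KV entries for this token sequence are now cached."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


def prefill_cost(cache: PrefixCache, requests) -> int:
    """Total tokens that must actually be prefilled across all requests."""
    total = 0
    for tokens in requests:
        total += len(tokens) - cache.match(tokens)
        cache.insert(tokens)
    return total


system = list(range(100))  # shared 100-token system prompt
requests = [system + [1000 + i] for i in range(10)]
print(prefill_cost(PrefixCache(), requests))  # tokens prefilled with sharing
print(sum(len(r) for r in requests))          # tokens prefilled without sharing
```

With ten requests that differ only in their last token, the shared prompt is prefilled once, so 110 tokens are computed instead of 1010.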
Frontend DSL — structured generation
SGLang writes LLM calls in a Python DSL where branching, looping, parallelism, and constraints are first-class.
import sglang as sgl

# Assumes an SGLang server is already running at this address.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def multi_turn_question(s, question):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))
    s += sgl.user("Was that answer correct?")
    # choices=... restricts the generation to one of the listed strings.
    s += sgl.assistant(sgl.gen("verification", choices=["yes", "no"]))

state = multi_turn_question.run(question="What is the capital of France?")
print(state["answer"], state["verification"])
sgl.gen(..., choices=...) and JSON-schema constraints are enforced by constrained decoding in the backend. That is, the model can't emit anything other than "yes"/"no."
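The core of constrained decoding is masking: tokens outside the allowed set are given a score of negative infinity before the next token is picked. The sketch below shows that idea for single-token choices with made-up scores. It is not SGLang's implementation, which has to handle multi-token choices and full grammars.

```python
import math


def constrained_argmax(logits: dict, allowed: set) -> str:
    """Pick the highest-scoring token among `allowed`, ignoring all others."""
    masked = {
        tok: (score if tok in allowed else -math.inf)
        for tok, score in logits.items()
    }
    best = max(masked, key=masked.get)
    if masked[best] == -math.inf:
        raise ValueError("no allowed token present in the vocabulary")
    return best


# Hypothetical next-token scores: the unconstrained favourite is "maybe".
logits = {"maybe": 3.1, "yes": 2.4, "no": 1.9, "perhaps": 0.5}
print(max(logits, key=logits.get))                # unconstrained pick
print(constrained_argmax(logits, {"yes", "no"}))  # constrained pick
```

Even though "maybe" has the highest raw score, the constrained pick is "yes", because only "yes" and "no" survive the mask.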
SGLang strengths in 2026:
- Throughput advantage over vLLM on agentic and tool-calling workloads is often observed (benchmarks are workload dependent).
- MoE optimization for DeepSeek, Qwen, Llama lands quickly.
- OpenAI-compatible server is provided too, keeping existing clients happy.
- Disaggregated serving and speculative decoding landed recently.
When SGLang: structured output and tool use heavy workloads, agentic serving with frequent prefix sharing, evaluating alongside vLLM with your own workload as the benchmark.
Chapter 9 · TGI (Hugging Face) — ops-friendly
TGI (Text Generation Inference) is the inference server Hugging Face built. Started in late 2022, by 2026 it draws less academic spotlight than vLLM/SGLang but has settled in as the ops-friendly choice.
- Wired directly to Hugging Face Hub — a --model-id one-liner pulls and serves a Hub model.
- OpenAI compatibility plus Messages API — standard interface.
- Continuous batching, flash attention, paged KV — the core accelerations are all there.
- safetensors weights plus BitsAndBytes, AWQ, GPTQ, EETQ, and FP8 — broad quantization coverage.
- Backend for Hugging Face Inference Endpoints — self-host the same thing the hosted service runs.
- Prometheus metrics (ready for Grafana dashboards) and OpenTelemetry traces built in.
docker run --gpus all -p 8080:80 \
-v $PWD/models:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-3.3-70B-Instruct \
--quantize bitsandbytes-nf4
The license moved to HFOIL in 2023 before returning to Apache 2.0 in 2024. As of 2026 it's free OSS again.
When TGI: teams already heavy on Hugging Face Hub, parity with Hugging Face Inference Endpoints, preference for the "ops metrics and standard docker container" texture.
vLLM/SGLang vs TGI in one line: academic accel innovation lands first on vLLM/SGLang; TGI's strength is "operationalization."
Chapter 10 · KTransformers (Tsinghua) — long-context optimized
KTransformers is an inference framework published in 2024 by the MADSys lab at Tsinghua University. Its focus is clear — run a giant MoE model on one consumer GPU.
Key tricks:
- Expert offloading — keep MoE expert weights on CPU RAM, pull them to GPU only when activated.
- CPU SIMD acceleration (AMX, AVX-512) — run sparse activations efficiently on CPU.
- Long-context optimization — combined chunked prefill and local-global attention.
- DeepSeek and Qwen MoE specialization — demos running 671B DeepSeek-V3/R1-class models on a 24GB consumer GPU with large RAM (192GB to 512GB) made waves.
# DeepSeek-V3 on RTX 4090 plus large system RAM
git clone https://github.com/kvcache-ai/ktransformers
cd ktransformers
pip install -e .
python -m ktransformers.local_chat \
--model_path /path/to/DeepSeek-V3-GGUF \
--gguf_path /path/to/DeepSeek-V3-GGUF \
--cpu_infer 32
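The expert-offloading idea from the list above can be sketched as a small least-recently-used cache: experts live in CPU RAM and only a few occupy GPU memory at a time. This is a toy model built on that description, not KTransformers code. The class name, the capacity, and the routing sequence are assumptions, and strings stand in for weight tensors.

```python
from collections import OrderedDict


class ExpertCache:
    """Keeps at most `capacity` experts 'on the GPU', evicting the least recently used."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.on_gpu = OrderedDict()  # expert id -> stand-in for its weights
        self.transfers = 0           # number of simulated CPU -> GPU copies

    def fetch(self, expert_id: int) -> str:
        if expert_id in self.on_gpu:
            self.on_gpu.move_to_end(expert_id)  # mark as recently used
            return self.on_gpu[expert_id]
        if len(self.on_gpu) >= self.capacity:
            self.on_gpu.popitem(last=False)     # evict least recently used
        self.transfers += 1
        self.on_gpu[expert_id] = f"weights-of-expert-{expert_id}"
        return self.on_gpu[expert_id]


cache = ExpertCache(capacity=2)
for expert_id in [0, 1, 0, 2, 0, 1]:  # experts chosen by a hypothetical router
    cache.fetch(expert_id)
print(cache.transfers, list(cache.on_gpu))
```

Six expert activations cost only four transfers here because expert 0 keeps getting reused. How well this works in practice depends on how skewed the router's choices are, which is workload dependent.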
Why KTransformers matters:
- With giant OSS MoE models in 2025 (DeepSeek-V3/R1) and 2026 (Qwen3-MoE), the real question became how do individuals run them. KTransformers is one answer.
- The single GPU plus large RAM combo runs giant models without an H100 cluster.
Weakness: multi-user serving and throughput are still better on vLLM. There's an install and tuning learning curve.
When KTransformers: you want to run a giant MoE model as a single user without access to a GPU cluster; long-context workloads.
Chapter 11 · NVIDIA Triton + TensorRT-LLM — the datacenter standard
NVIDIA-camp full-stack. The two are often used as a pair.
TensorRT-LLM
NVIDIA's OSS LLM inference library. It compiles PyTorch models into TensorRT engines (NVIDIA-only IR) for acceleration in NVIDIA-pushed precisions like FP16, INT8, INT4, FP8, and NVFP4. Its strength is quickly exploiting new instructions on Hopper (H100/H200) and Blackwell (B100/B200, GB200).
- In-flight batching, paged KV, speculative decoding, chunked prefill — every accel vLLM has.
- FP8/FP4 acceleration — large throughput gains on Hopper/Blackwell.
- Multi-GPU and multi-node — full support for Tensor, Pipeline, and Expert parallelism.
Triton Inference Server
NVIDIA's general inference server. Beyond LLMs, it serves vision, tabular, and custom models from multiple frameworks in one server. For LLMs, you can use the TensorRT-LLM backend, or directly run TensorRT-LLM's OpenAI-compatible server (trtllm-serve).
# Build the model into a TensorRT-LLM engine
trtllm-build --checkpoint_dir ./llama3-70b \
--output_dir ./engines/llama3-70b \
--gemm_plugin auto \
--max_batch_size 32
# Serve with trtllm-serve (OpenAI compatible)
trtllm-serve ./engines/llama3-70b --port 8000
vLLM vs TensorRT-LLM in one line: vLLM is OSS, spans many GPUs (NVIDIA, AMD, Intel, TPU), and is easier to approach. TensorRT-LLM is NVIDIA-only but squeezes the last drop out of NVIDIA GPUs (especially FP8/FP4). Many production teams keep both and benchmark with their own workload.
When TensorRT-LLM/Triton: NVIDIA Hopper/Blackwell GPU clusters, absolute throughput and latency in production, an ops team familiar with the NVIDIA ecosystem.
Chapter 12 · Modular MAX (Mojo team) — the new entry
Modular is the company Chris Lattner (LLVM, Swift, Clang) founded in 2022. They build Mojo (a high-performance systems language aiming for Python compatibility) and MAX (Modular Accelerated Xecution, an AI inference platform). The OSS transition for MAX started in 2024, and by 2026 the pitch is one graph compiler for inference across NVIDIA, AMD, Intel, and Apple.
Core idea:
- Mojo-written kernels — write like Python with C/CUDA-level performance. Core kernels like matmul and attention written in Mojo compile to multiple accelerators.
- MAX Engine — graph compiler. Accepts PyTorch/ONNX models and optimizes for NVIDIA, AMD, or Apple GPUs.
- MAX Serve — OpenAI-compatible LLM server.
- Break hardware lock-in — "fast on NVIDIA GPUs without CUDA" as the slogan.
pip install modular
max serve --model-path meta-llama/Llama-3.3-70B-Instruct
Modular in 2026:
- Throughput competitive with vLLM on both NVIDIA and AMD GPUs is starting to land on certain workloads.
- Adopted as backend by some clouds (Lambda, formerly Lepton).
- Academic and OSS community adoption is still smaller than vLLM/SGLang, but "Mojo + MAX" is recognized as a next-gen camp.
Weakness: Mojo's OSS transition is phased, so the community is still small. Model coverage and feature breadth are wider on vLLM.
When MAX: you want a faster path than vLLM's ROCm backend on AMD GPUs; you want to be an early adopter of the Mojo ecosystem; you want to avoid NVIDIA-only lock-in.
Chapter 13 · Quantization — GGUF / AWQ / GPTQ / FP8
You can't pick a serving framework without also picking a quantization format. Five mainstream choices in 2026.
GGUF (llama.cpp family)
- The unified format llama.cpp created. Weights + metadata + tokenizer in one file.
- Rich quantization tiers — Q2_K, Q3_K_S/M/L, Q4_K_S/M, Q5_K_S/M, Q6_K, Q8_0, IQ2_XXS, IQ3_XXS, IQ4_XS.
- Practical recommendation: for 7B to 70B, Q4_K_M is widely considered the quality/size sweet spot. Go smaller with IQ3_XXS (2–3-bit imatrix), larger with Q5_K_M or Q6_K.
- Where it runs well: llama.cpp, Ollama, LM Studio, llamafile, KTransformers.
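To show what "4-bit" means mechanically, here is a minimal round-to-nearest quantizer for one group of weights. It is deliberately simpler than GGUF's K-quants or the AWQ and GPTQ methods described in this chapter, which choose scales more carefully; the function names and the sample weights are invented for illustration.

```python
def quantize_group(values, bits: int = 4):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    levels = (1 << bits) - 1           # 15 steps for 4-bit
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant groups
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale: float, lo: float):
    """Reconstruct approximate weights from integer codes."""
    return [lo + c * scale for c in codes]


weights = [-0.42, -0.10, 0.03, 0.07, 0.18, 0.25, 0.31, 0.60]
codes, scale, lo = quantize_group(weights)
restored = dequantize_group(codes, scale, lo)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, round(max_error, 4))
```

Each weight is stored as an integer from 0 to 15 plus a shared scale and offset per group, and the reconstruction error is bounded by half a quantization step. The named formats differ mainly in how they pick groups, scales, and which weights to protect.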
AWQ (Activation-aware Weight Quantization)
- 2023 MIT Han Lab paper. Preserves important weights by looking at activation distributions, then quantizes to 4-bit.
- Quality preservation is well-regarded, and it's optimized for GPU inference.
- Where: vLLM, SGLang, TGI, TensorRT-LLM all first-class support AWQ.
GPTQ
- 2022 paper from IST Austria and ETH Zürich. Layer-wise 4-bit quantization building on OBS (Optimal Brain Surgeon).
- Predates AWQ and was the standard for a while. In 2026 it's lost some share to AWQ but is still widely used.
- Where: vLLM, SGLang, TGI, AutoGPTQ.
FP8
- 8-bit floating point. Landed as hardware on Hopper (H100) and Ada (L40S) first, and Blackwell brought FP4 too.
- Called "quantization" but maintains numeric precision closer to training. Large throughput gains in big production setups.
- Where: TensorRT-LLM, vLLM, SGLang FP8 paths.
NVFP4 / MXFP4
- 4-bit floating point standardized by NVIDIA and OCP (Open Compute) in 2024–2025. Tied to new Blackwell GPU instructions.
- In 2026 some production models are starting to move to NVFP4/MXFP4 serving.
One-line recommendations:
- Laptop/local — GGUF Q4_K_M
- OSS 70B on vLLM/SGLang — AWQ 4-bit (or GPTQ)
- H100 production — FP8
- B100/B200 — NVFP4 / FP8 mix
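A quick back-of-the-envelope check helps when matching a format to hardware. The sketch below estimates weight memory only, using nominal bit widths. Real quantized files are somewhat larger because they also store per-group scales, and the KV cache and activations come on top. The function name is an assumption.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes); excludes KV cache and activations."""
    return params_billion * bits_per_weight / 8


# Nominal bit widths only; effective bits per weight for quantized formats
# are higher once scales and other metadata are included.
for name, bits in [("FP16", 16), ("FP8", 8), ("4-bit (nominal)", 4)]:
    print(f"70B @ {name}: ~{weight_memory_gb(70, bits):.0f} GB")
```

By this estimate a 70B model needs roughly 140 GB at FP16, 70 GB at FP8, and 35 GB at a nominal 4 bits, which is why 4-bit variants fit on a single 80GB GPU or a high-memory laptop while FP16 does not.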
Chapter 14 · Cloud serving SaaS — Together / Fireworks / Groq / Cerebras / Lepton
The camp for teams who don't want to operate GPUs themselves. The texture is "call OSS models with one API."
Together AI
The big sibling of OSS model hosting. 200+ models like Llama, Mistral, Qwen, DeepSeek via OpenAI-compatible APIs. Fine-tune, dedicated endpoints, dedicated deployments. Academic collaboration (joint training of models like RedPajama and StripedHyena) is another strength.
Fireworks AI
Well regarded for fast throughput and for high-quality function calling and structured output. Strong self-hosted training and fine-tune tooling. They also sell an inference stack optimized for Mixture-of-Experts and long-context.
Groq
The LPU (Language Processing Unit) custom chip. Massive SRAM and deterministic execution serve the same model at very low latency and high per-request decode speed (hundreds of tokens per second). The focus is on OSS models like Llama and Mixtral; model breadth is limited, but on speed they're almost always a top contender.
Cerebras
The wafer-scale engine — one wafer is one chip (WSE-3 has ~4 trillion transistors). A giant single chip means less model partitioning and communication overhead than an H100 cluster. They advertise fast inference throughput in their cloud, and some systems are used for huge training jobs by governments and research labs.
SambaNova
The custom RDU (Reconfigurable Dataflow Unit) chip. Dataflow architecture optimized for dense and sparse LLM workloads. Enterprise and government markets primarily.
Lepton AI (NVIDIA acquisition, 2025)
A cloud inference platform founded by Jia Yangqing (original Caffe author, Meta, Alibaba) and others. NVIDIA acquired it in 2025, and it now sits as the inference layer of NVIDIA's cloud ecosystem. Integrated with NVIDIA DGX Cloud and NIM (NVIDIA Inference Microservices), it's settling in as "OSS model inference run by NVIDIA directly."
One-line comparison
- Fastest (low latency) — Groq (or Cerebras)
- Broadest OSS model coverage — Together
- Fine-tune and structured output quality — Fireworks
- Giant model single chip — Cerebras and SambaNova (enterprise)
- NVIDIA ecosystem integration — Lepton (NVIDIA) plus NIM
Chapter 15 · Korea / Japan — Upstage Solar / KT Mi:dm / Sakana / NTT Tsuzumi / ELYZA
You miss half the story if you only watch English models. The Korean and Japanese OSS and semi-OSS camps cemented their position in 2026.
Korea
- Upstage Solar — the Solar series from Upstage (Solar 10.7B, Pro models, and so on). Korean/English bilingual performance at small sizes is well-regarded, with active MoE and long-context variants. Strong enterprise channels via Hugging Face partnership and AWS Marketplace presence.
- KT Mi:dm (믿:음) — KT's large Korean-language model. Published and expanded from 2024, by 2026 it has settled into a multi-billion-parameter Korean-specialized lineup. On-prem adoption is growing in regulated industries like government, finance, and telecom.
- NAVER HyperCLOVA X — direct OSS is limited, but bundled with NAVER Cloud Platform it's a major axis of the Korean enterprise market. Multi-modal and smaller HCX-DASH variants are active.
- LG ExaOne — the LG AI Research ExaOne lineup. Some variants are published on Hugging Face as OSS.
These models often run on vLLM, SGLang, llama.cpp forks and branches with patches for Korean tokenizers and specific kernels. OSS compatibility varies per model, as do licenses (research-only, restricted commercial, fully OSS) — license review before adoption is mandatory.
Japan
- Sakana AI — founded by David Ha, Llion Jones (Transformer co-author), and others. Known for "evolutionary model merging," they publish Japanese-specialized small models and vision models as OSS. With 2025 NVIDIA and government support, they're the symbolic company of Japan's AI line.
- NTT Tsuzumi (つづみ) — NTT's Japanese-specialized LLM. Adopted in government and telecom markets where on-prem and domestic data residency governance demands are high.
- ELYZA — startup originating from the University of Tokyo. Continues to publish OSS series like ELYZA-japanese-Llama by doing Japanese continued pretraining on OSS bases like Llama and Qwen. The OSS line is maintained even after becoming a KDDI subsidiary in 2024.
- Preferred Networks PLaMo — the PLaMo 100B lineup of Japan-origin OSS base models.
Operationally, Korean and Japanese models typically ship GGUF and safetensors variants that work with both local serving (llama.cpp, Ollama, vLLM) and OSS hosting (Together, Fireworks, plus local clouds). For regulated industries like government and banking, on-prem vLLM/TGI is essentially the default choice.
Chapter 16 · Who should pick what
Wrapping the 12 camps into situational recommendations.
Individual / hobbyist
- Mac laptop — MLX or LM Studio (MLX backend). Quantize with Q4_K_M or 4-bit MLX.
- Windows/Linux laptop — Ollama or LM Studio. GGUF quantization.
- Five-minute start — Ollama with the one-line install.
- GUI preferred — LM Studio.
- Handing it to a non-technical user — llamafile.
Startup (10–50 people)
- Start with API, switch to self-hosted when cost explodes — start with Together, Fireworks, OpenAI, or Anthropic. Once token cost crosses $5,000/month, compare vLLM self-hosted.
- Self-host starting point — vLLM on AWS/GCP/Azure GPU (H100, L40S). AWQ or FP8 for OSS models.
- Agentic and tool-use heavy — keep SGLang on the shortlist.
- Local prototype plus cloud production — build on Mac with MLX or Ollama, then push the same model to a vLLM server for production.
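The API-versus-self-hosted comparison in the first bullet comes down to simple arithmetic. The sketch below computes a break-even token volume. Every price in it is a placeholder rather than a quote from any provider, the function names are invented, and it ignores engineering time, idle capacity, and utilization, all of which usually push the real break-even higher.

```python
def monthly_api_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """API bill for a given monthly token volume."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def monthly_gpu_cost(num_gpus: int, usd_per_gpu_hour: float, hours: float = 730) -> float:
    """Bill for GPUs rented around the clock (730 hours is about one month)."""
    return num_gpus * usd_per_gpu_hour * hours


def break_even_tokens(num_gpus: int, usd_per_gpu_hour: float,
                      usd_per_million_tokens: float) -> float:
    """Monthly token volume above which the GPU bill is lower than the API bill."""
    gpu_bill = monthly_gpu_cost(num_gpus, usd_per_gpu_hour)
    return gpu_bill / usd_per_million_tokens * 1_000_000


# All prices below are placeholders, not quotes from any provider.
tokens = break_even_tokens(num_gpus=4, usd_per_gpu_hour=2.0, usd_per_million_tokens=0.8)
print(f"break-even at ~{tokens / 1e9:.1f}B tokens/month")
```

With these placeholder numbers, four GPUs cost $5,840 a month and the break-even sits at about 7.3B tokens a month. The calculation only holds if the self-hosted setup can actually serve that volume.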
Enterprise (regulated industries, finance, government)
- On-prem serving — vLLM or TGI. NVIDIA Triton plus TensorRT-LLM if you're on the NVIDIA line.
- Korean-specialized — Upstage Solar / KT Mi:dm / HyperCLOVA X on vLLM.
- Japanese-specialized — ELYZA / Sakana / Tsuzumi on vLLM or NTT's internal stack.
- Ops metrics and standardization — TGI, or Triton if you're already on NVIDIA.
- Model diversity (LLM + vision + tabular) — Triton (multi-framework).
Datacenter / hyperscale
- NVIDIA H100/B100 full stack — TensorRT-LLM plus Triton (or trtllm-serve). FP8/NVFP4.
- OSS first, multiple accelerators — vLLM (supports NVIDIA, AMD, Intel, TPU simultaneously).
- Structured / agentic — SGLang.
- Single-user giant MoE — KTransformers.
Avoiding hardware lock-in
- NVIDIA, AMD, Apple together — vLLM (broad coverage) or Modular MAX (compiler).
- Apple only — MLX.
- CPU-first / embedded — llama.cpp.
Epilogue — Models became commoditized; infra is the differentiator
In 2023, "which model" was the question. In 2026, "where and how do you run it" is the question. Models became plentiful (OSS 70B, even giant MoE), prices change by the week, quantization quality is markedly better than a year ago, and a laptop's unified memory now holds models that used to need a GPU server.
The conclusion of this post is simple: benchmark against your own workload. Throughput, latency, cost, and operational complexity all come out different per workload, so measurement always beats recommendation. And the real message of this post might be that nearly every self-hosted engine you need for that measurement is OSS in 2026, with LM Studio's closed-source desktop app as the main exception.
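A benchmark does not need to be elaborate to be useful. The sketch below summarizes per-request timings into the numbers that usually drive the decision. The `RequestTrace` fields and the sample values are assumptions, and collecting the timestamps (by timing calls against each candidate server with your own prompts) is left out.

```python
import statistics
from dataclasses import dataclass


@dataclass
class RequestTrace:
    start: float        # seconds, when the request was sent
    first_token: float  # seconds, when the first token arrived
    end: float          # seconds, when the last token arrived
    output_tokens: int


def summarize(traces) -> dict:
    """Aggregate time-to-first-token, end-to-end latency, and throughput."""
    ttft = sorted(t.first_token - t.start for t in traces)
    latency = sorted(t.end - t.start for t in traces)
    wall = max(t.end for t in traces) - min(t.start for t in traces)
    return {
        "ttft_median_s": statistics.median(ttft),
        "latency_median_s": statistics.median(latency),
        "latency_max_s": latency[-1],
        "tokens_per_s": sum(t.output_tokens for t in traces) / wall,
    }


# Hand-written traces for illustration only.
traces = [
    RequestTrace(0.0, 0.2, 2.0, 100),
    RequestTrace(0.5, 0.9, 3.0, 120),
    RequestTrace(1.0, 1.3, 4.0, 180),
]
print(summarize(traces))
```

Running the same prompts and concurrency against two or three candidate stacks and comparing these four numbers is often enough to rule options in or out.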
What we have to do now isn't make GPUs run faster — it's decide which camp fits our problem within a week.
References
- vLLM — github.com/vllm-project/vllm / docs.vllm.ai
- vLLM PagedAttention paper — "Efficient Memory Management for Large Language Model Serving with PagedAttention"
- llama.cpp — github.com/ggerganov/llama.cpp
- MLX (Apple) — github.com/ml-explore/mlx / ml-explore.github.io/mlx
- mlx-lm — github.com/ml-explore/mlx-lm
- llamafile (Mozilla) — github.com/Mozilla-Ocho/llamafile
- Justine Tunney "LLaMA Now Goes Faster on CPUs" — justine.lol/matmul
- Ollama — ollama.com / github.com/ollama/ollama
- LM Studio — lmstudio.ai
- GPT4All (Nomic AI) — gpt4all.io / github.com/nomic-ai/gpt4all
- SGLang — github.com/sgl-project/sglang / lmsys.org/blog/2024-01-17-sglang
- TGI (Hugging Face) — github.com/huggingface/text-generation-inference
- KTransformers (Tsinghua) — github.com/kvcache-ai/ktransformers
- MLC LLM — github.com/mlc-ai/mlc-llm / llm.mlc.ai
- NVIDIA TensorRT-LLM — github.com/NVIDIA/TensorRT-LLM
- NVIDIA Triton Inference Server — github.com/triton-inference-server/server
- Modular MAX — docs.modular.com/max / modular.com
- AWQ paper — "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration"
- GPTQ paper — "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers"
- FP8 Formats for Deep Learning — arxiv.org/abs/2209.05433
- Together AI — together.ai
- Fireworks AI — fireworks.ai
- Groq — groq.com
- Cerebras — cerebras.ai
- SambaNova — sambanova.ai
- Lepton AI (NVIDIA acquisition, 2025) — news.crunchbase.com Lepton coverage / lepton.ai
- NVIDIA NIM — nvidia.com/en-us/ai/
- Upstage Solar — upstage.ai / huggingface.co/upstage
- KT Mi:dm — kt.com AI / huggingface.co/KT-AI
- NAVER HyperCLOVA X — clova.ai / ncloud.com HyperCLOVA X
- LG ExaOne — lgresearch.ai exaone / huggingface.co/LGAI-EXAONE
- Sakana AI — sakana.ai
- NTT Tsuzumi — group.ntt tsuzumi
- ELYZA — elyza.ai / huggingface.co/elyza
- Preferred Networks PLaMo — preferred.jp / huggingface.co/pfnet