Open-Source LLMs 2026 Deep Dive - Llama 4 · DeepSeek V3 + R1 · Qwen 3 · Mistral Large 2 · Phi-4 · Gemma 3 · Falcon 3

Prologue — How the Gap of 2024 Disappeared

In spring 2024, the phrase "open-source LLM" came with a small sigh. Comparing Llama 2 70B with GPT-4 usually ended in the consolation that a 7B model could at least beat a 13B model. MMLU scores trailed by around ten points, code generation often collapsed, and Korean and Japanese ran at half capacity from day one. We reached for closed APIs while saying "someday."

In spring 2026, that gap is almost gone. Meta launched Llama 4 Scout, Maverick, and Behemoth with native multimodality and a context window of up to 10 million tokens. DeepSeek shattered the cost curve with V3 671B and R1. Alibaba planted the Apache 2.0 flag deeper with Qwen 3. Mistral split its lineup further with Large 2, Pixtral, Codestral, and Ministral. Microsoft proved with Phi-4 14B that "a small model can go all the way with synthetic data." Google's Gemma 3 redrew what is possible at the edge, with native multimodality and a 128K context on a single GPU. Falcon 3 and Falcon Mamba opened a different path with hybrid architectures. Allen AI's OLMo 2 and Tülu 3 set a new baseline for "truly open" by releasing data, code, and checkpoints alongside weights.

This post ties all those models, licenses, inference stacks, local runtimes, and national-language models into one map. Korea's HyperCLOVA X, Kanana, and EXAONE 3.5. Japan's ELYZA, PLaMo, and Sakana. China's Yi, InternLM, and MiniCPM. And the inference engines — vLLM, SGLang, llama.cpp, MLX, exllamav2, TGI. The license map — Apache 2.0, MIT, Llama Community License, Gemma Terms, Mistral Research vs Commercial — gets organized at the end.


1. The 2026 Open-Source LLM Map — Three Axes, Five Categories

If you draw the 2026 open-source LLM market as a single map, three axes appear first.

Axis 1 — License freedom. Models with full commercial and redistribution freedom under Apache 2.0 / MIT (Qwen 3, Mistral 7B / Mixtral, DeepSeek R1, OLMo 2, Phi-4). Llama Community License with the 700M MAU threshold (Llama 4, Llama 3.3). Gemma Terms with acceptable-use policy attached (Gemma 3). Mistral Research License where commercial use needs a separate purchase (Mistral Large 2). CC-BY-NC for non-commercial only (Cohere Command R+). The same word "open" takes on different colors when legal looks at it.

Axis 2 — Architecture and size. Dense transformers (Llama 3.3 70B, Mistral Large 2 123B, Qwen 2.5 72B, Phi-4 14B, Gemma 3 27B). MoE / Mixture of Experts (Llama 4 Maverick 400B / 17B active, Llama 4 Behemoth 2T, DeepSeek V3 671B / 37B active, Mixtral 8x7B / 8x22B). Hybrid (Falcon Mamba 7B with SSM). Reasoning-specific (R1, R1-Distill). A headline parameter count means different things across these: a MoE must keep all of its experts in GPU memory but computes with only the active ones, so a 17B-active MoE and a 70B dense model follow completely different GPU-memory and latency curves.
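
The dense-versus-MoE difference can be made concrete with a back-of-the-envelope sketch. It assumes weight memory scales with total parameters and per-token compute with active parameters, using the common approximation of about 2 FLOPs per active parameter per token. KV cache, activations, and runtime overhead are ignored, and the function names are illustrative only.

```python
# Rough weight-memory and per-token compute estimate for dense vs. MoE models.
# Assumption: ignores KV cache, activations, and framework overhead.

def weight_memory_gb(total_params_b: float, bits_per_weight: int) -> float:
    """Memory for weights alone: every parameter (all experts) must be resident."""
    return total_params_b * bits_per_weight / 8  # billions of params * bytes each = GB

def flops_per_token_g(active_params_b: float) -> float:
    """Approximation: about 2 FLOPs per active parameter per generated token."""
    return 2 * active_params_b  # GFLOPs

# (total parameters in billions, active parameters in billions), from the text above
models = {
    "Llama 3.3 70B (dense)": (70, 70),
    "Llama 4 Scout (MoE)": (109, 17),
    "DeepSeek V3 (MoE)": (671, 37),
}

for name, (total_b, active_b) in models.items():
    mem = weight_memory_gb(total_b, 4)
    flops = flops_per_token_g(active_b)
    print(f"{name}: {mem:.1f} GB of weights at 4-bit, about {flops:.0f} GFLOPs per token")
```

The pattern to notice: Scout needs more weight memory than the dense 70B, yet does roughly a quarter of the per-token compute.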

Axis 3 — How open the data is. Weights-only open: most of Llama, Qwen, Mistral, DeepSeek. Weights + code + data + checkpoints (fully open): Allen AI OLMo 2, Tülu 3, Together RedPajama. EleutherAI Pythia and BigScience BLOOM live on this lineage. For genuinely reproducible science, fully open is the only answer.

Five categories sit on top.

  1. General flagship — Llama 4 Maverick, DeepSeek V3, Qwen 3, Mistral Large 2
  2. Small-and-mighty — Phi-4 14B, Gemma 3 4B / 12B / 27B, Llama 3.2 1B / 3B, Ministral 3B / 8B, MiniCPM 3.0
  3. Code specialists — Qwen 2.5 Coder, Codestral 25.01, DeepSeek Coder V2, Code Llama
  4. Reasoning specialists — DeepSeek R1, R1-Distill-Qwen-32B, QwQ-32B, Marco-o1
  5. Multimodal — Llama 4 Scout / Maverick (native), Pixtral 12B / Large, Gemma 3 (vision), Qwen 2.5 VL, NVLM

The core of the map — there is no single correct model. The coordinate where license, size, domain, and infrastructure meet produces a different choice every time.


2. Meta Llama 4 — Scout, Maverick, Behemoth

Meta launched the Llama 4 family in April 2025, and the baseline for the open-source flagship was redrawn from that moment. Three models in one bundle.

Llama 4 Scout — 109B total parameters, 17B active in MoE. 16 experts. Native multimodal (text + image). A 10-million-token context is the headline weapon. Designed to fit on a single H100 80GB with INT4. Whole codebases, an entire book, multiple PDFs go into the context at once.

Llama 4 Maverick — 400B total, 17B active MoE. 128 experts. 1M context. Multimodal. Fights in the same territory as GPT-4o and Gemini 2.0 Pro on flagship reasoning, coding, and creativity. Led other open models by a double-digit ELO margin on LMSYS Chatbot Arena.

Llama 4 Behemoth — about 2 trillion (2T) total parameters, 288B active. Still in training at announcement, it serves as the giant teacher from which Maverick and Scout are distilled. Meta directly stated it "competes with GPT-4.5 and Claude 3.7 Sonnet on STEM benchmarks."

All three ship under the Llama 4 Community License. Operators with under 700 million monthly active users get commercial use; anything above needs a separate license from Meta. A separate EU-residency clause applies to EU users and companies because of data-protection rules.

Training infrastructure started with 32K H100s and Behemoth went beyond. FP8 mixed precision is in, along with a new technique (MetaP) for reliably setting hyperparameters such as per-layer learning rates and initialization scales. The dataset goes beyond 30 trillion (30T) tokens, covering 200+ languages.

The key decisions in Llama 4: adopt MoE wholesale, native multimodality, leap in context length. The previous dense 70B / 405B is dropped in favor of MoE. Image input is handled inside the same model via early fusion, not via an external encoder.


3. Llama 3.3 70B — The Last Peak of the Dense Baseline

Before Llama 4, Meta dropped Llama 3.3 70B Instruct in December 2024 — the last peak of the dense transformer. The core claim: 405B-class performance compressed into 70B.

Spec summary — 70B parameters, dense, 128K context, English-centric but eight major languages. Grouped-query attention (GQA), RoPE positional encoding, RMSNorm.

Performance coordinates — MMLU 86.0, IFEval 92.1, HumanEval 88.4, MATH 77.0. Lags 405B by under five points while running on roughly one-sixth the GPU memory. With 4-bit quantization it fits on a single H100 80GB comfortably.

Deployment-friendliness — Because it is dense, vLLM, TGI, and llama.cpp run it most stably. No MoE-router overhead so latency is consistent. For inference systems, "predictable model" carries real value.

Even in 2026, Llama 3.3 70B is the default in production in many places. Llama 4 Scout brings multimodality and long context, but for plain text tasks, low latency, and predictable cost, falling back to 3.3 70B is common. Llama 3.2's 1B / 3B cover edge and mobile, and Llama 3.2 Vision 11B / 90B serve as multimodal helpers.

License is the same Llama Community License. Download directly from Hugging Face, or use Together, Fireworks, DeepInfra, Replicate, or Groq at token-rate hosting.


4. DeepSeek V3 — 671B MoE and a Cost Shock

DeepSeek-V3 dropped in December 2024, and at that moment the economics of open-source LLM training shifted. One number is enough — about 5.6 million USD. Training a 671B parameter, 37B active MoE model on that budget shocked an industry that was spending over 100 million on GPT-4-class training.

Architecture — 671B total, 37B active per token. 256 routed experts + 1 shared. Multi-head Latent Attention (MLA) compresses the KV cache. Multi-Token Prediction (MTP) objective improves training efficiency. FP8 mixed precision throughout training.

Training infrastructure — 2048 H800 GPUs (the export-control compliant SKU for China). 14.8 trillion (14.8T) tokens. Pre-training 2.664M GPU-hours + context extension 119K GPU-hours + post-training 5K GPU-hours = about 2.788M GPU-hours total. At an assumed H800 rental rate of $2 per GPU-hour, that totals approximately 5.6 million dollars. The figure covers the final training run only, not prior research, ablations, or hardware purchase.
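
The cost figure is plain arithmetic, and it is worth redoing by hand once. The sketch below uses the GPU-hour breakdown reported in the DeepSeek-V3 technical report (2.664M + 119K + 5K) and the $2 per GPU-hour rental rate that the report assumes. The variable and function names are illustrative.

```python
# Reproduce the DeepSeek-V3 training-cost estimate from GPU-hours.
# Assumption: a flat rental rate of $2 per H800 GPU-hour.

GPU_HOURS = {
    "pre_training": 2_664_000,
    "context_extension": 119_000,
    "post_training": 5_000,
}
RATE_USD_PER_GPU_HOUR = 2.0

def total_gpu_hours(stages: dict) -> int:
    """Sum GPU-hours over all training stages."""
    return sum(stages.values())

def training_cost_usd(stages: dict, rate: float) -> float:
    """Rental-equivalent cost of the run: total GPU-hours times hourly rate."""
    return total_gpu_hours(stages) * rate

hours = total_gpu_hours(GPU_HOURS)
cost = training_cost_usd(GPU_HOURS, RATE_USD_PER_GPU_HOUR)
days_on_cluster = hours / 2048 / 24  # wall-clock days on the 2048-GPU cluster

print(f"{hours:,} GPU-hours, ${cost:,.0f}, about {days_on_cluster:.0f} days on 2048 GPUs")
```

Changing the hourly rate is the fastest way to see how sensitive the headline number is to the rental-price assumption.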

Performance coordinates — MMLU 88.5, MMLU-Pro 75.9, GPQA-Diamond 59.1, HumanEval 65.2 (base model), MATH-500 90.2, AIME 2024 39.2. Strong in both English and Chinese, especially solid in math and code.

License — DeepSeek's own license (MIT-derivative). Commercial use allowed. Weights download directly from deepseek-ai/DeepSeek-V3 on Hugging Face.

V3's real impact is the cost. Showing that a model of this scale can be trained on a small fraction of the assumed budget broke a long-held assumption and undid the proposition that "giant-model training is the exclusive realm of Big Tech." Since then, open-source training reports have treated "cost efficiency relative to DeepSeek" as a new baseline.


5. DeepSeek R1 + R1-Distill — The Open Baseline for Reasoning

DeepSeek-R1 launched in January 2025. If V3 shook the cost, R1 broke the closed monopoly on the reasoning category. The chain-of-thought reasoning held by OpenAI's o1 / o3 lines was dragged into the same arena under open weights.

Training recipe — DeepSeek-R1-Zero learned reasoning purely via RL (GRPO: Group Relative Policy Optimization) with no SFT. R1 added cold-start SFT then RL, capturing both accuracy and readability. The aha moment — the model spontaneously starts saying "wait, let me reconsider" during RL and corrects its reasoning path.

Performance coordinates — AIME 2024 79.8, MATH-500 97.3, Codeforces 96.3 percentile, GPQA-Diamond 71.5. Fights in the same arena as OpenAI o1.

MIT License — Weights and code both released under MIT. Commercial use, redistribution, derivatives — all free. One of the most permissive licenses in open-source reasoning-model history.

R1-Distill family — A lineup that distills R1's reasoning data into smaller base models.

  • DeepSeek-R1-Distill-Qwen-1.5B / 7B / 14B / 32B
  • DeepSeek-R1-Distill-Llama-8B / 70B

R1-Distill-Qwen-32B hits AIME 2024 72.6 and MATH-500 94.3. The point: a 32B that fits on a single H100 does OpenAI o1-mini-class reasoning. The baseline for local reasoning models jumped overnight.

In 2026, R1 and R1-Distill are the starting point of every open-source reasoning experiment. The Hugging Face Open-R1 project is reproducing the R1 training recipe in full open form, and countless derivatives are stacking on top.


6. Alibaba Qwen 3 / Qwen 2.5 — The Depth of Apache 2.0

Alibaba's Qwen series sits at the top of the license-freedom camp in open-source LLMs. The decision to ship weights under Apache 2.0 makes the entire difference.

Qwen 3 — Released in 2025. A lineup that covers both dense and MoE.

  • Dense: 0.6B / 1.7B / 4B / 8B / 14B / 32B
  • MoE: 30B total / 3B active, 235B total / 22B active

Qwen 2.5 — September 2024. Seven dense tiers from 0.5B to 72B. 18T tokens of training. 128K context (7B and above). 29+ languages.

Qwen 2.5 Coder — Code specialist line. 0.5B / 1.5B / 3B / 7B / 14B / 32B. HumanEval 92.7 (32B), top of open coding models on BigCodeBench and LiveCodeBench. Most frequently cited as a self-hosted alternative to GitHub Copilot.

Qwen 2.5 Math — Math specialist. 1.5B / 7B / 72B. Top of MATH benchmarks.

Qwen 2.5 VL — Multimodal. 3B / 7B / 72B. Image, video, document understanding.

QwQ-32B — Reasoning specialist. The open reasoning model that directly competes with R1. AIME 50.0+.

Qwen's position is clear. At nearly every size class, it is among the most license-permissive open models. Where Llama gets tied to Community License, most Qwen checkpoints are Apache 2.0. Where Mistral Large 2 restricts commercial use under Research License, Qwen 3's large models, including the 235B MoE, stay on Apache 2.0 for commercial deployment. The exceptions are worth checking per checkpoint: Qwen 2.5 3B and 72B ship under Alibaba's own Qwen licenses rather than Apache 2.0. From a legal standpoint, Qwen produces the least friction in the decision.

Chinese and Asian-language strength is a natural byproduct. Korean and Japanese quality at the same model size tends to be a notch above same-size Llama.


7. Mistral Large 2 — 123B and the Two Faces of the License

Mistral AI is France's pride and the other axis of the open-weight camp. Mistral Large 2 (Mistral-Large-Instruct-2407, July 2024) is a 123B dense model with a 128K context, support for dozens of natural languages, and 80+ programming languages.

Performance coordinates — MMLU 84.0, MATH 71.5, HumanEval 92.0, strong on MultiPL-E multilingual code benchmark. At the time of release, it sat in the top five of open-weight models on LLM Arena ELO.

Two faces of the license — Mistral Large 2 ships under Mistral Research License. Research and non-commercial use is free; commercial use requires purchasing a separate Mistral Commercial License. Not Apache 2.0, one step more restrictive than Llama Community License. Weights download from Hugging Face but production revenue forces license negotiation.

Pixtral — Mistral's multimodal line.

  • Pixtral 12B (Apache 2.0) — A 12B open multimodal model
  • Pixtral Large (124B, Research License) — Flagship multimodal stacking a vision encoder on Large 2

Codestral 25.01 — Code specialist. 80+ programming languages. Strong on fill-in-the-middle (FIM). 256K context.

Ministral 3B / 8B — Edge-focused models for mobile and on-device use. More license-restricted than the Apache 2.0 camp, but quality is comparable to same-size Llama 3.2.

Mistral 7B / Mixtral 8x7B / 8x22B — Flagships of 2023-2024. Still alive as Apache 2.0 assets. Mixtral's SMoE (Sparse MoE) architecture is the reference for every subsequent open MoE.
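
For readers who have not seen sparse MoE routing spelled out, here is a minimal pure-Python sketch. It assumes Mixtral-style routing: a router scores 8 experts per token, the top 2 are selected, and their outputs are mixed with softmax weights computed over the selected scores only. The toy experts and logits are made up for illustration, and real implementations operate on batched tensors.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a short list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(x, experts, router_logits, k=2):
    """Evaluate only the selected experts and mix their outputs."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

# Toy setup (hypothetical): expert j simply multiplies its input by j + 1.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
router_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]

print(route_top_k(router_logits))
print(moe_layer(3.0, experts, router_logits))
```

Six of the eight experts are never called for this token, which is where the gap between total and active parameters comes from.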

Mistral runs two clear tracks on the license front — small and legacy models stay on Apache 2.0 to preserve community trust, while flagships split into Research / Commercial for revenue. The use decision always pairs the revenue threshold with the license clauses.


8. Microsoft Phi-4 — A 14B That Goes All the Way on Synthetic Data

The core hypothesis of the Microsoft Phi series is simple — "data quality dominates model size." Phi-4 (December 2024) pushes that hypothesis the furthest with 14B dense parameters.

Spec — 14B dense parameters, 16K context, 9.8T tokens of training. Most of the training data is synthetic — reasoning, math, and code examples generated by a larger model (GPT-4-class) and refined into a curated set.

Performance coordinates — MMLU 84.8, MATH 80.4, HumanEval 82.6, GPQA 56.1. A 14B that fights some benchmarks in the same arena as 70B-class models. Especially strong in reasoning and math.

License — MIT License. Weights and commercial use both free. One step more open than Llama Community License, two steps more open than Mistral Research License.

Phi-4 lineup extension — Phi-4-mini, Phi-4-multimodal, Phi-3.5-MoE. The chain of small model + synthetic data keeps extending.

Phi-4's value is twofold. (1) Fits on a single 24GB GPU with 4-bit quantization, becoming a default for local inference and on-prem deployment. (2) The synthetic-data recipe is published — the details of how data was generated, filtered, and refined directly inspire small-model training elsewhere.

The limit is clear — multilingual is weak. English-centric synthetic training means Korean, Japanese, and Chinese quality lags same-size Qwen 2.5. Where Phi-4's cost-effectiveness shines: English-only domains and as a small fine-tune base.


9. Google Gemma 3 — The Day Multimodal Fit on a Single GPU

Google's Gemma 3 launched in March 2025, and from that moment "multimodal that fits on a single GPU" became the new baseline.

Lineup — 1B / 4B / 12B / 27B. Gemma 3 27B is the flagship.

Core feature bundle.

  • Multimodal — 4B and above include a vision encoder. Image input.
  • 128K context — On 4B, 12B, and 27B. 1B is 32K.
  • 140+ languages — A solid multilingual core.
  • Function calling — Structured outputs and tool calls.
  • Quantization-friendly — A 27B fits on a single RTX 4090 with 4-bit GGUF.

Performance coordinates — Gemma 3 27B: MMLU 76.9, MATH 50.0, HumanEval 71.9, LMSYS Arena ELO 1338. Top of its size class, and on some benchmarks it approaches Llama 3.1 70B with only 27B parameters.

License — Gemma Terms of Use. Tied to an Acceptable Use Policy but commercial use is allowed. Slightly narrower than Apache 2.0 and similar to Llama Community License.

Gemma 3's value is in the combination of size and multimodal. Vision, multi-language, and 128K context all pack into a 27B, making it the default multimodal for single-GPU on-prem deployment. 4B serves lighter edge and robotics scenarios.

Gemma 2 (June 2024, 9B / 27B) is still alive as a lighter base. PaliGemma 2 is the vision-language variant; CodeGemma is the code specialist.


10. TII Falcon 3 / Falcon Mamba — The Hybrid Path

The Falcon series from the UAE's Technology Innovation Institute (TII) had its moment in 2023 with Falcon 40B and reorganized its lineup in December 2024 with Falcon 3 (1B / 3B / 7B / 10B dense).

Falcon 3 — Apache 2.0. Trained on 14T tokens. 32K context. MMLU 71+ (10B). Multilingual. First-class support for English, Arabic, French, Spanish, Portuguese.

Falcon Mamba 7B — The key. First pure Mamba State-Space architecture competing with 7B transformers. Uses SSM's linear scaling instead of transformer's quadratic attention. At long contexts, memory and time complexity are far more favorable than transformers.
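
The scaling argument can be illustrated with a toy recurrence. This sketch assumes a scalar, fixed-parameter state-space update h_t = a * h_(t-1) + b * x_t, which is far simpler than Mamba's selective, input-dependent parameters. It only shows why the state stays constant-size while causal attention compares every token with all earlier ones.

```python
def ssm_scan(xs, a=0.9, b=0.1):
    """Linear recurrence h_t = a * h_{t-1} + b * x_t.

    The running state is a single number no matter how long the input is.
    """
    h = 0.0
    outputs = []
    for x in xs:
        h = a * h + b * x
        outputs.append(h)
    return outputs

def attention_pair_count(n):
    """Causal attention: token t attends to t tokens, so n(n+1)/2 pairs in total."""
    return n * (n + 1) // 2

def ssm_step_count(n):
    """One state update per token."""
    return n

for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7}: attention pairs={attention_pair_count(n):>14,} "
          f"ssm steps={ssm_step_count(n):>8,}")
```

Growing the sequence 10x grows the attention pair count about 100x and the recurrence step count 10x.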

Falcon 3 7B-Hybrid — Mamba + transformer hybrid. Aims for SSM's efficiency and transformer's accuracy simultaneously.

Falcon's position narrows to two things. (1) The flag of Arabic and Middle-Eastern LLMs — Arabic training weight is overwhelming compared to other global models. (2) The largest open base for Mamba / SSM experiments — while Mistral and Llama hold the transformer orthodoxy, Falcon Mamba carries the SSM banner.

Absolute benchmark performance lags Llama 3.3 and Qwen 2.5, but for researchers trying new architectures, it is the most frequently cited starting point.


11. Allen AI OLMo 2 + Tülu 3 — The Baseline of "Truly Open"

Among open-source LLMs, "is it really fully open" has its own bar. A weights-only-open model and one with weights, code, data, and intermediate checkpoints all open are scientifically different. That baseline is Allen Institute for AI's OLMo.

OLMo 2 (November 2024) — 7B / 13B parameters. The full 5T-token training dataset is published (Dolma 2), full code is published, hundreds of intermediate training checkpoints are published, training logs and evaluation scripts are published. Apache 2.0.

Tülu 3 (November 2024) — The post-training recipe applied on top of OLMo 2. The entire pipeline of SFT + DPO + PPO is open in code and data. Tülu 3 70B (applied on Llama 3.1 70B) competes with GPT-4o-mini.

OLMo and Tülu's value is reproducibility. Other open-weight models only show "this came out"; OLMo provides the science of "you can retrain identically with this data, this code, these hyperparameters." Training-dynamics research, scaling-law verification, pre-training data effect analysis — these are nearly impossible without fully open models.

Other assets in the fully-open camp.

  • EleutherAI Pythia (2023) — A GPT-NeoX-based suite of eight model sizes from 70M to 12B, each with 154 public checkpoints. The standard suite for training-dynamics research.
  • BigScience BLOOM (2022) — 176B multilingual. ROOTS training dataset released.
  • Together RedPajama — Open reproduction of the Llama pre-training dataset.
  • Stability AI StableLM — Weights and partial code released. Activity now slow.

The baseline for commercial production is usually Llama 4, Qwen 3, or Mistral, but for research, teaching, and reproducibility, the OLMo / Tülu line is the answer.


12. Korean Models — HyperCLOVA X, Kanana, EXAONE 3.5, VARCO, Luxia, Solar

The Korean-language LLM landscape is rich in 2026. On Korean tasks specifically, domestic models often sit a step ahead of global models.

Naver HyperCLOVA X — Naver's flagship. Operates closed APIs like HCX-003 and HCX-005 alongside open lines like HyperCLOVA X SEED. Highest ratings on Korean naturalness and cultural context understanding.

Kakao Kanana — Kakao Brain's open line. Kanana 1.5 8B / 32B. Built on Kakao Brain's long-standing LLM assets (KoGPT etc.) as Korean-specialized. License freedom close to the Apache 2.0 camp.

LG AI Research EXAONE 3.5 — December 2024. Three tiers at 2.4B / 7.8B / 32B. English-Korean balance, function calling, long context (32K). Runs EXAONE Deep as a reasoning-specialist variant. Goes into LG's in-house applications (LG U+, LG Household & Health Care, LG Electronics).

NCsoft VARCO 13B / VARCO LLM — NCsoft's game and content domain specialization. Strong in character dialogue, scenarios, multi-turn conversations.

Saltlux Luxia / Saltlux LLM — Saltlux's enterprise Korean LLM. Strong fit in finance, legal, public sector.

Upstage Solar 10.7B — Key asset from 2023-2024. 10.7B trained via depth up-scaling. Strong in both Korean and English. Some weights are released under OpenAccess. Upstage Solar Mini and Solar Pro extend the line.

KIST, ETRI, KORANI, National Institute of Korean Language models — Academia and government bodies accumulate Korean data and model assets separately.

The key selection criteria are (1) absolute level of Korean naturalness, (2) domestic cloud and data sovereignty requirements (Naver Cloud, KT Cloud, NCloud), (3) commercial friendliness of the license. There are clearly domains where global models fall short on Korean.


13. Japanese Models — ELYZA, PLaMo, rinna, Stockmark, Sakana

Japan's open-source LLM camp is also abundant.

ELYZA-japanese-Llama-2/3 — ELYZA's line layering Japanese continued pre-training and SFT on Llama bases. 7B / 13B / 70B. The de facto standard Japanese Llama variant in the Japanese market.

PFN PLaMo — Preferred Networks' PLaMo series. PLaMo-13B, PLaMo β, PLaMo Lite. Japanese-only training route. Own data, own infrastructure.

rinna — rinna's Japanese model line. RWKV-based Japanese models, Japanese GPT, Bilingual GPT, Llama variants. Strong in Japanese speech and character applications.

Stockmark LLM — Stockmark's Japan business news and market information domain specialization. Trained on 100B-token Japanese news data.

Sakana AI — Tokyo-based. Evolutionary model merging — meta technique that auto-merges multiple models via evolutionary algorithms. Released EvoLLM-JP and other Japanese merged models. Merging and evolution techniques are the weapon, not a single model.

ABEJA QwenJP, CyberAgent CALM2, Lightblue Karasu etc. — Base variants from Japanese startup camp.

As in Korea, the deciding factors in Japan are (1) local Japanese naturalness, (2) data sovereignty and made-in-Japan policy, (3) anime, manga, and game domain specialization.


14. Chinese Models — Yi, InternLM, MiniCPM, Baichuan

Chinese open-source LLMs go beyond Qwen and DeepSeek into multiple layers.

Yi 1.5 (01.AI) — 6B / 9B / 34B. Apache 2.0. Strong in English-Chinese balance. From Kai-Fu Lee's 01.AI.

InternLM 2.5 (Shanghai AI Lab) — 7B / 20B. 1M context variant (InternLM2-Wqx etc.). Strong in reasoning and tool use. Runs various variants (InternVL multimodal etc.).

MiniCPM 3.0 (OpenBMB / Tsinghua) — 4B / 8B. Edge LLM specialization. Mobile inference, quantization-friendly, multilingual. Some benchmark wins over same-size Llama.

Baichuan 3 / Baichuan-M1 / Baichuan2 — Baichuan AI. Variants specialized in medical, legal, and finance vertical domains. Strong in school and exam data training.

01.AI Yi-VL, InternVL, MiniCPM-V — The multimodal lines of the Chinese camp. Together with Qwen 2.5 VL, they make up roughly half of the open multimodal field.

ChatGLM (Zhipu) — The GLM series. GLM-4, ChatGLM3. English-Chinese balance.

Common features of Chinese models — (1) Top-class Chinese naturalness, (2) Fast lineup rotation (one step up every quarter), (3) Relatively free licenses (Apache 2.0 or own variant).

Separate from some US government export-control rules, many models have no restrictions on commercial use itself. Policy and legal review for multinational global deployment is a separate matter, but the weights license itself is free.


15. Inference Stack — vLLM, SGLang, llama.cpp, MLX, TGI

Getting the weights is one thing; without an inference engine they are of no use. The 2026 open-source inference stack is multi-layered.

vLLM — A GPU serving engine that originated at UC Berkeley's Sky Computing Lab. PagedAttention manages the KV cache in fixed-size pages rather than one contiguous buffer per request, which yields top-tier throughput. Supports nearly all open models — Llama, Qwen, Mistral, Phi. The de facto standard for GPU serving. Built-in OpenAI-compatible API server.
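
To show what page-by-page KV cache management means, here is a toy allocator. It is a sketch of the idea only, not vLLM's implementation: the class name, method names, and block size are hypothetical, and it tracks block ids rather than real tensors.

```python
class PagedKVCache:
    """Toy block-table allocator in the spirit of PagedAttention (hypothetical API)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token.

        A new block is allocated only when the sequence's last block is full,
        so at most block_size - 1 slots are wasted per sequence.
        """
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size  # (physical block, offset)

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


demo = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    demo.append_token("request-a")
print(demo.block_tables, demo.free_blocks)
```

Because blocks need not be contiguous, memory freed by one finished request can be reused immediately by any other request.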

SGLang — Another high-performance serving engine. RadixAttention makes prompt-prefix caching extreme. Strong in structured generation (JSON, regex-based decoding). A strong rival to vLLM.
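
Prefix caching comes down to one question: how many leading tokens of a new request were already processed for an earlier one? The sketch below answers it with a plain token-level trie. RadixAttention itself uses a radix tree with eviction and stores real KV tensors, so treat the class and method names as hypothetical.

```python
class PrefixCache:
    """Token-level trie that remembers which prompt prefixes have been seen."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record a processed prompt so later requests can reuse its prefix."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_len(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node = self.root
        matched = 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched


cache = PrefixCache()
system_prompt = [101, 7, 7, 42]          # hypothetical token ids
cache.insert(system_prompt + [5, 6])     # first request
new_request = system_prompt + [9, 9, 9]  # shares the system prompt
reused = cache.match_len(new_request)
print(f"reuse {reused} of {len(new_request)} prompt tokens")
```

Workloads with long shared system prompts or few-shot examples benefit most, since only the unmatched suffix needs a fresh prefill.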

Hugging Face TGI (Text Generation Inference) — HF's own serving. Backend of Inference Endpoints. A stable production default.

TensorRT-LLM (NVIDIA) — NVIDIA's official inference engine. Long build time, but on the same GPU achieves maximum throughput and lowest latency. The peak of production NVIDIA environments.

llama.cpp — Georgi Gerganov's C / C++ inference. GGUF format with diverse quantization (2 / 3 / 4 / 5 / 6 / 8-bit). CPU / CUDA / ROCm / Metal / Vulkan backends. Universal range from Apple Silicon and regular PCs to Raspberry Pi. Ollama, LM Studio, LocalAI all stack on top.

Apple MLX — Apple Silicon-only ML framework. M3 Max / M4 Ultra can run up to 70B models in INT4. mlx-examples has many Llama, Qwen, Mistral ports.

exllamav2 / exllamav3 — turboderp's GPU inference. GPTQ / EXL2 quantization formats only. On a single GPU, latency on quantized models is sometimes lower than vLLM. Optimal for local and small-scale workloads.

Ollama — User-friendly wrapper over llama.cpp. Models download and run with one line, like ollama run llama3.3:70b-instruct-q4_K_M. The mass entry to local and on-device workflows.

LMDeploy / OpenLLM / Ray Serve / Triton Inference Server — Other production serving options.

Selection criteria — vLLM or SGLang for large-scale cloud GPU serving, TensorRT-LLM for single-NVIDIA-instance optimization, llama.cpp / Ollama for local and on-device, MLX for Apple Silicon, exllamav2 for single RTX 4090 optimization.


16. Hosting Providers — Together, Fireworks, Groq, DeepInfra, Replicate

If you do not want to handle weights directly and prefer paying by tokens, hosting providers are the answer.

Together.ai — The widest catalog of open-source LLM hosting. Llama, Qwen, Mistral, DeepSeek, Falcon, Gemma — almost all here. OpenAI-compatible API. Also provides fine-tuning service (Together Tune).

Fireworks.ai — Specialized in high-performance serving. Function calling, structured output, low latency. Llama, Mistral, DeepSeek lineup focus.

Groq — Overwhelming token-generation speed on their own LPU (Language Processing Unit) chip. Llama, Mixtral, Gemma limited. Fastest hosting in tokens-per-second.

DeepInfra — Strongest price-performance. Same-class model pricing is the lowest. Llama, Qwen, Mistral, DeepSeek catalog.

Replicate — Catalog blended with multimodal and image-generation models. LLMs are handled, but strength shines when combined with vision and audio models.

OpenRouter — Routes multiple hosting providers through a single API. Auto-routes by price, latency, and availability.

Hugging Face Inference Endpoints / Serverless Inference — HF's official serving. Pro subscription lets you use larger models.

Cerebras Inference — Fast inference on Cerebras wafer-scale chips. Llama-centric.

SambaNova Cloud — Based on SambaNova's own RDU chips.

Selection criteria — catalog breadth: Together, speed: Groq / Cerebras, price: DeepInfra, multi-provider routing: OpenRouter, production stability: Fireworks / Together.


17. Quantization — GGUF, GPTQ, AWQ, FP8, INT4

Weights as-is are too big. A 70B is 140GB at fp16, 35GB at INT4. You need to know the quantization formats for local inference.

GGUF (llama.cpp) — Most universal. Q2_K / Q3_K_S/M/L / Q4_K_S/M / Q5_K_S/M / Q6_K / Q8_0 varied. Q4_K_M is the default for the quality-size balance. Community quantization hubs on Hugging Face from TheBloke, bartowski, mradermacher.

GPTQ — Group-wise quantization. 4-bit is default. exllamav2 is the main runtime. GPU only.
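
What "group-wise" means is easiest to see in code. The sketch below is plain symmetric round-to-nearest quantization with one scale per group. It is not the GPTQ algorithm, which additionally adjusts the remaining weights to compensate for rounding error. The group size of 4 is only for readability; real formats commonly use groups such as 128.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group with a shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    """Recover approximate float weights from integers and the group scale."""
    return [v * scale for v in q]

def quantize(weights, group_size=4, bits=4):
    """Split a flat weight list into groups and quantize each independently."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [quantize_group(g, bits) for g in groups]

weights = [7.0, -3.0, 0.0, 1.2, 0.02, -0.01, 0.005, 0.0]
for q, scale in quantize(weights):
    print(q, scale, dequantize_group(q, scale))
```

The second group holds much smaller values than the first, and its own scale keeps them from all rounding to zero. That is the reason for per-group scales.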

AWQ (Activation-aware Weight Quantization) — MIT's quantization algorithm. Looks at the activation distribution and protects the weights that matter most. Supported across vLLM, TGI, and TensorRT-LLM.

EXL2 — exllamav2-only. Variable bit-rate (2.5 - 8 bpw) for more flexible distribution within the same model size. Strong for fitting fine-tuned models exactly into a single GPU memory.

FP8 — Native on H100 / H200 / MI300 next-gen GPUs. Training and inference both in FP8. DeepSeek V3 used FP8 from training.

INT4 (BitsAndBytes) — Tim Dettmers's quantization. Hugging Face Transformers integration. Saves base-model memory during fine-tuning (QLoRA).

bf16 / fp16 — Absolute baseline without quantization.

Selection criteria — local CPU / Apple Silicon: GGUF, single local GPU: EXL2 / GPTQ, vLLM serving: AWQ or GPTQ, H100 / H200 serving: FP8, absolute quality priority: bf16.


18. Benchmarks — MMLU, GPQA, HumanEval, IFEval, Arena

The benchmark bundle you meet when comparing open-source models.

MMLU (Massive Multitask Language Understanding) — 57 domains, multiple choice. Undergraduate-level general knowledge. Top models hit 88+. Saturated, with declining reliability.

MMLU-Pro — MMLU successor. Harder, with 10 multiple-choice options. More reasoning weight.

GPQA-Diamond — Graduate-level physics, chemistry, biology. Even human experts hit 60-70 percent. Top models 70+.

HumanEval — Python function coding, 164 problems. Saturated (90+).
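
HumanEval scores are conventionally reported as pass@k, a detail the summary above leaves out. The sketch below implements the standard unbiased estimator: generate n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k samples passes. The sample counts are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated per problem, c: samples that passed, k: budget.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem results: (samples generated, samples passing)
results = [(10, 3), (10, 0), (10, 10)]
for k in (1, 5):
    score = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
    print(f"pass@{k} = {score:.3f}")
```

A model-card number quoted without k, sampling temperature, and sample count is hard to compare with another one.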

BigCodeBench — HumanEval successor. Real library use and multi-step code. More realistic.

LiveCodeBench — New coding problems updated by time. Prevents data contamination.

MATH — Math competition. Five difficulty levels. Top models 80+.

AIME (American Invitational Mathematics Examination) — Standard benchmark for reasoning models. o1 and R1 shine.

IFEval (Instruction Following) — Instruction following. Explicit instructions on format, length, language.

MT-Bench — Multi-turn dialogue. GPT-4 judges.

LMSYS Chatbot Arena — Real-user blind comparison. ELO ranking. The most trusted aggregate indicator.
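
To read an Arena rating gap as a win probability, the classic Elo formulas are enough. This is a sketch of textbook online Elo, not the leaderboard's actual fitting procedure, which is assumed here to be a statistical fit over all recorded battles. The K-factor of 32 and the 1300-rated opponent are illustrative.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model (400-point logistic scale)."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, score_a: float, k: float = 32):
    """Return new (rating_a, rating_b) after one battle.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# A model rated 1338 against a hypothetical model rated 1300
p = expected_score(1338, 1300)
print(f"expected win rate: {p:.3f}")
print(update(1338, 1300, score_a=0.0))
```

A gap of a few dozen points means the higher-rated model is preferred only modestly more often than half the time.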

ArenaHard — Difficulty-filtered variant of Arena.

Korean: HAERAE, KoBEST, KMMLU (45 Korean subjects). Japanese: JMMLU, JGLUE. Chinese: C-Eval, CMMLU.

Trap in comparison — same benchmark score swings 5-10 points based on prompt format, few-shot count, and evaluation code. Rather than trusting model-card scores, measuring directly with standard tools like lm-evaluation-harness or OpenCompass is safer.


19. Fine-tuning — LoRA, QLoRA, DPO, GRPO

Receiving the weights is followed by fine-tuning to your own domain.

SFT (Supervised Fine-Tuning) — Most basic. (input, output) pairs update all of the model's weights. SFTTrainer from trl, built on top of transformers, is the standard.

LoRA (Low-Rank Adaptation) — Microsoft's PEFT technique. Trains only low-rank adapters instead of all weights. Even a 70B can be trained on one 8x A100 node. Adapters are typically tens of MB.
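
The size of the saving follows from parameter counts. For a weight matrix of shape d_out x d_in, LoRA trains two factors of shapes d_out x r and r x d_in instead of the full matrix. The hidden size of 8192 and rank 16 below are assumptions for illustration, not values from a specific model card.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when updating the whole weight matrix."""
    return d_in * d_out

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for LoRA factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)

d_model, rank = 8192, 16  # assumed hidden size and adapter rank
full = full_params(d_model, d_model)
lora = lora_params(d_model, d_model, rank)

print(f"full matrix: {full:,} params")
print(f"LoRA r={rank}: {lora:,} params ({100 * lora / full:.2f}% of full)")
print(f"adapter size at 2 bytes per param: {lora * 2 / 1e6:.2f} MB per adapted matrix")
```

Multiplying the per-matrix size by the number of adapted projections across all layers gives adapters in the tens-of-megabytes range.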

QLoRA — Tim Dettmers's variant. Trains LoRA adapters while the base model stays 4-bit quantized. A 70B-class fine-tune becomes possible on a single 48GB GPU, since the 4-bit base weights alone take about 35GB.

DPO (Direct Preference Optimization) — Rafailov's alignment technique. Replaces PPO's reward model + RL stages with direct preference loss. Widely used as the next stage after SFT.
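
The "direct preference loss" fits in a few lines. This sketch computes the DPO loss for a single preference pair from summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model. The log-probability values and beta = 0.1 are made up, and real training code works on batched tensors.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * margin)).

    margin measures how much more the policy prefers the chosen response over
    the rejected one, relative to the reference model. Inputs are log-probs.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(x)) == log(1 + exp(-x)); log1p keeps it stable for large x
    return math.log1p(math.exp(-beta * margin))

# Policy identical to the reference: margin 0, loss log(2)
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy moved toward the chosen response and away from the rejected one
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

No reward model appears anywhere: the reference model's log-probabilities play that role implicitly.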

ORPO / KTO / IPO / SimPO — Variants of DPO. Preference data form and loss function differ slightly.

GRPO (Group Relative Policy Optimization) — DeepSeek R1's RL technique. Samples a group of answers per prompt and scores each one relative to the group, so no separate value (critic) model is needed. Default for reasoning-model training.
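
The group-relative part can be sketched as follows: sample several answers to the same prompt, score each one, and normalize each reward against the group's mean and standard deviation. The rewards below are hypothetical rule-based scores (1.0 for a correct final answer). The population standard deviation and the epsilon term are assumptions of this sketch.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sampled answer relative to its own group.

    A_i = (r_i - mean(r)) / (std(r) + eps). The group statistics act as the
    baseline, so no learned value function is required.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one math prompt: two correct, two wrong
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))

# If every answer gets the same reward, there is no learning signal
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

The second case shows a practical consequence: prompts the model always solves, or never solves, contribute nothing to the update.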

RLAIF / Constitutional AI — Instead of human labeling in RLHF, AI generates comparison data directly. Cost reduction.

Tool bundle — Hugging Face transformers + peft + trl + accelerate + deepspeed is the standard stack. High-level wrappers like axolotl, unsloth, llama-factory stack on top. unsloth recently surged in popularity for boosting LoRA / QLoRA training speed 2-5x via kernel optimization.

Data synthesis — Tools like Magpie, Distilabel, Argilla automate fine-tune synthetic dataset generation. Phi-4's synthetic-data training recipe is the model in this direction.


20. Multimodal — Llama 4 Vision, Pixtral, Qwen 2.5 VL, NVLM, MiniCPM-V

Open multimodal LLMs are on track in 2026.

Llama 4 Scout / Maverick — Native multimodal. Early fusion processes image and text inside the same transformer. Structurally different from the LLaVA approach where a separate vision encoder is applied externally.

Pixtral 12B / Pixtral Large — Mistral's multimodal. Variable-resolution input. Pixtral 12B is Apache 2.0, Large is Research License.

Qwen 2.5 VL — Alibaba. 3B / 7B / 72B. A rare open model supporting video input as well. Strong in document OCR and chart understanding.

NVLM (NVIDIA) — NVIDIA's open multimodal. Decoder-only and cross-attention variants.

MiniCPM-V — OpenBMB's edge multimodal. Mobile and on-device vision-language.

InternVL 2.5 — Shanghai AI Lab. From 1B to 78B. Strong in video, OCR, charts.

LLaVA series, CogVLM, Yi-VL — Other open multimodal variants.

Gemma 3 Vision — Built into Gemma 3 4B and above.

Phi-4 Multimodal — Microsoft's multimodal variant.

Comparison criteria — (1) resolution and dynamic resolution support (decisive for high-resolution document OCR), (2) video input support, (3) chart, table, and formula understanding (numerical vision), (4) per-language OCR (non-Latin scripts like Korean, Chinese, Japanese, Arabic).


21. The License Map — Apache 2.0, MIT, Llama, Gemma, Mistral

From a revenue and legal viewpoint, open-source LLM licenses reduce to five tiers.

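Tier 1 — Fully free: Apache 2.0, MIT, BSD. Commercial use, redistribution, derivatives all free. Qwen 3 and most Qwen 2.5 sizes, Mistral 7B / Mixtral, DeepSeek R1, OLMo 2, Phi-4, Pixtral 12B. Check the model card per size: Qwen 2.5 3B and 72B ship under Qwen's own license, and Falcon 3 uses the Apache-based TII Falcon License, which adds an acceptable-use policy.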

Tier 2 — Conditional usage policy: Llama Community License (700M MAU threshold), Gemma Terms (Acceptable Use Policy), Apple OpenELM license. Llama 3.x / Llama 4, Gemma 2 / Gemma 3.

Tier 3 — Free for research, commercial requires separate license: Mistral Research License. Mistral Large 2, Pixtral Large, Codestral (commercial requires separate purchase).

Tier 4 — Non-commercial only: CC-BY-NC, parts of OpenRAIL-M. Cohere Command R+ (CC-BY-NC).

Tier 5 — Closed API: Weights closed, tokens only. GPT-4o, Claude, Gemini.

The color legal sees is clear — Tier 1 has almost no friction, Tier 2 requires usage-policy review (verifying prohibited domains like military and biometric identification), Tier 3 requires commercial license negotiation, Tier 4 is unusable when revenue arises.

OpenRAIL-M, RAIL, and the BigScience BLOOM License variants form a family of "Responsible AI Licenses." In practice they grant Apache-2.0-style permissions and add use-based restrictions on top.

Basic selection — For production with revenue, start with Tier 1. Accept Tier 2 if the Llama-friendly tool ecosystem matters. Choose Tier 3 and prepare for license negotiation if Mistral quality is essential. Tier 4 is possible for non-commercial research and internal tools.
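The tiers above can be encoded as a small lookup that a build or review script consults before a model is approved. This is a sketch of this post's own tiering, not a legal determination; the dictionary contents, action strings, and function name are illustrative assumptions.

```python
# License name -> tier, following the five tiers described above.
LICENSE_TIER = {
    "Apache-2.0": 1,
    "MIT": 1,
    "BSD": 1,
    "Llama Community License": 2,
    "Gemma Terms": 2,
    "Mistral Research License": 3,
    "CC-BY-NC": 4,
    "Closed API": 5,
}

TIER_ACTION = {
    1: "use directly",
    2: "review usage policy",
    3: "negotiate commercial license",
    4: "non-commercial only",
    5: "API only, no weights",
}

def commercial_check(license_name):
    """Return (tier, action); unknown licenses are routed to legal review."""
    tier = LICENSE_TIER.get(license_name)
    if tier is None:
        return None, "unknown license: send to legal"
    return tier, TIER_ACTION[tier]

for name in ("Apache-2.0", "Gemma Terms", "CC-BY-NC", "Some New License"):
    print(name, "->", commercial_check(name))
```

Routing unknown names to legal review rather than guessing a tier keeps a new or renamed license from slipping into Tier 1 by default.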


22. Selection Matrix — What to Use When

A single table of all the models seen so far.

| Scenario | First choice | Second choice | Note |
| --- | --- | --- | --- |
| English general flagship | Llama 4 Maverick | DeepSeek V3 | MoE |
| English simple tasks, low cost | Llama 3.3 70B | Mistral Large 2 | Dense |
| Korean top-class | Naver HCX / Kanana | Qwen 2.5 72B | National domain |
| Japanese top-class | ELYZA-Llama-3 | Qwen 2.5 72B | - |
| Chinese top-class | Qwen 2.5 72B | Yi 1.5 34B | - |
| Reasoning, math | DeepSeek R1 | QwQ-32B | MIT license |
| Code generation | Qwen 2.5 Coder 32B | Codestral 25.01 | FIM |
| Single 24GB GPU | Phi-4 14B | Gemma 3 12B Q4 | - |
| Multimodal single GPU | Gemma 3 27B | Qwen 2.5 VL 7B | - |
| Multimodal flagship | Llama 4 Maverick | Qwen 2.5 VL 72B | Native |
| Mobile, edge | Llama 3.2 3B | Phi-4-mini | Q4 |
| Need full license freedom | Qwen 3 / Qwen 2.5 | Mistral 7B | Apache 2.0 |
| Academic reproduction | OLMo 2 + Tülu 3 | Pythia | Fully open |
| Fast token generation | Llama on Groq | Llama on Cerebras | LPU |
| Apple Silicon local | Llama 3.3 70B (MLX) | Gemma 3 27B (MLX) | M3 / M4 |
| Training on one 5K-dollar GPU | QLoRA Llama 70B | LoRA Qwen 32B | unsloth |
| Pre-training on 10K GPUs | Llama 4 full | OLMo 2 retrain | - |

Question fork.

  1. Commercial use, or non-commercial / research? -> Commercial caps at Tier 1-2; non-commercial opens up to Tier 4.
  2. Must fit on a single GPU? -> 24GB targets Phi-4, Gemma 3 12B, Llama 3.2. 80GB targets Llama 3.3 70B (Q4).
  3. Reasoning and math, or general? -> Reasoning takes R1 or QwQ; general takes Llama 4 or Qwen 3.
  4. For Korean, Chinese, Japanese, which is better: global or national? -> Korean usually Naver / Kakao / LG, Chinese is Qwen, Japanese is ELYZA.
  5. Need reproducible scholarship? -> OLMo 2 + Tülu 3.
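The single-GPU question in the fork above comes down to simple arithmetic on weight size. The sketch below estimates weight memory from parameter count and bits per weight, then checks it against a VRAM budget. The 20% overhead factor for KV cache and activations is a rough assumption, and nominal "4-bit" formats often use slightly more than 4 bits per weight in practice.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

def fits(params_billion, bits_per_weight, vram_gb, overhead=1.2):
    """True if weights plus an assumed overhead factor fit in the VRAM budget."""
    return weight_memory_gb(params_billion, bits_per_weight) * overhead <= vram_gb

candidates = [
    ("14B at 4-bit on 24GB", 14, 4, 24),
    ("14B at 16-bit on 24GB", 14, 16, 24),
    ("70B at 4-bit on 24GB", 70, 4, 24),
    ("70B at 4-bit on 80GB", 70, 4, 80),
]
for label, params, bits, vram in candidates:
    size = weight_memory_gb(params, bits)
    print(f"{label}: {size:.1f} GB weights, fits={fits(params, bits, vram)}")
```

The same helper works for any size and precision, so it can be rerun for whichever model the fork lands on.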

23. Traps and Common Misconceptions

Common traps when operating open-source LLMs.

Trap 1 — Ambiguous definition of "open". Are weights-open and open the same? Are weights + code + data + checkpoints open? This definitional difference connects directly to academic reproducibility. Fully open is limited to OLMo, Tülu, Pythia, BLOOM.

Trap 2 — License trap. Hearing that Llama is "open", dropping it straight into a service, and then crossing the 700M monthly-active-user threshold triggers a separate license negotiation with Meta. Hearing that Mistral Large 2 is "non-commercial" and using it in an internal tool works until that tool faces external customers, at which point it becomes a license violation. Apache 2.0, Llama Community License, and Mistral Research License are not the same "open."

Trap 3 — Benchmark equals capability. The difference between MMLU 88 and 89 is within measurement noise. Arena-Hard and LMSYS Chatbot Arena Elo ratings are more reliable, and direct testing on your real use cases is still necessary.

Trap 4 — Quantization is not free. Q4_K_M typically costs 1-2 benchmark points, Q3_K_S 5-10 points. The loss tends to be larger on reasoning tasks. As the bit width drops, hallucination and arithmetic errors rise.
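To see where the loss comes from, the sketch below quantizes a list of synthetic weights to a given bit width and measures the round-trip error. It uses plain symmetric round-to-nearest quantization with one scale for the whole list, which is a deliberate simplification and not the block-wise k-quant scheme behind formats like Q4_K_M. The weights are made-up values, not taken from any model.

```python
import math

def quantize_roundtrip(weights, bits):
    """Quantize to signed `bits`-bit integers with one shared scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    restored = [round(w / scale) * scale for w in weights]
    return restored, scale

def mean_abs_error(weights, bits):
    """Average absolute difference between original and round-tripped weights."""
    restored, _ = quantize_roundtrip(weights, bits)
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)

# Synthetic stand-in for a weight tensor.
weights = [math.sin(i) for i in range(64)]
for bits in (8, 4, 3):
    print(f"{bits}-bit mean abs error: {mean_abs_error(weights, bits):.4f}")
```

Each weight can be off by up to half a quantization step, and the step size roughly doubles with every bit removed, which is why quality degrades faster at the low end.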

Trap 5 — Fine-tune is not a panacea. Fine-tuning with small domain data often breaks the base model's general ability — catastrophic forgetting. RAG is the answer more often than fine-tune is.

Trap 6 — Context length != effective context. Even a 1M-context model loses accuracy in needle-in-a-haystack tests in the late part of the context. Verify actual performance with long-context benchmarks like RULER and LongBench.
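A needle-in-a-haystack check is easy to build yourself before trusting an advertised context length. The sketch below only constructs the prompts: it places one "needle" sentence at several relative depths inside filler text. The filler, the needle, and the function names are hypothetical, and sending each prompt to a model and scoring the answer is left out.

```python
def build_needle_prompt(filler_sentence, needle, total_sentences, depth):
    """Place `needle` at a relative `depth` (0.0 = start, 1.0 = end) in filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * total_sentences)
    sentences = [filler_sentence] * total_sentences
    sentences.insert(position, needle)
    return " ".join(sentences), position

def make_depth_sweep(filler_sentence, needle, total_sentences, steps=5):
    """Return (depth, prompt) pairs covering the context from start to end."""
    depths = [i / (steps - 1) for i in range(steps)]
    return [
        (d, build_needle_prompt(filler_sentence, needle, total_sentences, d)[0])
        for d in depths
    ]

FILLER = "The sky is blue."
NEEDLE = "The access code is 7421."
sweep = make_depth_sweep(FILLER, NEEDLE, total_sentences=10, steps=5)
for depth, prompt in sweep:
    print(f"depth={depth:.2f} needle starts at char {prompt.index(NEEDLE)}")
```

In a real test, `total_sentences` would be scaled until the prompt approaches the model's context limit, and accuracy would be recorded per depth to find where retrieval starts to fail.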

Trap 7 — Multi-GPU distribution is not simple. Tensor Parallel requires fast interconnect (NVLink) between GPUs. PCIe alone drops throughput. Pipeline Parallel is effective only in some model shapes.

Trap 8 — Misreading "DeepSeek 5.6M USD". That number is the GPU-hour rental cost of the final pre-training run only. Infrastructure amortization, headcount, failed runs, and algorithm R&D costs are excluded. The true total cost is likely five to ten times that.
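The arithmetic behind the headline is short enough to reproduce. The GPU-hour count and the 2 USD hourly rate below are the figures stated in the DeepSeek-V3 technical report, not in this post; the 5x to 10x multipliers are this post's own rough estimate of the full cost.

```python
GPU_HOURS = 2_788_000    # H800 GPU-hours reported for the DeepSeek-V3 training run
RENTAL_RATE_USD = 2.0    # rental price per GPU-hour assumed in the report

headline_cost = GPU_HOURS * RENTAL_RATE_USD

def total_cost_range(headline, low_multiplier=5, high_multiplier=10):
    """Scale the headline figure by the rough all-in multipliers from the text."""
    return headline * low_multiplier, headline * high_multiplier

low, high = total_cost_range(headline_cost)
print(f"headline: {headline_cost / 1e6:.3f}M USD")
print(f"rough all-in range: {low / 1e6:.1f}M to {high / 1e6:.1f}M USD")
```

The headline is a rental-equivalent price for one run, so it moves directly with the assumed hourly rate and says nothing about what it cost to get to that run.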

Trap 9 — Korean, Japanese, and Chinese models are not unconditionally better than global on their own language. Llama 4 and Qwen 3 use overwhelmingly more multilingual data, and gaps narrow in some areas. Real testing per domain and register is the answer.

Trap 10 — "Fully open" models are not always the answer. OLMo and Pythia have absolute academic value, but absolute performance lags Llama and Qwen. Production and research are different axes.


24. Conclusion — One Map, Five Forks

Reducing the 2026 spring open-source LLM landscape to one paragraph.

Flagships are Llama 4 Maverick, DeepSeek V3, Qwen 3, Mistral Large 2. Reasoning is DeepSeek R1 and R1-Distill, QwQ-32B. Code is Qwen 2.5 Coder and Codestral. Single-GPU multimodal is Gemma 3 27B. Cost-effective synthetic-data is Phi-4 14B. New architectures live in Falcon Mamba. Academic reproduction lives in OLMo 2 + Tülu 3. Korean is Naver HCX, Kakao Kanana, LG EXAONE 3.5. Japanese is ELYZA, PLaMo, rinna. Chinese is Qwen, Yi, InternLM, MiniCPM.

The inference stack is vLLM and SGLang as the GPU-serving standard, llama.cpp, MLX, and exllamav2 as the local standard, and Together, Fireworks, Groq, and DeepInfra as the hosting standard. On licenses, Apache 2.0 / MIT mean no friction, Llama Community / Gemma Terms mean a conditional usage policy, and Mistral Research means non-commercial use only.

The proposition from two years ago — "open source is the shadow of closed" — is gone. In spring 2026, open-source LLMs throw answers in the same arena, on the same benchmarks, to the same users as closed. At what coordinate you receive that answer — where license, size, domain, and infrastructure meet — that coordinate is half the workflow.


References