Authors

- Youngju Kim (@fjvbn20031)
Table of Contents
- Why Open-Source LLMs?
- The Major Model Families
- Model Architecture Deep Dive
- Local Inference Tools
- Production Serving with vLLM
- Fine-Tuning with LoRA and QLoRA
- The Hugging Face Ecosystem
- Quantization Techniques
- Choosing the Right Model
- Open-Source vs Proprietary: When to Use What
1. Why Open-Source LLMs?
1.1 The Case for Open Models
For several years, proprietary models (GPT-4, Claude) dominated in quality. That gap has narrowed dramatically. In 2025–2026, the top open-source models rival proprietary offerings on most benchmarks, and in some specialized domains they surpass them.
Reasons to use open-source LLMs:
| Reason | Details |
|---|---|
| Data privacy | Your data never leaves your infrastructure |
| Cost at scale | No per-token charges; amortize GPU costs |
| Customization | Fine-tune on your own domain data |
| Compliance | Healthcare, finance, legal often require on-prem |
| Latency control | Co-locate model with your application |
| No vendor lock-in | Switch models without API changes |
| Research and transparency | Inspect weights, architecture, training data |
1.2 Definitions: Open vs Open-Weight vs Truly Open
The term "open-source" is used loosely in the LLM community:
- Truly open-source: Weights, training code, and training data all released (e.g., OLMo, Pythia).
- Open-weight: Weights available for download, but training data or code is not (e.g., Llama 3, Mistral, Gemma).
- Available weights with restrictions: Weights available but with use-case restrictions (e.g., Llama's community license excludes large commercial providers).
Most models called "open-source" are open-weight. This matters for legal and research purposes.
2. The Major Model Families
2.1 Meta Llama
The Llama family is the dominant open-weight model series, released by Meta AI.
Llama 3.1 / 3.2 / 3.3 (2024–2025)
| Model | Parameters | Context | Notes |
|---|---|---|---|
| Llama 3.1 8B | 8B | 128K | Best-in-class small model |
| Llama 3.1 70B | 70B | 128K | Rivals GPT-4o on many tasks |
| Llama 3.1 405B | 405B | 128K | Largest, near-frontier quality |
| Llama 3.2 1B / 3B | 1B, 3B | 128K | Edge deployment, text-only |
| Llama 3.2 11B / 90B | 11B, 90B | 128K | Vision-language models |
| Llama 3.3 70B | 70B | 128K | Improved from 3.1 70B |
Key strengths: Strong reasoning, code, multilingual (8 languages). The 70B model is a go-to workhorse for most applications.
License: Llama Community License — free for most use cases; Meta approval required for products with 700M+ MAU.
2.2 Mistral AI
Mistral produces highly efficient models that punch above their parameter count.
| Model | Parameters | Notes |
|---|---|---|
| Mistral 7B v0.3 | 7B | Original efficient model, instruction-tuned |
| Mistral NeMo 12B | 12B | Collaboration with NVIDIA, strong coding |
| Mistral Small 3 | 24B | Efficient commercial-grade model |
| Codestral | 22B | Specialized for code, 80+ languages |
| Mixtral 8x7B | ~47B (12.9B active) | Mixture of experts, fast inference |
| Mixtral 8x22B | 141B (39B active) | Best MoE general model |
License: Apache 2.0 for base models (permissive, commercial-friendly).
Mistral's key innovation: the Mixture of Experts (MoE) architecture activates only a subset of parameters per token, giving quality approaching much larger dense models at roughly 13B active-parameter inference cost.
2.3 Google Gemma
Gemma is Google's open-weight model series based on Gemini technology.
| Model | Parameters | Notes |
|---|---|---|
| Gemma 2 2B | 2B | Best-in-class at 2B |
| Gemma 2 9B | 9B | Beats Llama 3.1 8B on several benchmarks |
| Gemma 2 27B | 27B | Strong, efficient 27B model |
| CodeGemma 7B | 7B | Code-specialized |
| PaliGemma | 3B | Vision-language model |
License: Gemma Terms of Use — permissive but not Apache 2.0. Commercial use allowed.
2.4 Qwen (Alibaba)
The Qwen series from Alibaba Cloud has become a top-tier open-weight family.
| Model | Parameters | Notes |
|---|---|---|
| Qwen2.5 0.5B–72B | Multiple sizes | Strong multilingual (Chinese/English) |
| Qwen2.5-Coder 7B–32B | 7B, 32B | Excellent code generation |
| Qwen2.5-Math 7B–72B | 7B, 72B | Mathematical reasoning |
| QwQ 32B | 32B | Reasoning model, o1-style chain-of-thought |
| Qwen2-VL | 7B, 72B | Strong vision-language |
License: Apache 2.0 for most models.
Qwen models are particularly strong for Chinese language tasks and multilingual applications.
2.5 DeepSeek
DeepSeek has produced remarkable models with efficient training approaches.
| Model | Parameters | Notes |
|---|---|---|
| DeepSeek-V2 | 236B MoE (21B active) | Very cost-effective inference |
| DeepSeek-V3 | 671B MoE (37B active) | Near-GPT-4 quality, open weights |
| DeepSeek-R1 | Various sizes | Reasoning model with visible CoT |
| DeepSeek-Coder-V2 | 236B MoE | Strong code generation |
License: DeepSeek model license — permissive for most commercial uses.
DeepSeek's key achievement: training frontier-quality models at dramatically lower compute cost than competitors.
2.6 Other Notable Models
| Model | Organization | Strength |
|---|---|---|
| Phi-3 / Phi-4 | Microsoft | Strong at small sizes (3.8B–14B) |
| Command R+ | Cohere | Retrieval-augmented tasks |
| Falcon 180B | TII | Large open model |
| Yi 34B | 01.AI | Strong multilingual |
| Orca 3 | Microsoft | Instruction following |
| SOLAR 10.7B | Upstage | Korean/English bilingual |
3. Model Architecture Deep Dive
3.1 Grouped Query Attention (GQA)
Most modern LLMs use GQA instead of standard multi-head attention (MHA). GQA groups queries to share key-value heads, significantly reducing KV cache memory without meaningful quality loss.
```
Multi-Head Attention (MHA):     Q heads: 32   K heads: 32   V heads: 32
                                KV cache per layer: 2 × 32 × seq_len × d_head

Grouped Query Attention (GQA):  Q heads: 32   K heads: 8    V heads: 8
                                KV cache per layer: 2 × 8 × seq_len × d_head   (4× reduction)

Multi-Query Attention (MQA):    Q heads: 32   K heads: 1    V heads: 1
                                KV cache per layer: 2 × 1 × seq_len × d_head   (32× reduction, some quality loss)
```
3.2 Rotary Position Embeddings (RoPE)
RoPE encodes position information into query and key vectors through rotation, rather than adding positional embeddings to token embeddings. Key advantages:
- Extrapolates to longer sequences than seen during training
- Efficient relative position computation
- Can be extended with YaRN, LongRoPE for very long contexts
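A minimal numpy sketch (illustrative only, not a production implementation) shows the rotation and the key relative-position property: the dot product of a rotated query/key pair depends only on the distance between their positions, not the absolute positions.

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive dimension pairs of x by position-dependent angles."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # one frequency per dimension pair
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)

s1 = rope(q, 5) @ rope(k, 3)      # positions 5 and 3 (distance 2)
s2 = rope(q, 105) @ rope(k, 103)  # same distance, different absolute positions
print(np.isclose(s1, s2))  # True — attention scores depend only on relative position
```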
3.3 Mixture of Experts (MoE)
MoE replaces dense feed-forward layers with multiple "expert" networks and a router that selects a subset per token.
```python
# Simplified MoE layer concept (FeedForward is a standard MLP block)
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, num_experts=8, top_k=2, d_model=4096, d_ff=14336):
        super().__init__()
        self.experts = nn.ModuleList(
            FeedForward(d_model, d_ff) for _ in range(num_experts)
        )
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k  # only activate top_k experts per token

    def forward(self, x):                      # x: (num_tokens, d_model)
        # Router scores every expert; keep the top-k per token
        logits = self.router(x)
        weights, indices = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        # Weighted sum of the selected experts' outputs
        output = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e      # tokens routed to expert e
                if mask.any():
                    output[mask] += weights[mask, k, None] * expert(x[mask])
        return output
```
Mixtral 8x7B has 8 experts per layer, activating 2 per token. This gives it ~47B total parameters (attention layers are shared across experts, so less than 8 × 7B) but only ~12.9B active parameters during inference — much faster than a dense model of comparable total size.
3.4 KV Cache and Context Length
The KV (key-value) cache stores computed attention keys and values from previous tokens, enabling autoregressive generation without recomputing the entire sequence at each step.
KV cache memory:

```
KV cache size = 2 × num_layers × num_kv_heads × head_dim × seq_len × dtype_bytes

Example: Llama 3.1 8B, fp16, 128K context
       = 2 × 32 × 8 × 128 × 131072 × 2 bytes
       ≈ 17.2 GB just for the KV cache
```
This is why serving long-context requests is memory-intensive, and why quantized KV cache and sliding window attention matter.
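The formula can be wrapped in a small helper to sanity-check the numbers for any model:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    # 2× for keys and values
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128, fp16, 128K context
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131072)
print(f"{size / 1e9:.1f} GB")  # 17.2 GB
```

Note that without GQA (32 KV heads instead of 8) the same context would need 4× as much cache memory.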
4. Local Inference Tools
4.1 Ollama
Ollama is the easiest way to run LLMs locally. One command downloads and runs a model.
```shell
# Install (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Run a model interactively
ollama run llama3.1:8b

# Run a specific quantization
ollama run llama3.1:70b-instruct-q4_K_M

# Pull without running
ollama pull mistral:7b

# List installed models
ollama list

# Run as API server (OpenAI-compatible!)
ollama serve  # starts on localhost:11434
```
Ollama exposes an OpenAI-compatible API, so you can point any OpenAI SDK at it:
```python
from openai import OpenAI

# Point at local Ollama
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required but ignored
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Explain transformers briefly."}],
)
print(response.choices[0].message.content)
```
4.2 llama.cpp
llama.cpp is a C++ inference engine for GGUF-quantized models. It runs on CPU, GPU, or both, and is the engine powering Ollama under the hood.
```shell
# Build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_CUDA=ON   # CUDA GPU support
cmake --build build --config Release -j

# Download a GGUF model (example)
# From Hugging Face: bartowski/Meta-Llama-3.1-8B-Instruct-GGUF

# Run inference
./build/bin/llama-cli \
    -m models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    -n 512 \
    --prompt "Explain the attention mechanism in transformers."

# Run as OpenAI-compatible server
./build/bin/llama-server \
    -m models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -c 4096 \
    -ngl 35   # number of layers to offload to the GPU
```
4.3 Transformers (Hugging Face)
The Hugging Face transformers library provides the most flexible way to run models in Python:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # auto-splits across available GPUs
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the transformer architecture?"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        input_ids,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
4.4 Comparison of Local Inference Tools
| Tool | Best For | GPU Required | Setup Difficulty | Performance |
|---|---|---|---|---|
| Ollama | Developers, quick start | No (CPU works) | Very easy | Good |
| llama.cpp | CPU inference, embedding | Optional | Medium | Excellent |
| Transformers | Research, custom code | Recommended | Easy-Medium | Good |
| vLLM | Production serving | Yes | Medium | Excellent |
| TGI | Production serving | Yes | Medium | Excellent |
| ExLlamaV2 | High-throughput GPU | Yes | Medium-Hard | Excellent |
5. Production Serving with vLLM
5.1 Why vLLM?
vLLM is the leading open-source LLM serving engine for production use. Its key innovations:
- PagedAttention: Manages KV cache like OS virtual memory, dramatically increasing throughput.
- Continuous batching: Dynamically adds requests to running batches, maximizing GPU utilization.
- Tensor parallelism: Split model across multiple GPUs seamlessly.
- OpenAI-compatible API: Drop-in replacement for OpenAI API.
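The paging idea can be illustrated with a toy block allocator — a sketch of the concept only, not vLLM's actual implementation (the class and method names here are invented). Instead of reserving `max_seq_len` worth of cache per request up front, token slots are handed out in small blocks from a shared pool as generation proceeds, and returned when a request finishes:

```python
class ToyPagedKVCache:
    """Toy paged KV-cache allocator (conceptual sketch, not vLLM's code)."""

    def __init__(self, total_blocks: int, block_size: int = 16):
        self.block_size = block_size            # token slots per block
        self.free = list(range(total_blocks))   # shared pool of free block ids
        self.tables = {}                        # request_id -> list of block ids
        self.lengths = {}                       # request_id -> tokens so far

    def append_token(self, rid: str) -> None:
        n = self.lengths.get(rid, 0)
        if n % self.block_size == 0:            # current block full: grab a new one
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(rid, []).append(self.free.pop())
        self.lengths[rid] = n + 1

    def release(self, rid: str) -> None:
        self.free.extend(self.tables.pop(rid, []))  # blocks go back to the pool
        self.lengths.pop(rid, None)

cache = ToyPagedKVCache(total_blocks=8, block_size=16)
for _ in range(40):                 # a 40-token sequence: ceil(40/16) = 3 blocks
    cache.append_token("req-1")
print(len(cache.tables["req-1"]))   # 3
cache.release("req-1")
print(len(cache.free))              # 8 — all blocks reusable by other requests
```

The memory a request holds is always within one block of what it actually uses, which is what lets vLLM pack far more concurrent requests onto a GPU than fixed per-request allocation would.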
5.2 Starting a vLLM Server
```shell
# Install
pip install vllm

# Start server (single GPU)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --dtype bfloat16 \
    --max-model-len 32768 \
    --port 8000

# Multi-GPU with tensor parallelism (4 GPUs)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --dtype bfloat16 \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --port 8000

# With quantization (AWQ)
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-3-70B-Instruct-AWQ \
    --quantization awq \
    --dtype float16 \
    --tensor-parallel-size 2
```
5.3 Using the vLLM Server
Since vLLM exposes an OpenAI-compatible API, the client code is identical to OpenAI:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is PagedAttention?"}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain RAG in detail."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```
5.4 vLLM in Docker
```dockerfile
# Dockerfile
FROM vllm/vllm-openai:latest
ENV HUGGING_FACE_HUB_TOKEN=""
CMD ["--model", "meta-llama/Meta-Llama-3.1-8B-Instruct", \
     "--dtype", "bfloat16", \
     "--max-model-len", "16384"]
```

```shell
# Build the image, then run it with GPU access
docker build -t my-vllm-server .
docker run --gpus all \
    -p 8000:8000 \
    -e HUGGING_FACE_HUB_TOKEN=hf_xxx \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    my-vllm-server
```
6. Fine-Tuning with LoRA and QLoRA
6.1 Why Fine-Tune?
Fine-tuning is not always needed. Use prompt engineering and RAG first. Fine-tune when:
- You need a specific output format that is hard to prompt into existence
- Your domain has terminology or conventions not well-represented in the base model
- Latency matters and you want to bake instructions into the model weights
- You have thousands of high-quality examples and want consistent behavior
6.2 LoRA: Low-Rank Adaptation
LoRA freezes the pre-trained model weights and adds small trainable matrices (adapters) to each attention layer. This reduces trainable parameters by 1000× or more.
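The idea can be written out directly: the adapted layer computes y = Wx + (α/r)·BAx, where W stays frozen and only the low-rank matrices A and B are trained. A minimal torch version (illustrative only — `LoRALinear` is a made-up class, not peft's implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative)."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(trainable, total)  # 131072 trainable vs ~16.9M total for one 4096×4096 layer
```

Because B starts at zero, the adapted layer initially behaves exactly like the frozen base layer; training then learns the low-rank correction.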
```python
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,              # rank of the adaptation matrices
    lora_alpha=32,     # scaling factor
    lora_dropout=0.05,
    target_modules=[   # which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# Prints trainable vs. total parameter counts — with r=16 on all attention and
# MLP projections, roughly 0.5% of the 8B parameters are trainable.
```
6.3 QLoRA: Quantized LoRA
QLoRA combines 4-bit quantization with LoRA, enabling fine-tuning of 70B models on a single A100 GPU:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for k-bit training (casts layer norms to fp32, etc.)
model = prepare_model_for_kbit_training(model)

# Apply LoRA on top of the quantized model
peft_model = get_peft_model(model, lora_config)
```
6.4 Supervised Fine-Tuning with SFTTrainer
```python
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

dataset = load_dataset("your-org/your-dataset", split="train")

def format_example(example):
    return {
        "text": f"<|user|>\n{example['instruction']}\n<|assistant|>\n{example['output']}"
    }

dataset = dataset.map(format_example)

trainer = SFTTrainer(
    model=peft_model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="./checkpoints",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_ratio=0.03,
        learning_rate=2e-4,
        bf16=True,  # match the bfloat16 base model
        logging_steps=10,
        save_steps=100,
        dataset_text_field="text",
        max_seq_length=2048,
    ),
)

trainer.train()
trainer.model.save_pretrained("./fine-tuned-model")
```
6.5 Merging LoRA Adapters Back into the Base Model
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)

# Load and merge LoRA weights
model = PeftModel.from_pretrained(base_model, "./fine-tuned-model")
merged_model = model.merge_and_unload()

# Save merged model (standard HF format, works with vLLM etc.)
merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")
```
7. The Hugging Face Ecosystem
7.1 Hub: Discovering Models
The Hugging Face Hub hosts well over a million models. Key navigation tips:
```python
from huggingface_hub import list_models, model_info

# Search for models
models = list_models(
    filter="text-generation",
    sort="downloads",
    direction=-1,
    limit=10,
)
for m in models:
    print(m.id, m.downloads)

# Get model info
info = model_info("meta-llama/Meta-Llama-3.1-8B-Instruct")
print(info.tags)
print(info.cardData)
```
7.2 Model Cards and Licenses
Always read the model card before using a model in production. Check:
- License type (Apache 2.0, Llama Community, MIT, custom)
- Intended use and out-of-scope uses
- Known biases and limitations
- Evaluation results
7.3 GGUF Model Repositories
For llama.cpp / Ollama, look for GGUF quantizations from trusted quantizers:
- bartowski - High-quality GGUF models, multiple quant levels
- TheBloke - Large catalog, though less actively maintained now
- lmstudio-community - Curated for LM Studio
GGUF naming convention:

```
Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
└────────────┬────────────┘└──┬──┘
         model name     quantization level
```

Common quantization suffixes:

```
Q4_K_M - 4-bit, medium quality/speed tradeoff (recommended for most use cases)
Q5_K_M - 5-bit, better quality, more memory
Q6_K   - 6-bit, near-lossless, higher memory
Q8_0   - 8-bit, virtually lossless
Q2_K   - 2-bit, large quality loss, tiny memory footprint
IQ4_XS - 4-bit "importance" quantization; quality comparable to Q4_K_M at a smaller size
```
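Since the convention packs the model name and quant level into one filename, splitting it apart is a one-liner. The helper below is hypothetical (invented for illustration, not part of any library):

```python
import re

def parse_gguf_name(filename: str) -> dict:
    """Split a GGUF filename into model name and quantization suffix (hypothetical helper)."""
    m = re.match(r"(?P<model>.+?)-(?P<quant>I?Q\d\w*?)\.gguf$", filename)
    if not m:
        raise ValueError(f"not a recognizable GGUF name: {filename}")
    return m.groupdict()

info = parse_gguf_name("Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf")
print(info)  # {'model': 'Meta-Llama-3.1-8B-Instruct', 'quant': 'Q4_K_M'}
```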
7.4 Spaces: Running Models in the Browser
Hugging Face Spaces let you try models before downloading:
```python
# Call a Space via its API
from gradio_client import Client

client = Client("meta-llama/Llama-3.1-8B-Instruct")
result = client.predict(
    message="Explain the transformer attention mechanism",
    api_name="/chat",
)
print(result)
```
8. Quantization Techniques
8.1 Why Quantize?
Llama 3.1 70B in bfloat16 requires ~140 GB of GPU memory — impossible on a single consumer GPU. Quantization reduces memory and speeds up inference at the cost of minor quality degradation.
Memory comparison for Llama 3.1 70B:
| Precision | Memory | Quality | Use Case |
|---|---|---|---|
| bfloat16 | ~140 GB | Baseline | Multi-GPU A100 |
| int8 | ~70 GB | -0.1% | 1× A100 80GB |
| Q4_K_M (GGUF) | ~43 GB | -0.5% | 2× 24GB consumer GPUs |
| int4 (AWQ/GPTQ) | ~35 GB | -1.0% | 1× A100 40GB |
| Q2_K | ~24 GB | -5%+ | Not recommended for production |
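The table's figures follow from a simple rule of thumb: weights-only memory ≈ parameters × bits per weight / 8, before any overhead for activations and KV cache. A quick check (the helper is illustrative):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weights-only memory in decimal GB: params × bits / 8."""
    return params_billions * bits_per_weight / 8

for label, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: ~{weight_memory_gb(70, bits):.0f} GB")
# bf16: ~140 GB, int8: ~70 GB, int4: ~35 GB — matching the table.
# GGUF Q4_K_M lands a bit higher (~43 GB) because some tensors are kept
# at higher precision than 4 bits.
```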
8.2 GPTQ: Post-Training Quantization
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# One-time quantization (requires calibration data)
quantization_config = GPTQConfig(
    bits=4,
    dataset="c4",        # calibration dataset
    tokenizer=tokenizer,
    group_size=128,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)
model.save_pretrained("./llama-3.1-8b-gptq")
```
8.3 AWQ: Activation-aware Weight Quantization
AWQ is generally preferred over GPTQ for accuracy:
```shell
# Install
pip install autoawq
```

```python
# quantize_awq.py — one-time AWQ quantization
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3.1-8B-Instruct"
quant_path = "./llama-3.1-8b-awq"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print("Quantization complete")
```
8.4 bitsandbytes (BnB) for Training
For fine-tuning (especially QLoRA), bitsandbytes provides runtime quantization without a separate quantization step:
The configuration was already shown in the QLoRA section above: `BitsAndBytesConfig` with `load_in_4bit=True` and NF4 quantization (`bnb_4bit_quant_type="nf4"`) is the standard choice for QLoRA fine-tuning.
9. Choosing the Right Model
9.1 Decision Framework
```
                 ┌──────────────────────┐
                 │  What is your task?  │
                 └──────────┬───────────┘
                            │
        ┌───────────────────┼───────────────────┐
        ▼                   ▼                   ▼
 Code generation       Multilingual        General chat
 Qwen2.5-Coder         Qwen2.5 / Llama     Llama 3.1 8B
 DeepSeek-Coder        Mistral NeMo        Mistral 7B
 CodeGemma                                 Gemma 2 9B
                                                │
                                                ▼
                                         Do you need to
                                         run locally?
                                                │
                                          ┌─────┴─────┐
                                       Yes│           │No
                                          ▼           ▼
                                    Fit in RAM?    Use vLLM
                                          │        + Llama 70B
                                     ┌────┴────┐   or 405B
                                 ≤8GB│         │>8GB
                                     ▼         ▼
                                  8B Q4    70B Q4_K_M
```
9.2 Hardware Requirements
| Model | Min GPU Memory | Recommended | Quantization |
|---|---|---|---|
| Gemma 2 2B | 3 GB | RTX 3060 | fp16 |
| Llama 3.1 8B | 5 GB | RTX 3060 12GB | Q4_K_M |
| Mistral 7B | 5 GB | RTX 3060 12GB | Q4_K_M |
| Gemma 2 27B | 16 GB | RTX 3090 | Q4_K_M |
| Llama 3.1 70B | 40 GB | 2× A6000 | Q4_K_M |
| Mixtral 8x7B | 26 GB | A100 40GB | Q4_K_M |
| Llama 3.1 405B | 200 GB | 4× A100 80GB | Q4_K_M |
9.3 Benchmark-Based Selection (March 2026)
For production model selection, use LMSYS Chatbot Arena and the Open LLM Leaderboard as starting points, but always run your own domain-specific evals. Benchmark rankings change monthly as new models are released.
General guidance:
- Best small model (≤8B): Llama 3.1 8B or Gemma 2 9B
- Best mid-size (7B–30B): Llama 3.3 70B or Mistral Small 3
- Best open-weight overall: DeepSeek-V3 or Llama 3.1 405B
- Best for code: Qwen2.5-Coder 32B or DeepSeek-Coder-V2
- Best for reasoning: QwQ 32B or DeepSeek-R1
10. Open-Source vs Proprietary: When to Use What
10.1 Head-to-Head Comparison
| Dimension | Open-Source | Proprietary |
|---|---|---|
| Quality ceiling | Slightly lower (in 2026, gap is small) | Higher for cutting-edge tasks |
| Cost at scale | Low (hardware cost only) | High per-token |
| Data privacy | Full control | Data leaves your infra |
| Setup complexity | High | Low (API key only) |
| Customization | Full (fine-tuning, prompts) | Limited (prompts, some fine-tuning) |
| Reliability | Your responsibility | SLA from provider |
| Latency | Your infrastructure | Variable (shared) |
| Context window | Up to 128K+ | Up to 200K+ |
| Multimodal | Limited (best models are text-only) | Strong (GPT-4o, Claude 3.5) |
10.2 When to Choose Open-Source
Strongly prefer open-source when:
- Processing sensitive data (medical, legal, financial, PII)
- High-volume, cost-sensitive applications (>1M tokens/day)
- Compliance requires data residency
- You have a domain-specific fine-tuning need
- You want to avoid vendor lock-in
Strongly prefer proprietary when:
- Maximum out-of-the-box quality is critical
- You need the latest multimodal capabilities
- Team lacks MLOps expertise
- Building a prototype quickly (API is faster to start)
- Task requires very long context (>128K tokens)
10.3 Hybrid Strategy
Many production systems combine both:
```python
def route_request(request: dict) -> str:
    """Route to an open-source or proprietary model based on requirements."""
    # Always use open-source for sensitive data
    if request.get("contains_pii") or request.get("confidential"):
        return "local_llama_70b"
    # Use open-source for high-volume simple tasks
    if request.get("task_type") in ["classification", "extraction", "summarization"]:
        if request.get("volume") == "high":
            return "local_llama_8b"
    # Use proprietary for complex reasoning or multimodal
    if request.get("requires_vision") or request.get("complexity") == "high":
        return "gpt_4o"
    # Default: local model for cost control
    return "local_llama_70b"
```
Summary
The open-source LLM landscape in 2026 offers unprecedented capability:
| Layer | Top Choices |
|---|---|
| Small models (≤8B) | Llama 3.1 8B, Gemma 2 9B, Phi-4 |
| Mid-size (8–30B) | Mistral Small 3, Qwen2.5 32B, Gemma 2 27B |
| Large (70B+) | Llama 3.1 70B, Qwen2.5 72B |
| Frontier | DeepSeek-V3, Llama 3.1 405B |
| Code | Qwen2.5-Coder 32B, DeepSeek-Coder-V2 |
| Reasoning | DeepSeek-R1, QwQ 32B |
| Local inference | Ollama, llama.cpp |
| Production serving | vLLM, TGI |
| Fine-tuning | LoRA + SFTTrainer, QLoRA |
The most important shift from 2024 to 2026 has been the closing of the quality gap with proprietary models. For the vast majority of applications — RAG, chatbots, code generation, extraction — the top open-source models are entirely competitive with GPT-4 or Claude. The primary reasons to choose open-source have never been stronger: privacy, cost, and customization.
Knowledge Check Quiz
Q1. What is the difference between an "open-weight" model and a "truly open-source" model?
An open-weight model releases the model weights publicly but does not release the training data, training code, or full training methodology (examples: Llama 3, Mistral, Gemma). A truly open-source model releases weights, training code, and training data (examples: OLMo, Pythia). The distinction matters for reproducibility, research, and understanding potential data contamination.
Q2. What is Mixture of Experts (MoE) and why does it matter for inference efficiency?
MoE replaces dense feed-forward layers with multiple "expert" subnetworks and a router that selects only a small subset (e.g., 2 of 8 experts) per token. This means the model has many more total parameters than a dense model of the same inference cost. For example, Mixtral 8x7B has 56B total parameters but only ~12.9B active parameters per token, giving it near-70B quality at roughly 13B inference cost.
Q3. What is QLoRA and what hardware constraint does it solve?
QLoRA combines 4-bit quantization (via bitsandbytes NF4) with LoRA adapters. The base model is quantized to 4-bit, reducing its memory footprint by 4×, while trainable LoRA adapters remain in higher precision. This allows fine-tuning a 70B model on a single 80GB A100 GPU, which would be impossible with full-precision fine-tuning (which would require ~280 GB just for the model and optimizer states).
Q4. What is PagedAttention in vLLM and what problem does it solve?
PagedAttention manages the KV cache using a paging mechanism borrowed from operating system virtual memory management. Traditional LLM servers pre-allocate a fixed block of memory per request for the KV cache, wasting memory when sequences are shorter than the maximum. PagedAttention allocates memory in small pages as needed and can share pages between requests, dramatically increasing GPU memory utilization and enabling higher throughput on the same hardware.