2025 Open Source AI Models Showdown: DeepSeek R1 vs Llama 4 vs Qwen 3 vs Mistral
Author: Youngju Kim (@fjvbn20031)
- 1. The 2025 Open Source AI Battlefield
- 2. Model Profiles
- 3. Benchmark Showdown
- 4. License Comparison
- 5. Local Deployment Practical Guide
- 6. Cost Analysis: Cloud API vs Local vs Open Source API
- 7. Best Model by Use Case
- 8. Five 2025 Open Source AI Trends
- Practice Quiz
- 9. Practical Adoption Checklist
- References
1. The 2025 Open Source AI Battlefield
Until 2024, the AI market was dominated by OpenAI's GPT-4. But in 2025, the open-source camp launched a full-scale counteroffensive. The catalyst was DeepSeek from China.
From GPT-4 Dominance to Open Source Resurgence
In January 2025, DeepSeek R1 shattered the AI landscape. A 671B parameter MoE model released under the MIT license matched or exceeded GPT-4 on several benchmarks. The training cost was approximately 8.2 million USD, roughly 1/100th of GPT-4's estimated cost.
This shock triggered a chain reaction:
- Meta released Llama 4 Scout and Maverick, with Scout reaching an extraordinary 10M token context window
- Alibaba deployed the Qwen 3 series with a full lineup from 0.6B to 235B
- Mistral proved itself as Europe's champion with the 8x22B MoE model delivering best-in-class cost efficiency
Enterprise Adoption Surging
According to Red Hat's 2025 survey, enterprise adoption of open-source AI models increased by 82% year-over-year. Key drivers include:
- Data Sovereignty: Sensitive data never leaves internal infrastructure
- Cost Reduction: local inference can cost up to 50x less than per-token API pricing
- Customization: Domain-specific models through fine-tuning
- Vendor Independence: Reduced dependency on specific providers
The Big Four
| Organization | Country | Flagship Model | Strategy |
|---|---|---|---|
| DeepSeek | China | R1 (671B MoE) | MIT License + Pure RL Innovation |
| Meta | USA | Llama 4 Scout/Maverick | Ecosystem Dominance + Multimodal |
| Alibaba | China | Qwen 3 (235B MoE) | Full Lineup + Multilingual |
| Mistral | France | 8x22B (176B MoE) | European AI Sovereignty + Value |
2. Model Profiles
DeepSeek R1 (671B / 37B MoE)
DeepSeek R1 was the biggest event in open-source AI in 2025. Published in Nature, the model's core innovation was training reasoning capabilities using pure reinforcement learning (RL) alone.
Architecture:
- Total Parameters: 671B
- Active Parameters: 37B (only ~5.5% active during inference)
- Number of Experts: 256 (8 activated per token)
- Context Length: 128K tokens
- Training Cost: ~8.2M USD (1/100th of GPT-4)
Benchmark Results:
- AIME 2024: 79.8% (math olympiad level)
- MATH-500: 97.3%
- HumanEval: 92.7%
- MMLU: 90.8%
Training Methodology Innovation:
DeepSeek R1's greatest innovation lies in its training approach. Instead of the traditional supervised fine-tuning (SFT) approach, reasoning capabilities were developed through pure reinforcement learning alone. During this process, the model naturally acquired Chain-of-Thought reasoning, self-verification, and error correction abilities.
DeepSeek R1 Training Pipeline:
1. Base model pretraining (large-scale text data)
2. Pure RL training (GRPO algorithm)
- Reward: correctness only (no process rewards)
- Result: self-discovers reasoning strategies
3. Distillation -> transfer to smaller models (1.5B ~ 70B)
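The reward normalization at the heart of step 2 can be made concrete. GRPO scores a group of sampled responses per prompt and turns each raw reward into an advantage relative to the group's mean and standard deviation; the snippet below is an illustrative sketch of that idea, not DeepSeek's actual implementation.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages (GRPO-style sketch): each sampled
    response is scored against the mean/stdev of its own group."""
    mean = statistics.mean(rewards)
    stdev = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / stdev for r in rewards]

# One prompt, four sampled answers rewarded 1 (correct) or 0 (wrong):
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # [1.0, -1.0, -1.0, 1.0] -- correct answers get positive advantage
```

Because the baseline comes from the group itself, no separate value model is needed, which is part of why the correctness-only reward in step 2 is enough.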
License:
MIT License, the most permissive among all four models. Commercial use, modification, and redistribution are all unrestricted.
Llama 4 Scout (109B/17B) and Maverick (400B/17B)
Meta's Llama 4 ships in two variants, each targeting different use cases.
Scout Model (109B total / 17B active):
- Number of Experts: 16
- Context Length: 10M tokens (longest ever)
- Specialty: Optimized for efficient long-document processing
- Runs on a single H100 GPU
The Scout model's 10M token context is a breakthrough over existing models. This means processing thousands of pages of documents in a single pass.
Maverick Model (400B total / 17B active):
- Number of Experts: 128
- Shared Expert architecture for stable training
- Context Length: 1M tokens
- Natively multimodal (text + image)
Multimodal Capabilities:
Llama 4 was designed as multimodal from the ground up. The ability to process text and images simultaneously is built in, requiring no separate adapter.
Llama 4 Variant Comparison:
| | Scout | Maverick |
|---|---|---|
| Total Params | 109B | 400B |
| Active Params | 17B | 17B |
| Experts | 16 | 128 |
| Context | 10M tokens | 1M tokens |
| Multimodal | Yes | Yes |
| GPU (FP16) | 1x H100 | 8x H100 |
License:
Meta Custom License. Commercial use is permitted, but services exceeding 700 million monthly active users (MAU) require separate authorization from Meta.
Qwen 3 (0.6B ~ 235B)
Alibaba's Qwen 3 offers the broadest model lineup, ranging from 0.6B to 235B.
235B MoE Model (22B active):
- Apache 2.0 License
- Supports 29 languages (strongest in CJK)
- 1M token context
- "Thinking Mode" support: switches between thinking and non-thinking modes in a single model
Full Lineup:
Qwen 3 Model Lineup:
+-- Dense Models
| +-- Qwen3-0.6B (mobile/IoT)
| +-- Qwen3-1.7B (edge devices)
| +-- Qwen3-4B (local chatbot)
| +-- Qwen3-8B (general local)
| +-- Qwen3-14B (coding/analysis)
| +-- Qwen3-32B (high-performance local)
| +-- Qwen3-72B (enterprise)
+-- MoE Model
+-- Qwen3-235B (22B active, best performance)
Thinking Mode Innovation:
Qwen 3 supports two modes in a single model:
- Thinking mode: Performs step-by-step reasoning for complex math, coding, and logic problems
- Non-thinking mode: Responds quickly to simple questions
Users can switch modes using /think and /no_think tags, allowing them to balance cost and latency according to the situation.
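A minimal sketch of that switching: the helper below simply appends the mode tag described above to a user prompt. The tag names come from this article; how the tag is consumed depends on your serving runtime, so treat this as illustrative rather than the canonical API.

```python
def build_qwen_prompt(question: str, thinking: bool) -> str:
    """Append Qwen 3's mode-switch tag to a user prompt (sketch).
    /think enables step-by-step reasoning; /no_think asks for a fast answer."""
    tag = "/think" if thinking else "/no_think"
    return f"{question} {tag}"

print(build_qwen_prompt("Prove that sqrt(2) is irrational.", thinking=True))
print(build_qwen_prompt("What is the capital of France?", thinking=False))
```

In practice you would route hard math/coding questions through the thinking variant and keep everyday queries on the fast path to save latency and tokens.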
Multilingual Performance:
Supports 29 languages with dominant performance in CJK (Chinese, Japanese, Korean) languages. This is due to the inclusion of massive CJK corpora in the training data.
Mistral 8x22B (176B / 39B MoE)
Mistral, Europe's flagship AI company, is the value king of the group.
Architecture:
- Total Parameters: 176B
- Active Parameters: 39B (2 of 8 experts active)
- Apache 2.0 License
- 65K token context
Strengths:
- Delivers GPT-4-adjacent performance at 1/10th the cost
- Optimized for the major European languages (English, French, German, Italian, Spanish)
- Strong in function calling and JSON output
- Excellent code generation
Mistral 8x22B Expert Routing:
Input Token -> Gate Network -> Top-2 Expert Selection
  Expert 1 (active) ---+
  Expert 5 (active) ---+--> weighted sum --> Output
  Experts 2, 3, 4, 6, 7, 8 (inactive)
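The routing step can be sketched in a few lines: the gate ranks the experts, keeps the top two, and softmax-normalizes their scores into mixing weights. This is an illustrative toy with made-up logits, not Mistral's implementation.

```python
import math

def top2_route(gate_logits):
    """Pick the two highest-scoring experts and softmax-normalize their
    logits into mixing weights (illustrative top-2 gating)."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top2 = ranked[:2]
    exps = [math.exp(gate_logits[i]) for i in top2]
    total = sum(exps)
    return {i: e / total for i, e in zip(top2, exps)}

weights = top2_route([0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2])  # 8 experts
print(weights)  # experts 1 and 4 active; their weights sum to 1
```

Only the two selected experts run their feed-forward pass, which is why a 176B model pays the compute cost of roughly 39B parameters per token.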
Cornerstone of the European AI Ecosystem:
Mistral plays a pivotal role in European enterprise AI adoption through proactive EU AI Act compliance and data sovereignty guarantees. It also provides its own AI service through the Le Chat platform.
3. Benchmark Showdown
The table below compares key benchmark results for each model. All figures are based on official announcements and may not reflect identical testing conditions.
| Benchmark | DeepSeek R1 | Llama 4 Maverick | Qwen 3 235B | Mistral 8x22B | GPT-4o (ref) |
|---|---|---|---|---|---|
| MMLU | 90.8% | 88.2% | 89.5% | 84.0% | 88.7% |
| MMLU-Pro | 84.0% | 80.5% | 82.3% | 76.8% | 83.5% |
| HumanEval | 92.7% | 89.4% | 90.2% | 85.3% | 90.2% |
| MATH-500 | 97.3% | 85.6% | 90.8% | 78.5% | 86.8% |
| AIME 2024 | 79.8% | 52.3% | 68.5% | 42.1% | 55.6% |
| GSM8K | 97.1% | 95.8% | 96.5% | 93.2% | 96.1% |
| GPQA Diamond | 71.5% | 62.1% | 66.8% | 55.3% | 63.7% |
| Arena ELO | 1358 | 1340 | 1345 | 1280 | 1350 |
| MT-Bench | 9.3 | 9.1 | 9.2 | 8.7 | 9.2 |
Key Analysis:
- Math/Reasoning: Overwhelming dominance by DeepSeek R1. Leads competitors significantly on AIME and MATH-500
- Coding: DeepSeek R1 takes first place, Qwen 3 follows closely in second
- General Purpose: Llama 4 Maverick records high Arena ELO with well-balanced performance
- Cost Efficiency: Mistral 8x22B is the most efficient (best performance/cost ratio)
4. License Comparison
Licensing is one of the most critical factors when deploying open-source AI models in production.
| Attribute | DeepSeek R1 | Llama 4 | Qwen 3 | Mistral 8x22B |
|---|---|---|---|---|
| License | MIT | Meta Custom | Apache 2.0 | Apache 2.0 |
| Commercial Use | Unrestricted | Under 700M MAU | Unrestricted | Unrestricted |
| Fine-tuning | Free | Free | Free | Free |
| Redistribution | Free | Conditional | Free | Free |
| Distillation | Explicitly Allowed | Restricted | Allowed | Allowed |
| Output Ownership | User | User | User | User |
| Patent Protection | None | Yes | Yes (Apache) | Yes (Apache) |
| Restrictions | None | MAU limit, multimodal limits | None | None |
License Selection Guide:
- Maximum freedom: DeepSeek R1 (MIT) - no restrictions beyond preserving the license notice
- Patent protection needed: Qwen 3 or Mistral (Apache 2.0) - includes patent retaliation clause
- Large-scale services: Avoid Llama 4 (beware of 700M MAU limit)
- Distillation purposes: DeepSeek R1 most clearly permits this
5. Local Deployment Practical Guide
5.1 Getting Started with Ollama (Easiest)
Ollama is the simplest way to run LLMs locally.
Installation:
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download installer from the official website
Running Models:
# DeepSeek R1 (various sizes)
ollama run deepseek-r1:1.5b # minimum specs, 2GB RAM
ollama run deepseek-r1:7b # general use, 8GB RAM
ollama run deepseek-r1:14b # recommended, 16GB RAM
ollama run deepseek-r1:32b # high performance, 32GB RAM
ollama run deepseek-r1:70b # maximum performance, 64GB RAM
# Llama 4 Scout
ollama run llama4-scout:17b
# Qwen 3
ollama run qwen3:8b
ollama run qwen3:14b
ollama run qwen3:32b
ollama run qwen3:72b
# Mistral
ollama run mistral:8x22b
API Server Mode:
# Start default server (port 11434)
ollama serve
# Call API from another process
curl http://localhost:11434/api/generate -d '{
"model": "deepseek-r1:14b",
"prompt": "Implement quicksort in Python"
}'
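The same call works from Python with nothing beyond the standard library. The sketch below builds the identical JSON body as the curl example (adding `stream: false` so Ollama returns one complete response instead of a token stream); the model tag and prompt are just examples.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default `ollama serve` port

def build_payload(model: str, prompt: str) -> bytes:
    """JSON body matching the curl example above."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(model: str, prompt: str) -> str:
    """POST to a locally running Ollama server and return the response text."""
    req = urllib.request.Request(OLLAMA_URL, data=build_payload(model, prompt),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires `ollama serve` to be running:
# print(generate("deepseek-r1:14b", "Implement quicksort in Python"))
```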
5.2 llama.cpp + GGUF Quantized Deployment
When you need finer control, use llama.cpp directly.
Build:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON # For NVIDIA GPU
cmake --build build --config Release
Quantization Options:
| Quantization | Bits | Model Size (7B) | Quality Loss | Speed | Recommended For |
|---|---|---|---|---|---|
| FP16 | 16-bit | 14GB | None | Baseline | Plenty of VRAM |
| Q8_0 | 8-bit | 7GB | Minimal | Fast | Performance first |
| Q5_K_M | 5-bit | 5GB | Negligible | Fast | Balanced choice |
| Q4_K_M | 4-bit | 4GB | Small | Very fast | Limited VRAM |
| Q3_K_M | 3-bit | 3.5GB | Noticeable | Very fast | Extreme savings |
| Q2_K | 2-bit | 2.8GB | Significant | Fastest | Experimental only |
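The sizes in the table follow roughly from parameters x bits / 8. A quick estimator (a sketch; K-quants land slightly above their nominal bit-width because embeddings and some layers stay at higher precision, so ~4.5 effective bits is used for Q4_K_M here):

```python
def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough on-disk model size in GB: parameters x bits / 8."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(round(approx_size_gb(7, 16), 1))   # FP16 7B   -> 14.0 GB
print(round(approx_size_gb(7, 4.5), 1))  # Q4_K_M 7B -> ~3.9 GB, close to the 4 GB above
```

The same arithmetic lets you sanity-check whether a given quantization of a larger model will fit your VRAM before downloading it.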
Running Example:
# Download GGUF model (Hugging Face)
# e.g., DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf
# Run
./build/bin/llama-cli \
-m DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf \
-c 4096 \
-ngl 99 \
--temp 0.6 \
-p "Explain how to set up a Redis cluster with Docker Compose"
5.3 Production Serving with vLLM
For production environments, vLLM is optimal.
# Install vLLM
pip install vllm
# Start server
python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
--tensor-parallel-size 2 \
--max-model-len 8192 \
--port 8000
vLLM advantages:
- PagedAttention for maximum memory efficiency
- Continuous Batching for throughput optimization
- OpenAI-compatible API
- Automatic tensor parallel processing support
# Call via OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
"messages": [
{"role": "user", "content": "Write a Kubernetes CronJob manifest"}
],
"temperature": 0.7,
"max_tokens": 2048
}'
5.4 Hardware Requirements
| Model | VRAM (FP16) | VRAM (Q4_K_M) | Recommended GPU | Approx. Cost |
|---|---|---|---|---|
| DeepSeek R1 7B | 14GB | 4GB | RTX 4060 Ti | ~400 USD |
| DeepSeek R1 14B | 28GB | 8GB | RTX 4070 Ti | ~800 USD |
| DeepSeek R1 32B | 64GB | 18GB | RTX 4090 | ~1,600 USD |
| DeepSeek R1 70B | 140GB | 40GB | 2x RTX 4090 | ~3,200 USD |
| Llama 4 Scout | 218GB | 62GB | 3x RTX 4090 | ~4,800 USD |
| Qwen 3 72B | 144GB | 42GB | 2x RTX 4090 | ~3,200 USD |
| Qwen 3 235B | 470GB | 135GB | 8x H100 | ~250,000 USD |
| Mistral 8x22B | 352GB | 100GB | 4x H100 | ~125,000 USD |
For individual users, 7B-14B quantized models are the practical choice. An RTX 4060 Ti 16GB or M-series Mac is sufficient.
6. Cost Analysis: Cloud API vs Local vs Open Source API
6.1 Cloud API Cost Comparison (Per 1M Tokens)
| Provider | Model | Input Price | Output Price | Notes |
|---|---|---|---|---|
| OpenAI | GPT-4o | 2.50 USD | 10.00 USD | Top performance, high cost |
| OpenAI | GPT-4o-mini | 0.15 USD | 0.60 USD | Value option |
| Anthropic | Claude 3.5 Sonnet | 3.00 USD | 15.00 USD | Best at coding |
| Google | Gemini 1.5 Pro | 1.25 USD | 5.00 USD | Long context |
| DeepSeek | DeepSeek R1 | 0.14 USD | 0.28 USD | Price disruptor |
| Alibaba | Qwen 3 235B | 0.24 USD | 0.48 USD | CJK optimized |
| Mistral | 8x22B | 0.20 USD | 0.60 USD | European servers |
DeepSeek's API is approximately 18x cheaper on input and 36x cheaper on output compared to GPT-4o.
6.2 Local Execution Cost Analysis
Initial Investment:
| Tier | Hardware | Price | Runnable Models |
|---|---|---|---|
| Entry | RTX 4060 Ti 16GB | ~400 USD | 7B-14B (Q4) |
| Mid | RTX 4090 24GB | ~1,600 USD | 14B-32B (Q4) |
| High | 2x RTX 4090 | ~3,200 USD | 70B (Q4) |
| Expert | NVIDIA DGX Spark | ~3,999 USD | 70B+ (quantized) |
| Production | 8x H100 | ~250,000 USD | 235B+ (FP16) |
Break-Even Calculation:
Scenario: 1M input + 1M output tokens per day:
GPT-4o API monthly cost: (2.50 + 10.00) x 30 = 375 USD/month
DeepSeek API monthly cost: (0.14 + 0.28) x 30 = 12.60 USD/month
Local RTX 4090 (electricity only): ~15 USD/month
RTX 4090 break-even vs GPT-4o: ~4-6 months (GPU cost alone; longer counting the full system)
RTX 4090 break-even vs DeepSeek API: never -- electricity alone already exceeds the API bill
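A small calculator makes the comparison reusable. Prices come from the table in 6.1; the scenario is read as 1M input plus 1M output tokens per day, and the ~15 USD/month electricity figure is the estimate above.

```python
def monthly_api_cost(in_price, out_price, m_in_per_day=1.0, m_out_per_day=1.0, days=30):
    """USD/month at the given per-1M-token prices."""
    return (in_price * m_in_per_day + out_price * m_out_per_day) * days

def breakeven_months(hardware_usd, api_monthly, electricity_monthly=15.0):
    """Months until hardware pays for itself vs. an API bill;
    None means electricity alone already exceeds the API cost."""
    saving = api_monthly - electricity_monthly
    return hardware_usd / saving if saving > 0 else None

gpt4o = monthly_api_cost(2.50, 10.00)    # 375.0 USD/month
deepseek = monthly_api_cost(0.14, 0.28)  # ~12.6 USD/month
print(breakeven_months(1600, gpt4o))     # ~4.4 months for a 1,600 USD RTX 4090
print(breakeven_months(1600, deepseek))  # None: no cost break-even vs DeepSeek
```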
Conclusion: DeepSeek's API is already so affordable that individual users choosing local deployment do so primarily for privacy and offline access, not cost savings.
6.3 Cost Optimization Strategies
- Hybrid Approach: Local for sensitive data, API for general tasks
- Model Size Optimization: Not every task needs the largest model
- Quantization: Q4_K_M delivers sufficient performance for most tasks
- Caching: Cache results for frequently used prompts
- Batch Processing: Non-real-time tasks can be batched for cost reduction
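The caching strategy above can start as something very simple: an exact-match cache keyed on a hash of the model and prompt, so repeated identical requests never hit the model twice. A minimal sketch (the `fake_llm` stand-in is hypothetical; swap in a real API call):

```python
import hashlib

class PromptCache:
    """Exact-match response cache: identical (model, prompt) pairs reuse
    the stored answer instead of paying for a second model call."""
    def __init__(self):
        self._store = {}
        self.hits = 0

    def _key(self, model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get_or_call(self, model, prompt, call):
        k = self._key(model, prompt)
        if k in self._store:
            self.hits += 1
        else:
            self._store[k] = call(prompt)
        return self._store[k]

cache = PromptCache()
fake_llm = lambda p: f"answer to: {p}"  # stand-in for a real API/local call
cache.get_or_call("qwen3:14b", "What is MoE?", fake_llm)
cache.get_or_call("qwen3:14b", "What is MoE?", fake_llm)  # served from cache
print(cache.hits)  # 1
```

Production systems usually add a TTL and semantic (embedding-based) matching on top, but even exact matching pays off for repeated system prompts and FAQ-style queries.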
7. Best Model by Use Case
7.1 Comprehensive Recommendation Table
| Use Case | Top Pick | Runner-Up | Reason |
|---|---|---|---|
| Coding | DeepSeek R1 | Qwen 3 72B | HumanEval 92.7%, strongest code reasoning |
| Multilingual (CJK) | Qwen 3 235B | DeepSeek R1 | 29 languages, best CJK performance |
| General Chat | Llama 4 Maverick | Qwen 3 235B | Meta ecosystem, high Arena ELO |
| Value | Mistral 8x22B | DeepSeek R1 | Best performance-to-cost ratio |
| Math/Reasoning | DeepSeek R1 | Qwen 3 (Thinking) | AIME 79.8%, published in Nature |
| Long Document | Llama 4 Scout | Qwen 3 235B | 10M context, efficient processing |
| Mobile/Edge | Qwen 3 0.6B-4B | DeepSeek R1 1.5B | Ultra-light, on-device execution |
| EU Compliance | Mistral 8x22B | Qwen 3 | EU AI Act ready, European data centers |
| Multimodal | Llama 4 Maverick | Qwen 3 VL | Native multimodal support |
| RAG Pipeline | Qwen 3 14B | DeepSeek R1 14B | Balanced performance/cost |
7.2 Scenario-Based Detailed Guide
Startup (Budget Constrained):
Recommended Stack:
- Development: DeepSeek R1 API (under 50 USD/month)
- Production: Qwen 3 14B on RTX 4090 (local)
- Rationale: Maximum performance at minimum cost
Enterprise (Regulatory Compliance Required):
Recommended Stack:
- Internal Documents: Qwen 3 72B on private cloud
- Customer Service: Llama 4 Maverick via API
- Analytics: DeepSeek R1 (MIT license = minimal legal risk)
Individual Developer:
Recommended Stack:
- Coding Assistant: DeepSeek R1 14B (Ollama, local)
- General Questions: DeepSeek API (cheapest)
- Learning: Qwen 3 8B (free, local, multilingual)
8. Five 2025 Open Source AI Trends
Trend 1: MoE Becomes the Default Architecture
Among major models released in 2025, 3 out of 4 adopted MoE architecture. This is no coincidence.
MoE advantages:
- Efficiency: Only 5-20% of total parameters are active, reducing inference cost
- Scalability: Performance improves by adding more experts
- Specialization: Each expert specializes in specific domains
Dense models (all parameters always active) are increasingly reserved for smaller models only.
Trend 2: The License Wars -- MIT vs Apache vs Meta Custom
| License | Champions | Philosophy |
|---|---|---|
| MIT | DeepSeek | Total freedom, no restrictions |
| Apache 2.0 | Alibaba, Mistral | Freedom + patent protection |
| Meta Custom | Meta | Freedom, but large-service limits |
DeepSeek's adoption of the MIT license sent shockwaves through the industry. It reignited debates about the true definition of "open source" and raised questions about whether Meta's license can legitimately be called "open source."
Trend 3: The Small Model Rebellion
One of 2025's most surprising findings was that a well-trained 8B model surpasses 2023's GPT-4V on some benchmarks.
This is thanks to:
- Data Quality Improvements: Quality over quantity in training data
- Distillation Technology: Efficient knowledge transfer from large models
- Architecture Improvements: Efficient techniques like GQA and SWA
- Shared Training Recipes: Community-driven optimization know-how
Trend 4: Distillation Technology Matures
DeepSeek R1's distillation model series (1.5B-70B) demonstrates the maturity of distillation technology.
Distillation Pipeline Example:
DeepSeek R1 671B (Teacher Model)
| distill
DeepSeek R1 Distill 70B (retains 85% performance)
| distill
DeepSeek R1 Distill 14B (retains 75% performance)
| distill
DeepSeek R1 Distill 1.5B (retains 60% performance)
The key to distillation is transferring the teacher model's "thought process" to the student model. In DeepSeek R1's case, reasoning abilities acquired through pure RL are passed to smaller models via distillation.
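One standard way to transfer that "thought process" is to train the student against the teacher's temperature-softened output distribution. The snippet below is a pure-Python sketch of that soft-label cross-entropy on toy logits; real pipelines (DeepSeek's included) operate on full vocabularies and often mix in a hard-label term.

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the
    teacher's: minimized when the student mimics the teacher exactly."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

aligned = distill_loss([4.0, 1.0, 0.5], [4.0, 1.0, 0.5])
off = distill_loss([4.0, 1.0, 0.5], [0.5, 1.0, 4.0])
print(aligned < off)  # True: matching the teacher minimizes the loss
```

The temperature exposes the teacher's relative preferences among wrong answers ("dark knowledge"), which is what lets a 14B student inherit reasoning habits a hard label alone would hide.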
Trend 5: The Rise of Chinese Models
Among the 2025 open-source AI Big Four, 2 are Chinese models (DeepSeek, Qwen). This carries several important implications:
- Technological Self-Reliance: Competitive models developed despite US chip export restrictions
- Cost Innovation: DeepSeek's 8.2M USD training cost shocked the industry
- Open Source Strategy: MIT/Apache licensing to capture the global developer ecosystem
- Geopolitical Implications: New discussions about AI technology polarization and cooperation
Practice Quiz
Test your knowledge with these questions.
Question 1: What are DeepSeek R1's total and active parameter counts?
Answer: Total 671B, Active 37B
DeepSeek R1 activates 8 out of 256 experts per token, using approximately 37B parameters. This represents about 5.5% of the total.
Question 2: What is Llama 4 Scout's maximum context length?
Answer: 10M (10 million) tokens
This is the longest context among open-source models as of 2025. It can process thousands of pages of documents in a single pass.
Question 3: Which of the four models uses the most permissive license?
Answer: DeepSeek R1 (MIT License)
The MIT License places no restrictions on commercial use, modification, or redistribution. Apache 2.0 includes a patent retaliation clause, and Meta Custom has a 700M MAU limit.
Question 4: What is the difference between Qwen 3's Thinking Mode and Non-thinking Mode?
Answer: Thinking Mode performs step-by-step Chain-of-Thought reasoning for complex problems, yielding higher accuracy but slower responses. Non-thinking Mode responds quickly to simple questions. Users can switch modes using tags.
Supporting two modes in a single model is Qwen 3's core innovation.
Question 5: What is the easiest tool for running LLMs locally, and how much RAM is needed for DeepSeek R1 14B at minimum?
Answer: Ollama, approximately 16GB RAM
Ollama lets you run LLMs with a single command. Running DeepSeek R1 14B with Q4_K_M quantization requires about 8GB VRAM, but at least 16GB system RAM is recommended.
9. Practical Adoption Checklist
Here is the essential checklist for deploying open-source AI models in production environments.
Pre-Adoption Evaluation
Technical Requirements:
- Does the model's VRAM requirement match your available hardware?
- Does it support the context length you need?
- Can it meet your latency requirements?
- Does it adequately support the languages you need?
Business Requirements:
- Is the license compatible with your commercial use case?
- Does it meet your data privacy requirements?
- Can you guarantee the necessary SLA (Service Level Agreement)?
- Do you have a long-term maintenance plan?
Operational Requirements:
- Is a monitoring system in place?
- Do you have a fallback strategy for outages?
- Is a model update pipeline designed?
- Is there a security audit process?
Phased Adoption Roadmap
Phase 1: PoC (2-4 weeks)
+-- Define use cases
+-- Select candidate models (2-3)
+-- Run benchmark tests
+-- Cost analysis
Phase 2: Pilot (4-8 weeks)
+-- Deploy to small team
+-- Monitor performance
+-- Collect feedback
+-- Evaluate fine-tuning needs
Phase 3: Production (8-12 weeks)
+-- Build infrastructure
+-- CI/CD pipeline
+-- Monitoring dashboard
+-- Documentation
Phase 4: Optimization (Ongoing)
+-- Cost optimization
+-- Performance tuning
+-- Model upgrades
+-- Team capability building
Common Mistakes and Solutions
Mistake 1: Choosing the largest model from the start
Solution: Start small and scale up incrementally. In many cases, a 14B model is sufficient.
Mistake 2: Underestimating quantization quality
Solution: Q4_K_M delivers nearly identical results to FP16 for most use cases. Always validate with benchmarks.
Mistake 3: Insisting on either API or local deployment exclusively
Solution: Adopt a hybrid approach. Local for sensitive data, API for bulk processing is optimal.
Mistake 4: Insufficient license review
Solution: Always review the license with your legal team before adoption. Llama 4's MAU limit in particular can become a constraint for growing services.
Mistake 5: Production deployment without monitoring
Solution: Build a system to monitor response quality, latency, and error rates in real time.
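As a starting point for that monitoring, a rolling window over the last N requests already gives you the three signals named above. A minimal sketch (real deployments would feed this into Prometheus/Grafana or similar):

```python
from collections import deque

class RollingMonitor:
    """Track latency and error rate over the last N requests."""
    def __init__(self, window: int = 100):
        self.latencies = deque(maxlen=window)
        self.errors = deque(maxlen=window)

    def record(self, latency_s: float, ok: bool):
        self.latencies.append(latency_s)
        self.errors.append(0 if ok else 1)

    def p95_latency(self) -> float:
        ordered = sorted(self.latencies)
        return ordered[int(0.95 * (len(ordered) - 1))]

    def error_rate(self) -> float:
        return sum(self.errors) / len(self.errors)

mon = RollingMonitor(window=5)
for lat, ok in [(0.8, True), (1.1, True), (0.9, True), (4.0, False), (1.0, True)]:
    mon.record(lat, ok)
print(mon.error_rate())   # 0.2
print(mon.p95_latency())  # 1.1
```

Alert on sustained jumps in either number; response-quality monitoring (e.g. sampled human or LLM-as-judge review) sits on top of this.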
References
- DeepSeek R1 Technical Report - "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025)
- Meta AI - "Llama 4: Open Foundation Models" Official Blog (2025)
- Alibaba Cloud - "Qwen3 Technical Report" (2025)
- Mistral AI - "Mixtral 8x22B: A Sparse Mixture of Experts" (2024)
- Red Hat - "The State of Enterprise Open Source AI 2025" Report
- Nature - "Reinforcement Learning for Language Model Reasoning" (2025)
- Ollama Official Documentation - ollama.com/docs
- llama.cpp GitHub Repository - github.com/ggml-org/llama.cpp
- vLLM Official Documentation - docs.vllm.ai
- Hugging Face Open LLM Leaderboard (2025)
- LMSYS Chatbot Arena Leaderboard (2025)
- "The Economics of Open Source AI" - a16z Research (2025)
- EU AI Act Official Document - Commission Regulation (EU) 2024/1689
- "Scaling Laws for Mixture of Experts" - arXiv (2025)
- NVIDIA DGX Spark Specifications - nvidia.com/dgx-spark
- "Distillation of Reasoning: From Large to Small Language Models" (2025)
- Alibaba DAMO Academy - "Multilingual LLM Benchmark Suite" (2025)