- 1. Introduction to DeepSeek
- 2. DeepSeek-V1 / DeepSeek LLM (67B)
- 3. DeepSeek-V2 (236B): Birth of MLA and DeepSeekMoE
- 4. DeepSeek-V3 (671B, 37B Active): The Limits of Efficiency
- 5. DeepSeek-R1: Awakening Reasoning Through Reinforcement Learning
- 6. DeepSeek-Coder: Coding-Specialized Models
- 7. DeepSeek-VL / Janus: Vision-Language Models
- 8. MLA vs MHA vs GQA vs MQA Comparison
- 9. Industrial Impact
- 10. Limitations and Future Outlook
- References
1. Introduction to DeepSeek
1.1 Company Background: From Hedge Fund to AI Research Lab
DeepSeek was founded in July 2023 in Hangzhou, China. Its founder, Liang Wenfeng, is also the CEO of High-Flyer, a quant hedge fund established in 2016. High-Flyer grew into one of China's largest hedge funds through AI-based algorithmic trading.
High-Flyer had built large-scale GPU clusters for its trading operations. After announcing an AGI research lab in April 2023, it spun DeepSeek off as an independent entity in July of the same year. DeepSeek is thus an unusual AI research organization, born of a hedge fund's capital and GPU infrastructure.
1.2 Open-Source Philosophy
DeepSeek's most distinctive feature is its fully open-source strategy. DeepSeek releases all model weights under MIT License or commercially permissive licenses, and publishes detailed technical reports on arXiv for every model. This stands in stark contrast to the closed approaches of OpenAI and Anthropic, forming one of the two pillars of the open-source LLM ecosystem alongside Meta's Llama series.
1.3 Position in the Chinese AI Ecosystem
Among Chinese AI players (Baidu ERNIE, Alibaba Qwen, ByteDance Doubao, Zhipu AI GLM, Moonshot AI Kimi), DeepSeek distinguishes itself through:
- Focus on fundamental research over product launches
- Fully open-source: Most aggressive open-source strategy among Chinese AI companies
- Cost efficiency: Proven ability to train world-class models at a fraction of US costs
- Independent funding: Operating on High-Flyer's own capital without VC dependency
2. DeepSeek-V1 / DeepSeek LLM (67B)
2.1 Model Overview
DeepSeek's first foundation model, DeepSeek LLM, was released in 7B and 67B sizes in late 2023, with the technical report following on arXiv in January 2024.
| Item | DeepSeek LLM 7B | DeepSeek LLM 67B |
|---|---|---|
| Parameters | 7B | 67B |
| Training Data | 2T tokens | 2T tokens |
| Context Length | 4K | 4K |
| Architecture | Dense Transformer | Dense Transformer |
2.2 Scaling Law Research
The paper's most important contribution was independent scaling-law research: going beyond the Chinchilla scaling law, it presented new findings on how optimal batch size and learning rate scale with compute, and on how a fixed compute budget should be allocated between model scale and training data.
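The allocation result lends itself to a quick back-of-the-envelope calculation. The sketch below assumes the report's approximate exponents (about 0.524 for model scale and 0.476 for data); the proportionality constants are illustrative placeholders, not the paper's fitted coefficients.

```python
# Compute-optimal allocation in the DeepSeek LLM style: given a compute
# budget C (FLOPs), the optimal model scale grows ~C^0.524 and the optimal
# data size ~C^0.476 (approximate exponents from the report; constants here
# are illustrative, not fitted values).

def optimal_allocation(C: float, a: float = 0.524, b: float = 0.476):
    """Return (model_scale, data_tokens) proportional to C^a and C^b."""
    model_scale = C ** a   # optimal model size grows slightly faster
    data_tokens = C ** b   # than optimal data size as compute grows
    return model_scale, data_tokens

# Doubling compute multiplies the optimal model scale by 2^0.524 ~ 1.44
m1, d1 = optimal_allocation(1e20)
m2, d2 = optimal_allocation(2e20)
print(round(m2 / m1, 2))  # 1.44
print(round(d2 / d1, 2))  # 1.39
```

Because the model exponent slightly exceeds the data exponent, extra compute tilts toward larger models rather than more tokens, a mild departure from Chinchilla's roughly equal split.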
3. DeepSeek-V2 (236B): Birth of MLA and DeepSeekMoE
DeepSeek-V2, released in May 2024, introduced two of DeepSeek's most critical architectural innovations: Multi-head Latent Attention (MLA) and DeepSeekMoE.
3.1 Model Specifications
| Item | Value |
|---|---|
| Total Parameters | 236B |
| Active Parameters (per token) | 21B |
| Context Length | 128K |
| Training Data | 8.1T tokens |
| MoE Structure | 2 Shared Experts + 160 Routed Experts (6 active) |
| Attention | Multi-head Latent Attention (MLA) |
3.2 Multi-head Latent Attention (MLA)
MLA compresses KV Cache by 93.3% while achieving performance equal to or better than standard Multi-head Attention (MHA). This fundamentally addresses the KV Cache memory bottleneck in LLM inference.
Core Idea: Low-Rank KV Joint Compression
Instead of reducing KV Head count, MLA jointly compresses KV into a low-dimensional latent vector. During inference, only this small latent vector is cached, and it is decompressed to the original dimension when needed.
Step 1: Down-Projection (Compression). The hidden state $h_t$ is projected down to a low-dimensional latent vector:

$$c_t^{KV} = W^{DKV} h_t$$

Step 2: Up-Projection (Decompression). Keys and values are reconstructed from the latent when needed:

$$k_t^C = W^{UK} c_t^{KV}, \quad v_t^C = W^{UV} c_t^{KV}$$

The key point: what is stored in the cache is the compressed latent vector $c_t^{KV}$ (dimension $d_c$, far smaller than the $2 n_h d_h$ of full keys plus values), not the original K, V vectors.

Decoupled Rotary Position Embedding solves the challenge of integrating RoPE with the compressed latent: the position-dependent rotation cannot be absorbed into the up-projection matrices, so MLA carries position information in a separate small RoPE key vector $k_t^R = \mathrm{RoPE}(W^{KR} h_t)$, cached alongside $c_t^{KV}$.
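The compress-then-decompress path can be shown numerically. This is a minimal sketch with made-up dimensions (not DeepSeek-V2's actual sizes) and random matrices standing in for learned projections.

```python
import numpy as np

# Minimal numeric sketch of MLA's low-rank KV joint compression.
# All dimensions and weights are illustrative, not DeepSeek-V2's.
rng = np.random.default_rng(0)

d_model, n_heads, d_head = 1024, 16, 64   # hypothetical model dims
d_latent = 128                            # compressed KV latent dim

W_dkv = rng.standard_normal((d_model, d_latent)) * 0.02           # down-proj
W_uk = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # up-proj K
W_uv = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # up-proj V

h = rng.standard_normal((1, d_model))     # hidden state for one token

# Step 1 (cached): compress to the latent vector c_kv
c_kv = h @ W_dkv                  # shape (1, d_latent): all that is cached

# Step 2 (on the fly): reconstruct per-head keys and values
k = (c_kv @ W_uk).reshape(n_heads, d_head)
v = (c_kv @ W_uv).reshape(n_heads, d_head)

# Per-token cache: MHA stores full K and V, MLA stores only c_kv
mha_cache = 2 * n_heads * d_head  # 2048 values
mla_cache = d_latent              # 128 values
print(mla_cache / mha_cache)      # 0.0625 -> 16x smaller in this toy setup
```

The savings ratio depends entirely on how small $d_c$ can be made without hurting quality; DeepSeek-V2's reported 93.3% reduction reflects its particular choice of latent and RoPE-key dimensions.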
3.3 DeepSeekMoE: Fine-Grained Expert Segmentation and Shared Experts
Fine-Grained Expert Segmentation: Instead of N large experts with K active, DeepSeekMoE uses mN small experts with mK active. The total parameters and computation remain the same, but the combinations of active experts become vastly more diverse.
Traditional MoE: N=16 experts, K=2 active → C(16,2) = 120 combinations
DeepSeekMoE: mN=64 experts, mK=8 active → C(64,8) ≈ 4.4B combinations
Shared Expert Isolation: Some experts are designated as Shared Experts that are always active for all tokens, handling common knowledge like grammar and common sense. This frees Routed Experts to focus on their specialized domains.
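The combination counts above check out directly, and the shared-plus-routed selection can be illustrated with a toy gate (random weights standing in for a trained router; nothing here is DeepSeek's implementation).

```python
from math import comb
import numpy as np

# Expert-combination counts from the section above
print(comb(16, 2))   # 120 -- traditional MoE: 16 experts, 2 active
print(comb(64, 8))   # 4426165368 -- fine-grained: 64 experts, 8 active

# Toy sketch of DeepSeekMoE-style routing: shared experts always run,
# routed experts are picked top-k by a gate (random weights, illustrative).
rng = np.random.default_rng(0)
n_routed, k, d = 64, 8, 32
gate_W = rng.standard_normal((d, n_routed))

def route(h):
    scores = h @ gate_W
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                # softmax over routed experts
    topk = np.argsort(probs)[-k:]       # indices of the k routed experts
    return topk, probs[topk]

h = rng.standard_normal(d)
chosen, weights = route(h)
# output = shared_expert(h) + sum of weight * expert_i(h) over chosen experts
```

The point of the arithmetic: with the same parameter and compute budget, fine-grained segmentation multiplies the number of reachable expert combinations by seven orders of magnitude, giving the router far finer control over specialization.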
4. DeepSeek-V3 (671B, 37B Active): The Limits of Efficiency
4.1 Model Specifications
| Item | Value |
|---|---|
| Total Parameters | 671B |
| Active Parameters (per token) | 37B |
| Context Length | 128K |
| Training Data | 14.8T tokens |
| MoE Structure | 1 Shared Expert + 256 Routed Experts (8 active) |
| Training GPU Hours | 2.788M H800 GPU hours |
| Training Cost (est.) | $5.576M |
4.2 FP8 Mixed Precision Training
DeepSeek-V3 is the first publicly known large-scale model trained with FP8. Key innovations include unified E4M3 format, Fine-Grained Quantization (tile-wise and block-wise scaling), and High-Precision FP32 Accumulation.
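The tile-wise scaling idea can be simulated in a few lines. This toy models only the per-tile scale factor and the E4M3 dynamic range (max normal value 448), not bit-level FP8 rounding; the 128-element tile mirrors the report's 1x128 activation tiles.

```python
import numpy as np

# Toy simulation of fine-grained (tile-wise) scaling for FP8-style training.
# Real E4M3 has 4 exponent / 3 mantissa bits with a max normal value of 448;
# this sketch models only per-tile scaling and range clipping, not rounding.
E4M3_MAX = 448.0
TILE = 128

def quantize_tiles(x: np.ndarray):
    """Scale each contiguous tile so its absmax maps onto E4M3_MAX."""
    x = x.astype(np.float32)
    out = np.empty_like(x)
    scales = []
    for i in range(0, x.size, TILE):
        tile = x[i:i + TILE]
        s = max(np.abs(tile).max() / E4M3_MAX, 1e-12)  # per-tile scale
        out[i:i + TILE] = np.clip(tile / s, -E4M3_MAX, E4M3_MAX)
        scales.append(s)
    return out, np.array(scales, dtype=np.float32)

x = np.random.default_rng(0).standard_normal(512) * 10
q, scales = quantize_tiles(x)

# Dequantize with the stored per-tile scales, accumulating in float32:
# per-tile scaling keeps one outlier from flattening the dynamic range of
# every other tile, and high-precision accumulation absorbs rounding error.
x_hat = np.concatenate([q[i:i + TILE] * s
                        for i, s in zip(range(0, x.size, TILE), scales)])
print(np.allclose(x, x_hat, atol=1e-4))  # True: only scaling here, no rounding
```

With a single tensor-wide scale, one outlier would force every value into the bottom of the FP8 range; per-tile scales confine the damage to 128 elements, which is the essence of fine-grained quantization.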
4.3 Auxiliary-Loss-Free Load Balancing
Instead of using auxiliary loss functions (which face the dilemma of trading off model quality for balance), DeepSeek-V3 introduces Bias-Based Dynamic Balancing where each expert gets a dynamically adjusted bias term that affects only routing decisions, not the actual expert outputs.
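A hedged sketch of the mechanism: the bias is added to each expert's affinity score only when picking the top-k, while gating weights still come from the raw scores, so balancing never distorts the layer's output. All constants (expert count, update speed `gamma`) are illustrative, not the paper's values.

```python
import numpy as np

# Sketch of auxiliary-loss-free balancing: a per-expert bias steers top-k
# SELECTION only; gate weights applied to expert outputs use raw scores.
# After each step, overloaded experts' bias drops and underloaded rises.
n_experts, k, gamma = 8, 2, 0.01
pref = np.linspace(0.5, 0.0, n_experts)  # experts 0,1 intrinsically favored

def run(use_bias: bool) -> np.ndarray:
    rng = np.random.default_rng(0)
    bias = np.zeros(n_experts)
    loads = np.zeros(n_experts)
    for _ in range(2000):
        scores = pref + 0.2 * rng.random(n_experts)
        sel = np.argsort(scores + (bias if use_bias else 0.0))[-k:]
        gates = scores[sel] / scores[sel].sum()  # raw-score gating, undistorted
        loads[sel] += 1
        if use_bias:
            bias[loads > loads.mean()] -= gamma  # cool down overloaded experts
            bias[loads < loads.mean()] += gamma  # warm up underloaded experts
    return loads

print(run(True).std() < run(False).std())  # True: bias evens out expert load
```

Without the bias, the intrinsically favored experts absorb almost all traffic; with it, loads converge toward uniform while the output-side gating stays faithful to the router's learned affinities.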
4.4 Multi-Token Prediction (MTP)
DeepSeek-V3 adopts MTP as a training objective, predicting the next n tokens at each position. The implementation maintains causal chain ordering and can be used for Speculative Decoding during inference.
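The speculative-decoding use of MTP drafts can be sketched with deterministic stand-in models (nothing below is DeepSeek's implementation): a cheap head proposes several tokens, the full model verifies them in one pass, and the longest agreeing prefix is accepted.

```python
# Toy draft-and-verify loop. Both "models" are deterministic stand-ins.

def target_next(ctx):            # stand-in for the full model (ground truth)
    return (ctx[-1] * 2 + 1) % 7

def draft_next(ctx):             # stand-in MTP head: right most of the time
    guess = (ctx[-1] * 2 + 1) % 7
    return guess if len(ctx) % 5 else (guess + 1) % 7  # periodic miss

def speculative_step(ctx, k=4):
    """Draft k tokens, verify against the target, accept matching prefix."""
    draft, c = [], list(ctx)
    for _ in range(k):
        t = draft_next(c)
        draft.append(t)
        c.append(t)
    accepted, c = [], list(ctx)
    for t in draft:
        true_t = target_next(c)
        if t != true_t:
            accepted.append(true_t)  # target's token replaces the first miss
            break
        accepted.append(t)
        c.append(t)
    return ctx + accepted            # always advances by at least 1 token

seq = [1]
for _ in range(6):
    seq = speculative_step(seq)
print(len(seq) > 7)  # True: most steps accept several tokens at once
```

Every accepted token matches what the full model would have produced, so the output distribution is unchanged; the gain is purely in how many tokens each verification pass advances.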
4.5 Training Cost: The $5.6M Controversy
The officially reported cost was approximately $5.576M, derived from 2.788M H800 GPU-hours at an assumed rental rate of $2 per GPU-hour. However, this figure covers only the final training run and excludes architecture exploration and ablation runs, hardware purchase costs, and personnel costs; outside estimates of the full program cost run to $100M or more.
4.6 Benchmark Results
| Benchmark | DeepSeek-V3 | GPT-4o | Claude 3.5 Sonnet | Llama 3.1 405B |
|---|---|---|---|---|
| MMLU | 88.5 | 87.2 | 88.7 | 88.6 |
| HumanEval | 82.6 | 80.5 | 81.1 | 72.0 |
| MATH-500 | 90.2 | 76.6 | 78.3 | 73.8 |
| AIME 2024 | 39.2 | 15.7 | - | 23.3 |
| Codeforces | 51.6 | 23.0 | 17.5 | 21.0 |
4.7 API Cost Comparison
DeepSeek-V3 is approximately 9x cheaper than GPT-4o for both input and output, with comparable or better performance.
5. DeepSeek-R1: Awakening Reasoning Through Reinforcement Learning
5.1 R1-Zero: Remarkable Results of Pure RL
R1-Zero applied RL directly to the DeepSeek-V3 Base model without any Supervised Fine-Tuning, using only accuracy rewards and format rewards.
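A hedged sketch of what rule-based accuracy and format rewards look like; the tag names and scoring weights below are illustrative, not taken from the report.

```python
import re

# Sketch of R1-Zero-style rule-based rewards: a format reward (reasoning
# wrapped in think/answer tags) plus an accuracy reward (final answer
# matches the reference). Tag names and weights are illustrative.
FORMAT = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def reward(completion: str, reference: str) -> float:
    m = FORMAT.fullmatch(completion.strip())
    format_r = 1.0 if m else 0.0
    accuracy_r = 1.0 if m and m.group(1).strip() == reference else 0.0
    return format_r + accuracy_r

good = "<think>2+2 is 4</think> <answer>4</answer>"
print(reward(good, "4"))                  # 2.0
print(reward("<answer>4</answer>", "4"))  # 0.0: missing think tags
```

Because both rewards are checkable by rule, no learned reward model is needed, which removes reward hacking against a neural judge as a failure mode.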
"Aha Moment": During RL training, the model spontaneously developed advanced reasoning patterns including self-verification, self-reflection, dynamic strategy adaptation, and longer reasoning chains for difficult problems. These behaviors emerged without being explicitly taught.
5.2 Group Relative Policy Optimization (GRPO)
GRPO eliminates the Critic Model requirement of PPO by using group-relative rewards, reducing memory usage by nearly half while maintaining equal or better training stability.
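The group-relative advantage at GRPO's core is simple enough to show directly: sample a group of completions for one prompt, score each, and normalize rewards within the group instead of learning a separate value model.

```python
import numpy as np

# GRPO's group-relative advantage: normalize each sample's reward against
# the statistics of its own group; the group mean plays the role that
# PPO's learned value baseline would.
def group_advantages(rewards):
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)  # eps guards a zero-std group

rewards = [1.0, 0.0, 0.0, 1.0, 0.0]           # e.g. right/wrong per sample
adv = group_advantages(rewards)
print(adv.round(3))  # correct answers positive, incorrect negative
```

Dropping the critic removes an entire model's worth of activations and optimizer state from the RL loop, which is where the near-halving of memory comes from.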
5.3 DeepSeek-R1 Training Pipeline
The four-stage pipeline (Cold Start SFT -> Reasoning-focused RL -> Rejection Sampling + SFT -> Full-domain RL) preserves R1-Zero's reasoning ability while adding readability and general task capability.
5.4 Benchmark Results
| Benchmark | DeepSeek-R1 | OpenAI o1-1217 |
|---|---|---|
| AIME 2024 (Pass@1) | 79.8% | 79.2% |
| MATH-500 (Pass@1) | 97.3% | 96.4% |
| SWE-Bench Verified | 49.2% | 48.9% |
5.5 Distillation: Transferring Reasoning to Smaller Models
R1-Distill-Qwen-32B achieved 72.6% on AIME 2024, significantly outperforming OpenAI o1-mini (63.6%). Even the 1.5B model demonstrated math reasoning capabilities that sometimes surpassed GPT-4o.
6. DeepSeek-Coder: Coding-Specialized Models
DeepSeek-Coder V2 achieved 90.2% on HumanEval, surpassing GPT-4 Turbo, Claude 3 Opus, and Gemini 1.5 Pro, demonstrating that open-source models can break through the closed-model barrier in code intelligence.
7. DeepSeek-VL / Janus: Vision-Language Models
Janus unifies multimodal understanding and visual generation in a single model by decoupling the vision encoding paths: separate encoders serve understanding and generation, while a single unified Transformer processes both streams.
8. MLA vs MHA vs GQA vs MQA Comparison
| Method | KV Cache (Relative Size) | Performance Impact |
|---|---|---|
| MHA | 100% (baseline) | Maximum (baseline) |
| MQA | ~1.6% | Performance degradation |
| GQA (8 groups) | ~12.5% | Slight degradation |
| MLA | ~6.7% | No degradation / slight improvement |
MLA uses a smaller KV Cache than GQA while maintaining performance equal to or better than MHA. This is why MLA is considered revolutionary.
9. Industrial Impact
9.1 NVIDIA Stock Shock
On January 27, 2025, NVIDIA's stock dropped 17%, evaporating approximately $589B in market cap in a single day -- the largest single-day market cap loss for any stock in US stock market history.
9.2 Impact on US AI Policy
DeepSeek's success questioned the US technology sanctions strategy against China, demonstrating that H800 GPUs (with restricted performance compared to A100/H100) were sufficient to build world-class models.
10. Limitations and Future Outlook
Current Limitations: Safety/alignment concerns due to Chinese AI regulations, multilingual performance variance, inefficient lengthy reasoning chains, lack of real-time knowledge, and hallucination.
Future Outlook: MLA architecture adoption spreading, GRPO becoming the de facto standard for LLM RL, cost-efficient training becoming a new competitive axis, and multimodal integration through Janus/Janus-Pro evolving toward universal models.
References
- DeepSeek-AI. "DeepSeek LLM: Scaling Open-Source Language Models with Longtermism." arXiv:2401.02954, 2024.
- DeepSeek-AI. "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434, 2024.
- DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv:2412.19437, 2024.
- DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948, 2025.
- DeepSeek-AI. "DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence." arXiv:2406.11931, 2024.