DeepSeek Model Complete Analysis: From MLA and MoE to RL-Based Reasoning — Everything About the Chinese Open-Source LLM Innovation

1. Introduction to DeepSeek

1.1 Company Background: From Hedge Fund to AI Research Lab

DeepSeek was founded in July 2023 in Hangzhou, China. Its founder, Liang Wenfeng, is also the CEO of High-Flyer, a quant hedge fund established in 2016. High-Flyer grew into one of China's largest hedge funds through AI-based algorithmic trading.

High-Flyer possessed large-scale GPU clusters for trading, and after declaring the establishment of an AGI research lab in April 2023, spun off DeepSeek as an independent entity in July of the same year. DeepSeek is thus a unique AI research organization born from a hedge fund's financial power and GPU infrastructure.

1.2 Open-Source Philosophy

DeepSeek's most distinctive feature is its fully open-source strategy. DeepSeek releases all model weights under MIT License or commercially permissive licenses, and publishes detailed technical reports on arXiv for every model. This stands in stark contrast to the closed approaches of OpenAI and Anthropic, forming one of the two pillars of the open-source LLM ecosystem alongside Meta's Llama series.

1.3 Position in the Chinese AI Ecosystem

Among Chinese AI players (Baidu ERNIE, Alibaba Qwen, ByteDance Doubao, Zhipu AI GLM, Moonshot AI Kimi), DeepSeek distinguishes itself through:

  • Focus on fundamental research over product launches
  • Fully open-source: Most aggressive open-source strategy among Chinese AI companies
  • Cost efficiency: Proven ability to train world-class models at a fraction of US costs
  • Independent funding: Operating on High-Flyer's own capital without VC dependency

2. DeepSeek-V1 / DeepSeek LLM (67B)

2.1 Model Overview

DeepSeek's first foundation model, DeepSeek LLM, was released in January 2024 in 7B and 67B sizes.

| Item | DeepSeek LLM 7B | DeepSeek LLM 67B |
| --- | --- | --- |
| Parameters | 7B | 67B |
| Training Data | 2T tokens | 2T tokens |
| Context Length | 4K | 4K |
| Architecture | Dense Transformer | Dense Transformer |

2.2 Scaling Law Research

The paper's most important contribution was its independent scaling-law study, which went beyond the Chinchilla scaling law to present new findings on batch-size scaling, learning-rate scaling, and data-model allocation strategies.


3. DeepSeek-V2 (236B): Birth of MLA and DeepSeekMoE

DeepSeek-V2, released in May 2024, introduced two of DeepSeek's most critical architectural innovations: Multi-head Latent Attention (MLA) and DeepSeekMoE.

3.1 Model Specifications

| Item | Value |
| --- | --- |
| Total Parameters | 236B |
| Active Parameters (per token) | 21B |
| Context Length | 128K |
| Training Data | 8.1T tokens |
| MoE Structure | 2 Shared Experts + 160 Routed Experts (6 active) |
| Attention | Multi-head Latent Attention (MLA) |

3.2 Multi-head Latent Attention (MLA)

MLA compresses KV Cache by 93.3% while achieving performance equal to or better than standard Multi-head Attention (MHA). This fundamentally addresses the KV Cache memory bottleneck in LLM inference.

Core Idea: Low-Rank KV Joint Compression

Instead of reducing KV Head count, MLA jointly compresses KV into a low-dimensional latent vector. During inference, only this small latent vector is cached, and it is decompressed to the original dimension when needed.

Step 1: Down-Projection (Compression)

$$c_t^{KV} = W^{DKV} h_t$$

Step 2: Up-Projection (Decompression)

$$k_t^C = W^{UK} c_t^{KV}, \quad v_t^C = W^{UV} c_t^{KV}$$

The key point: what is stored in the cache is the compressed latent vector $c_t^{KV}$, not the original K and V vectors.

Decoupled Rotary Position Embedding solves the challenge of integrating RoPE with the compressed latent vectors by using a separate small RoPE key vector $k_t^R$.
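The two projection steps can be sketched in plain NumPy. All dimensions and initialization scales below are illustrative assumptions, not DeepSeek-V2's actual configuration; the matrix names follow the paper's $W^{DKV}$, $W^{UK}$, $W^{UV}$:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_latent = 1024, 64  # toy sizes: latent is 16x smaller than the model dim

# Projection matrices (random stand-ins for the learned weights)
W_DKV = rng.standard_normal((d_latent, d_model)) * 0.02  # down-projection
W_UK = rng.standard_normal((d_model, d_latent)) * 0.02   # up-projection for K
W_UV = rng.standard_normal((d_model, d_latent)) * 0.02   # up-projection for V

h_t = rng.standard_normal(d_model)  # hidden state for one token

# Step 1: compress -- this latent vector is ALL that enters the KV cache
c_t = W_DKV @ h_t                   # shape (64,)

# Step 2: decompress on demand when attention needs K and V
k_t = W_UK @ c_t                    # shape (1024,)
v_t = W_UV @ c_t                    # shape (1024,)

# Cache footprint per token: 64 floats vs. 2 * 1024 floats for full K + V
print(c_t.size, k_t.size + v_t.size)
```

In practice the up-projections can also be folded into the query and output projections, so the decompressed K and V never need to be materialized explicitly; the sketch keeps them separate for clarity.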

3.3 DeepSeekMoE: Fine-Grained Expert Segmentation and Shared Experts

Fine-Grained Expert Segmentation: Instead of N large experts with K active, DeepSeekMoE uses mN small experts with mK active. The total parameters and computation remain the same, but the combinations of active experts become vastly more diverse.

Traditional MoE: N=16 experts, K=2 active  → C(16,2) = 120 combinations
DeepSeekMoE:     mN=64 experts, mK=8 active → C(64,8) ≈ 4.4B combinations

Shared Expert Isolation: Some experts are designated as Shared Experts that are always active for all tokens, handling common knowledge like grammar and common sense. This frees Routed Experts to focus on their specialized domains.
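The dispatch logic above can be sketched as follows. The expert count, dimensions, and softmax gating are toy assumptions for illustration (the real layers use learned FFN experts and DeepSeek's own affinity function), but the shared-always-active plus routed-top-k structure is the one described here:

```python
import numpy as np
from math import comb

# Fine-grained segmentation: same budget, vastly more expert combinations
assert comb(16, 2) == 120
assert comb(64, 8) == 4_426_165_368  # ~4.4B

rng = np.random.default_rng(0)
d, n_shared, n_routed, top_k = 32, 2, 8, 3  # toy sizes, not the production config

# Each "expert" is just a linear map here, standing in for a small FFN
shared = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_shared)]
routed = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_routed)]
router = rng.standard_normal((n_routed, d)) * 0.1

def moe_forward(x):
    # Shared experts: always active, absorb common knowledge
    out = sum(E @ x for E in shared)
    # Routed experts: score all, keep only the top-k, gate by softmax weight
    logits = router @ x
    top = np.argsort(logits)[-top_k:]
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()
    out += sum(g * (routed[i] @ x) for g, i in zip(gates, top))
    return out

y = moe_forward(rng.standard_normal(d))
```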


4. DeepSeek-V3 (671B Total, 37B Active): The Limits of Efficiency

4.1 Model Specifications

| Item | Value |
| --- | --- |
| Total Parameters | 671B |
| Active Parameters (per token) | 37B |
| Context Length | 128K |
| Training Data | 14.8T tokens |
| MoE Structure | 1 Shared Expert + 256 Routed Experts (8 active) |
| Training GPU Hours | 2.788M H800 GPU hours |
| Training Cost (est.) | $5.576M |

4.2 FP8 Mixed Precision Training

DeepSeek-V3 is the first publicly known large-scale model trained with FP8. Key innovations include unified E4M3 format, Fine-Grained Quantization (tile-wise and block-wise scaling), and High-Precision FP32 Accumulation.
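A rough NumPy simulation of the tile-wise scaling idea. The rounding helper is a crude stand-in for E4M3 (it keeps about 4 significant bits and clamps to the E4M3 range, ignoring subnormals and the exponent floor), and the 1×128 tile size follows the activation-scaling granularity described in the report:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def simulate_e4m3(x):
    # Crude E4M3 stand-in: round the mantissa to 3 explicit bits,
    # clamp to the finite range (subnormals/exponent floor ignored).
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    mant, exp = np.frexp(x)              # x = mant * 2**exp, 0.5 <= |mant| < 1
    return np.round(mant * 16) / 16 * (2.0 ** exp)

def quantize_tilewise(x, tile=128):
    # Fine-grained quantization: one scale per 1x`tile` slice, so a single
    # outlier cannot crush the precision of the whole tensor.
    x = x.reshape(-1, tile)
    scale = E4M3_MAX / np.abs(x).max(axis=1, keepdims=True)
    return simulate_e4m3(x * scale), scale

rng = np.random.default_rng(0)
act = rng.standard_normal(4 * 128)
q, scale = quantize_tilewise(act)
deq = (q / scale).ravel()  # dequantize; real kernels fold this into FP32 accumulation
err = np.abs(deq - act).max()
```

The per-tile scale is exactly what makes the unified E4M3 format workable: each tile uses the full dynamic range regardless of how the magnitudes are distributed elsewhere in the tensor.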

4.3 Auxiliary-Loss-Free Load Balancing

Instead of using auxiliary loss functions (which face the dilemma of trading off model quality for balance), DeepSeek-V3 introduces Bias-Based Dynamic Balancing where each expert gets a dynamically adjusted bias term that affects only routing decisions, not the actual expert outputs.
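A minimal sketch of that mechanism, assuming a sigmoid-affinity router and a sign-based bias update with speed `gamma` (the update-rule shape follows the V3 report; all sizes are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, d = 8, 2, 16
router = rng.standard_normal((n_experts, d)) * 0.1
bias = np.zeros(n_experts)  # per-expert bias, tuned online for balance
gamma = 0.01                # bias update speed (a hyperparameter)

def route(x):
    affinity = 1.0 / (1.0 + np.exp(-(router @ x)))  # sigmoid affinities
    # The bias affects WHICH experts are chosen...
    chosen = np.argsort(affinity + bias)[-top_k:]
    # ...but gate weights use the unbiased affinity, so outputs are untouched
    gates = affinity[chosen] / affinity[chosen].sum()
    return chosen, gates

# After each batch: push down the bias of overloaded experts, raise underloaded
tokens = rng.standard_normal((256, d))
loads = np.zeros(n_experts)
for t in tokens:
    chosen, _ = route(t)
    loads[chosen] += 1
target = top_k * len(tokens) / n_experts
bias -= gamma * np.sign(loads - target)
```

Because the bias never multiplies an expert output, there is no gradient term pulling the model away from quality, which is exactly the dilemma the auxiliary loss could not escape.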

4.4 Multi-Token Prediction (MTP)

DeepSeek-V3 adopts MTP as a training objective, predicting the next n tokens at each position. The implementation maintains causal chain ordering and can be used for Speculative Decoding during inference.
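A toy illustration of the depth-shifted targets: prediction depth k at position t is scored against token t+k+1. The random logits stand in for the paper's sequential MTP modules, and the unweighted sum is an assumption (the report scales the MTP term by a factor λ):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, seq, depth = 50, 12, 2  # depth = extra future tokens predicted per position

tokens = rng.integers(0, vocab, size=seq)
# Random stand-ins for the logits each prediction depth produces per position
logits = rng.standard_normal((depth + 1, seq, vocab))

def ce(logits_k, targets):
    # Plain cross-entropy with a numerically stable softmax
    p = np.exp(logits_k - logits_k.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    return -np.log(p[np.arange(len(targets)), targets]).mean()

# Depth k predicts token t+k+1 from position t, so targets shift by k+1
losses = [ce(logits[k][: seq - k - 1], tokens[k + 1 :]) for k in range(depth + 1)]
mtp_loss = losses[0] + np.mean(losses[1:])  # main loss + MTP losses (lambda = 1 here)
```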

4.5 Training Cost: The $5.6M Controversy

The total training cost was approximately **$5.576M**, compared to GPT-4's estimated $100M+. However, this figure only covers the final training run and excludes architecture exploration, hardware purchase costs, and personnel costs.

4.6 Benchmark Results

| Benchmark | DeepSeek-V3 | GPT-4o | Claude 3.5 Sonnet | Llama 3.1 405B |
| --- | --- | --- | --- | --- |
| MMLU | 88.5 | 87.2 | 88.7 | 88.6 |
| HumanEval | 82.6 | 80.5 | 81.1 | 72.0 |
| MATH-500 | 90.2 | 76.6 | 78.3 | 73.8 |
| AIME 2024 | 39.2 | 15.7 | - | 23.3 |
| Codeforces | 51.6 | 23.0 | 17.5 | 21.0 |

4.7 API Cost Comparison

DeepSeek-V3 is approximately 9x cheaper than GPT-4o for both input and output, with comparable or better performance.


5. DeepSeek-R1: Awakening Reasoning Through Reinforcement Learning

5.1 R1-Zero: Remarkable Results of Pure RL

R1-Zero applied RL directly to the DeepSeek-V3 Base model without any Supervised Fine-Tuning, using only accuracy rewards and format rewards.

"Aha Moment": During RL training, the model spontaneously developed advanced reasoning patterns including self-verification, self-reflection, dynamic strategy adaptation, and longer reasoning chains for difficult problems. These behaviors emerged without being explicitly taught.

5.2 Group Relative Policy Optimization (GRPO)

GRPO eliminates the Critic Model requirement of PPO by using group-relative rewards, reducing memory usage by nearly half while maintaining equal or better training stability.
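The group-relative advantage at the heart of GRPO fits in a few lines; this omits the clipped policy-ratio objective and KL penalty that sit around it:

```python
import numpy as np

def grpo_advantages(rewards):
    # Normalize each sampled output's reward against its group's mean and std.
    # The group statistics replace PPO's learned critic as the baseline.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# One prompt, a group of G=4 sampled completions scored by a rule-based reward:
# correct answers get positive advantage, incorrect ones negative.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline comes from the group itself, no critic network of comparable size to the policy needs to be trained or held in memory, which is where the memory savings come from.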

5.3 DeepSeek-R1 Training Pipeline

The four-stage pipeline (Cold Start SFT -> Reasoning-focused RL -> Rejection Sampling + SFT -> Full-domain RL) preserves R1-Zero's reasoning ability while adding readability and general task capability.

5.4 Benchmark Results

| Benchmark | DeepSeek-R1 | OpenAI o1-1217 |
| --- | --- | --- |
| AIME 2024 (Pass@1) | 79.8% | 79.2% |
| MATH-500 (Pass@1) | 97.3% | 96.4% |
| SWE-Bench Verified | 49.2% | 48.9% |

5.5 Distillation: Transferring Reasoning to Smaller Models

R1-Distill-Qwen-32B achieved 72.6% on AIME 2024, significantly outperforming OpenAI o1-mini (63.6%). Even the 1.5B model demonstrated math reasoning capabilities that sometimes surpassed GPT-4o.


6. DeepSeek-Coder: Coding-Specialized Models

DeepSeek-Coder V2 achieved 90.2% on HumanEval, surpassing GPT-4 Turbo, Claude 3 Opus, and Gemini 1.5 Pro, demonstrating that open-source models can break through the closed-model barrier in code intelligence.


7. DeepSeek-VL / Janus: Vision-Language Models

Janus introduces an innovative approach of unifying multimodal understanding and visual generation in a single model through decoupled vision encoding paths processed by a single Unified Transformer.


8. MLA vs MHA vs GQA vs MQA Comparison

| Method | KV Cache (Relative Size) | Performance Impact |
| --- | --- | --- |
| MHA | 100% (baseline) | Maximum (baseline) |
| MQA | ~1.6% | Performance degradation |
| GQA (8 groups) | ~12.5% | Slight degradation |
| MLA | ~6.7% | No degradation / slight improvement |

MLA uses a smaller KV Cache than GQA while maintaining performance equal to or better than MHA. This is why MLA is considered revolutionary.


9. Industrial Impact

9.1 NVIDIA Stock Shock

On January 27, 2025, NVIDIA's stock dropped 17%, erasing approximately $589B in market cap, the largest single-day market cap loss for any stock in US stock market history.

9.2 Impact on US AI Policy

DeepSeek's success questioned the US technology sanctions strategy against China, demonstrating that H800 GPUs (with restricted performance compared to A100/H100) were sufficient to build world-class models.


10. Limitations and Future Outlook

Current Limitations: Safety/alignment concerns due to Chinese AI regulations, multilingual performance variance, inefficient lengthy reasoning chains, lack of real-time knowledge, and hallucination.

Future Outlook: MLA architecture adoption spreading, GRPO becoming the de facto standard for LLM RL, cost-efficient training becoming a new competitive axis, and multimodal integration through Janus/Janus-Pro evolving toward universal models.


References

  1. DeepSeek-AI. "DeepSeek LLM: Scaling Open-Source Language Models with Longtermism." arXiv:2401.02954, 2024.
  2. DeepSeek-AI. "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434, 2024.
  3. DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv:2412.19437, 2024.
  4. DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948, 2025.
  5. DeepSeek-AI. "DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence." arXiv:2406.11931, 2024.