DeepSeek Model Complete Analysis: From MLA and MoE to RL-Based Reasoning — Everything About the Chinese Open-Source LLM Innovation

1. Introduction to DeepSeek

1.1 Company Background: From Hedge Fund to AI Research Lab

DeepSeek was founded in July 2023 in Hangzhou, China. Its founder, Liang Wenfeng, is also the CEO of High-Flyer, a quant hedge fund established in 2016. High-Flyer grew into one of China's largest hedge funds through AI-based algorithmic trading.

High-Flyer possessed large-scale GPU clusters for trading, and after declaring the establishment of an AGI research lab in April 2023, spun off DeepSeek as an independent entity in July of the same year. DeepSeek is thus a unique AI research organization born from a hedge fund's financial power and GPU infrastructure.

1.2 Open-Source Philosophy

DeepSeek's most distinctive feature is its fully open-source strategy. DeepSeek releases all model weights under MIT License or commercially permissive licenses, and publishes detailed technical reports on arXiv for every model. This stands in stark contrast to the closed approaches of OpenAI and Anthropic, forming one of the two pillars of the open-source LLM ecosystem alongside Meta's Llama series.

1.3 Position in the Chinese AI Ecosystem

Among Chinese AI players (Baidu ERNIE, Alibaba Qwen, ByteDance Doubao, Zhipu AI GLM, Moonshot AI Kimi), DeepSeek distinguishes itself through:

  • Focus on fundamental research over product launches
  • Fully open-source: Most aggressive open-source strategy among Chinese AI companies
  • Cost efficiency: Proven ability to train world-class models at a fraction of US costs
  • Independent funding: Operating on High-Flyer's own capital without VC dependency

2. DeepSeek-V1 / DeepSeek LLM (67B)

2.1 Model Overview

DeepSeek's first foundation model, DeepSeek LLM, was released in January 2024 in 7B and 67B sizes.

| Item | DeepSeek LLM 7B | DeepSeek LLM 67B |
| --- | --- | --- |
| Parameters | 7B | 67B |
| Training Data | 2T tokens | 2T tokens |
| Context Length | 4K | 4K |
| Architecture | Dense Transformer | Dense Transformer |

2.2 Scaling Law Research

The paper's most important contribution was its independent scaling-law study, which went beyond the Chinchilla scaling law to present new findings on batch-size scaling, learning-rate scaling, and data-model allocation strategies.


3. DeepSeek-V2 (236B): Birth of MLA and DeepSeekMoE

DeepSeek-V2, released in May 2024, introduced two of DeepSeek's most critical architectural innovations: Multi-head Latent Attention (MLA) and DeepSeekMoE.

3.1 Model Specifications

| Item | Value |
| --- | --- |
| Total Parameters | 236B |
| Active Parameters (per token) | 21B |
| Context Length | 128K |
| Training Data | 8.1T tokens |
| MoE Structure | 2 Shared Experts + 160 Routed Experts (6 active) |
| Attention | Multi-head Latent Attention (MLA) |

3.2 Multi-head Latent Attention (MLA)

MLA compresses KV Cache by 93.3% while achieving performance equal to or better than standard Multi-head Attention (MHA). This fundamentally addresses the KV Cache memory bottleneck in LLM inference.

Core Idea: Low-Rank KV Joint Compression

Instead of reducing KV Head count, MLA jointly compresses KV into a low-dimensional latent vector. During inference, only this small latent vector is cached, and it is decompressed to the original dimension when needed.

Step 1: Down-Projection (Compression)

$$c_t^{KV} = W^{DKV} h_t$$

Step 2: Up-Projection (Decompression)

$$k_t^C = W^{UK} c_t^{KV}, \quad v_t^C = W^{UV} c_t^{KV}$$

The key point: what is stored in the cache is the compressed latent vector $c_t^{KV}$, not the original K and V vectors.

Decoupled Rotary Position Embedding solves the challenge of integrating RoPE with the compressed latent vectors by using a separate small RoPE key vector $k_t^R$.
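The two projection steps can be sketched in plain NumPy. All dimensions and initialization scales below are illustrative assumptions, not DeepSeek-V2's actual configuration; the matrix names follow the paper's $W^{DKV}$, $W^{UK}$, $W^{UV}$:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_latent = 1024, 64  # toy sizes: latent is 16x smaller than the model dim

# Projection matrices (random stand-ins for the learned weights)
W_DKV = rng.standard_normal((d_latent, d_model)) * 0.02  # down-projection
W_UK = rng.standard_normal((d_model, d_latent)) * 0.02   # up-projection for K
W_UV = rng.standard_normal((d_model, d_latent)) * 0.02   # up-projection for V

h_t = rng.standard_normal(d_model)  # hidden state for one token

# Step 1: compress -- this latent vector is ALL that enters the KV cache
c_t = W_DKV @ h_t                   # shape (64,)

# Step 2: decompress on demand when attention needs K and V
k_t = W_UK @ c_t                    # shape (1024,)
v_t = W_UV @ c_t                    # shape (1024,)

# Cache footprint per token: 64 floats vs. 2 * 1024 floats for full K + V
print(c_t.size, k_t.size + v_t.size)
```

In practice the up-projections can also be folded into the query and output projections, so the decompressed K and V never need to be materialized explicitly; the sketch keeps them separate for clarity.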

3.3 DeepSeekMoE: Fine-Grained Expert Segmentation and Shared Experts

Fine-Grained Expert Segmentation: Instead of N large experts with K active, DeepSeekMoE uses mN small experts with mK active. The total parameters and computation remain the same, but the combinations of active experts become vastly more diverse.

Traditional MoE: N=16 experts, K=2 active  → C(16,2) = 120 combinations
DeepSeekMoE:     mN=64 experts, mK=8 active → C(64,8) ≈ 4.4B combinations

Shared Expert Isolation: Some experts are designated as Shared Experts that are always active for all tokens, handling common knowledge like grammar and common sense. This frees Routed Experts to focus on their specialized domains.
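The dispatch logic above can be sketched as follows. The expert count, dimensions, and softmax gating are toy assumptions for illustration (the real layers use learned FFN experts and DeepSeek's own affinity function), but the shared-always-active plus routed-top-k structure is the one described here:

```python
import numpy as np
from math import comb

# Fine-grained segmentation: same budget, vastly more expert combinations
assert comb(16, 2) == 120
assert comb(64, 8) == 4_426_165_368  # ~4.4B

rng = np.random.default_rng(0)
d, n_shared, n_routed, top_k = 32, 2, 8, 3  # toy sizes, not the production config

# Each "expert" is just a linear map here, standing in for a small FFN
shared = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_shared)]
routed = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_routed)]
router = rng.standard_normal((n_routed, d)) * 0.1

def moe_forward(x):
    # Shared experts: always active, absorb common knowledge
    out = sum(E @ x for E in shared)
    # Routed experts: score all, keep only the top-k, gate by softmax weight
    logits = router @ x
    top = np.argsort(logits)[-top_k:]
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()
    out += sum(g * (routed[i] @ x) for g, i in zip(gates, top))
    return out

y = moe_forward(rng.standard_normal(d))
```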


4. DeepSeek-V3 (671B Total, 37B Active): The Limits of Efficiency

4.1 Model Specifications

| Item | Value |
| --- | --- |
| Total Parameters | 671B |
| Active Parameters (per token) | 37B |
| Context Length | 128K |
| Training Data | 14.8T tokens |
| MoE Structure | 1 Shared Expert + 256 Routed Experts (8 active) |
| Training GPU Hours | 2.788M H800 GPU hours |
| Training Cost (est.) | $5.576M |

4.2 FP8 Mixed Precision Training

DeepSeek-V3 is the first publicly known large-scale model trained with FP8. Key innovations include unified E4M3 format, Fine-Grained Quantization (tile-wise and block-wise scaling), and High-Precision FP32 Accumulation.
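A rough NumPy simulation of the tile-wise scaling idea. The rounding helper is a crude stand-in for E4M3 (it keeps about 4 significant bits and clamps to the E4M3 range, ignoring subnormals and the exponent floor), and the 1×128 tile size follows the activation-scaling granularity described in the report:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def simulate_e4m3(x):
    # Crude E4M3 stand-in: round the mantissa to 3 explicit bits,
    # clamp to the finite range (subnormals/exponent floor ignored).
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    mant, exp = np.frexp(x)              # x = mant * 2**exp, 0.5 <= |mant| < 1
    return np.round(mant * 16) / 16 * (2.0 ** exp)

def quantize_tilewise(x, tile=128):
    # Fine-grained quantization: one scale per 1x`tile` slice, so a single
    # outlier cannot crush the precision of the whole tensor.
    x = x.reshape(-1, tile)
    scale = E4M3_MAX / np.abs(x).max(axis=1, keepdims=True)
    return simulate_e4m3(x * scale), scale

rng = np.random.default_rng(0)
act = rng.standard_normal(4 * 128)
q, scale = quantize_tilewise(act)
deq = (q / scale).ravel()  # dequantize; real kernels fold this into FP32 accumulation
err = np.abs(deq - act).max()
```

The per-tile scale is exactly what makes the unified E4M3 format workable: each tile uses the full dynamic range regardless of how the magnitudes are distributed elsewhere in the tensor.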

4.3 Auxiliary-Loss-Free Load Balancing

Instead of using auxiliary loss functions (which face the dilemma of trading off model quality for balance), DeepSeek-V3 introduces Bias-Based Dynamic Balancing where each expert gets a dynamically adjusted bias term that affects only routing decisions, not the actual expert outputs.
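A minimal sketch of that mechanism, assuming a sigmoid-affinity router and a sign-based bias update with speed `gamma` (the update-rule shape follows the V3 report; all sizes are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, d = 8, 2, 16
router = rng.standard_normal((n_experts, d)) * 0.1
bias = np.zeros(n_experts)  # per-expert bias, tuned online for balance
gamma = 0.01                # bias update speed (a hyperparameter)

def route(x):
    affinity = 1.0 / (1.0 + np.exp(-(router @ x)))  # sigmoid affinities
    # The bias affects WHICH experts are chosen...
    chosen = np.argsort(affinity + bias)[-top_k:]
    # ...but gate weights use the unbiased affinity, so outputs are untouched
    gates = affinity[chosen] / affinity[chosen].sum()
    return chosen, gates

# After each batch: push down the bias of overloaded experts, raise underloaded
tokens = rng.standard_normal((256, d))
loads = np.zeros(n_experts)
for t in tokens:
    chosen, _ = route(t)
    loads[chosen] += 1
target = top_k * len(tokens) / n_experts
bias -= gamma * np.sign(loads - target)
```

Because the bias never multiplies an expert output, there is no gradient term pulling the model away from quality, which is exactly the dilemma the auxiliary loss could not escape.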

4.4 Multi-Token Prediction (MTP)

DeepSeek-V3 adopts MTP as a training objective, predicting the next n tokens at each position. The implementation maintains causal chain ordering and can be used for Speculative Decoding during inference.
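A toy illustration of the depth-shifted targets: prediction depth k at position t is scored against token t+k+1. The random logits stand in for the paper's sequential MTP modules, and the unweighted sum is an assumption (the report scales the MTP term by a factor λ):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, seq, depth = 50, 12, 2  # depth = extra future tokens predicted per position

tokens = rng.integers(0, vocab, size=seq)
# Random stand-ins for the logits each prediction depth produces per position
logits = rng.standard_normal((depth + 1, seq, vocab))

def ce(logits_k, targets):
    # Plain cross-entropy with a numerically stable softmax
    p = np.exp(logits_k - logits_k.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    return -np.log(p[np.arange(len(targets)), targets]).mean()

# Depth k predicts token t+k+1 from position t, so targets shift by k+1
losses = [ce(logits[k][: seq - k - 1], tokens[k + 1 :]) for k in range(depth + 1)]
mtp_loss = losses[0] + np.mean(losses[1:])  # main loss + MTP losses (lambda = 1 here)
```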

4.5 Training Cost: The $5.6M Controversy

The total training cost was approximately **$5.576M**, compared to GPT-4's estimated $100M+. However, this figure only covers the final training run and excludes architecture exploration, hardware purchase costs, and personnel costs.

4.6 Benchmark Results

| Benchmark | DeepSeek-V3 | GPT-4o | Claude 3.5 Sonnet | Llama 3.1 405B |
| --- | --- | --- | --- | --- |
| MMLU | 88.5 | 87.2 | 88.7 | 88.6 |
| HumanEval | 82.6 | 80.5 | 81.1 | 72.0 |
| MATH-500 | 90.2 | 76.6 | 78.3 | 73.8 |
| AIME 2024 | 39.2 | 15.7 | - | 23.3 |
| Codeforces | 51.6 | 23.0 | 17.5 | 21.0 |

4.7 API Cost Comparison

DeepSeek-V3 is approximately 9x cheaper than GPT-4o for both input and output, with comparable or better performance.


5. DeepSeek-R1: Awakening Reasoning Through Reinforcement Learning

5.1 R1-Zero: Remarkable Results of Pure RL

R1-Zero applied RL directly to the DeepSeek-V3 Base model without any Supervised Fine-Tuning, using only accuracy rewards and format rewards.

"Aha Moment": During RL training, the model spontaneously developed advanced reasoning patterns including self-verification, self-reflection, dynamic strategy adaptation, and longer reasoning chains for difficult problems. These behaviors emerged without being explicitly taught.

5.2 Group Relative Policy Optimization (GRPO)

GRPO eliminates the Critic Model requirement of PPO by using group-relative rewards, reducing memory usage by nearly half while maintaining equal or better training stability.
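The group-relative advantage at the heart of GRPO fits in a few lines; this omits the clipped policy-ratio objective and KL penalty that sit around it:

```python
import numpy as np

def grpo_advantages(rewards):
    # Normalize each sampled output's reward against its group's mean and std.
    # The group statistics replace PPO's learned critic as the baseline.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# One prompt, a group of G=4 sampled completions scored by a rule-based reward:
# correct answers get positive advantage, incorrect ones negative.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline comes from the group itself, no critic network of comparable size to the policy needs to be trained or held in memory, which is where the memory savings come from.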

5.3 DeepSeek-R1 Training Pipeline

The four-stage pipeline (Cold Start SFT -> Reasoning-focused RL -> Rejection Sampling + SFT -> Full-domain RL) preserves R1-Zero's reasoning ability while adding readability and general task capability.

5.4 Benchmark Results

| Benchmark | DeepSeek-R1 | OpenAI o1-1217 |
| --- | --- | --- |
| AIME 2024 (Pass@1) | 79.8% | 79.2% |
| MATH-500 (Pass@1) | 97.3% | 96.4% |
| SWE-Bench Verified | 49.2% | 48.9% |

5.5 Distillation: Transferring Reasoning to Smaller Models

R1-Distill-Qwen-32B achieved 72.6% on AIME 2024, significantly outperforming OpenAI o1-mini (63.6%). Even the 1.5B model demonstrated math reasoning capabilities that sometimes surpassed GPT-4o.


6. DeepSeek-Coder: Coding-Specialized Models

DeepSeek-Coder V2 achieved 90.2% on HumanEval, surpassing GPT-4 Turbo, Claude 3 Opus, and Gemini 1.5 Pro, demonstrating that open-source models can break through the closed-model barrier in code intelligence.


7. DeepSeek-VL / Janus: Vision-Language Models

Janus introduces an innovative approach of unifying multimodal understanding and visual generation in a single model through decoupled vision encoding paths processed by a single Unified Transformer.


8. MLA vs MHA vs GQA vs MQA Comparison

| Method | KV Cache (Relative Size) | Performance Impact |
| --- | --- | --- |
| MHA | 100% (baseline) | Maximum (baseline) |
| MQA | ~1.6% | Performance degradation |
| GQA (8 groups) | ~12.5% | Slight degradation |
| MLA | ~6.7% | No degradation / slight improvement |

MLA uses a smaller KV Cache than GQA while maintaining performance equal to or better than MHA. This is why MLA is considered revolutionary.


9. Industrial Impact

9.1 NVIDIA Stock Shock

On January 27, 2025, NVIDIA's stock dropped 17%, erasing approximately $589B in market cap, the largest single-day market cap loss for any stock in US stock market history.

9.2 Impact on US AI Policy

DeepSeek's success questioned the US technology sanctions strategy against China, demonstrating that H800 GPUs (with restricted performance compared to A100/H100) were sufficient to build world-class models.


10. Limitations and Future Outlook

Current Limitations: Safety/alignment concerns due to Chinese AI regulations, multilingual performance variance, inefficient lengthy reasoning chains, lack of real-time knowledge, and hallucination.

Future Outlook: MLA architecture adoption spreading, GRPO becoming the de facto standard for LLM RL, cost-efficient training becoming a new competitive axis, and multimodal integration through Janus/Janus-Pro evolving toward universal models.


References

  1. DeepSeek-AI. "DeepSeek LLM: Scaling Open-Source Language Models with Longtermism." arXiv:2401.02954, 2024.
  2. DeepSeek-AI. "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434, 2024.
  3. DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv:2412.19437, 2024.
  4. DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948, 2025.
  5. DeepSeek-AI. "DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence." arXiv:2406.11931, 2024.