Top LLM Papers 2024-2026 - Llama, DeepSeek, Qwen, Mistral, Phi, RLHF, DPO, CoT, RAG, FlashAttention, vLLM Reading List
Author: Youngju Kim (@fjvbn20031)
Prologue — Surviving the 2026 LLM Paper Firehose
Between January 2024 and May 2026, arXiv cs.CL and cs.LG averaged over 1,200 new submissions per week. Filter down to LLM-specific work and you still get ~300 a week, ~15,000 a year. No single person can read it all.
The question a working engineer asks in 2026 is therefore simple: "which 30 papers actually help the system I'm building today?"
This post curates that 30 plus a margin. Three criteria:
- Reproducible — code, weights, or enough detail to rebuild.
- Cited in the field — referenced in model cards, benchmark reports, production blog posts.
- Durable — the core insight survives the next model release six months out.
One-line summary: read in this order — foundation model reports → MoE / attention innovations → RLHF and DPO family → CoT and reasoning → agents and retrieval → FlashAttention and serving → evaluation and safety. One week and the whole 2026 LLM landscape is in your head.
1. Llama 3 — the New Open-Weight Baseline
Llama 3 / Llama 3.3 Technical Report (2024-07, arXiv:2407.21783)
Meta released Llama 3 across 8B, 70B, and 405B and effectively reset the open-weight baseline. The 92-page technical report documents the data curation pipeline (15T tokens), scaling law re-validation, post-training recipe (SFT + DPO + Rejection Sampling), and infrastructure (a 16K H100 cluster with 419 unexpected interruptions over a 54-day window; faulty GPUs and GPU HBM3 memory led the causes, with network hardware further down the list). This single report is the de facto textbook on how a modern LLM is built. The 8B variant is still, in 2026, the most common fine-tuning base.
Llama 3.3 70B kept the architecture and only strengthened post-training, reaching GPT-4o-level instruction following. With Llama 4 shipping a multimodal MoE in April 2025, "Llama equals open LLM standard" is now the working assumption.
2. DeepSeek-V3 and R1 — Peak MoE and Reasoning RL
DeepSeek-V3 Technical Report (2024-12, arXiv:2412.19437)
A 671B-parameter MoE trained on 14.8T tokens, reportedly for ~$5.58M of H800 time. That headline number shook the industry. The technical contributions worth knowing: MLA (Multi-head Latent Attention) compresses KV cache by ~10x; DeepSeekMoE uses 256 routed experts plus 1 shared expert; auxiliary-loss-free load balancing, FP8 training, and DualPipe pipeline parallelism are now standard references for follow-on open models.
DeepSeek-R1 (2025-01, arXiv:2501.12948)
R1 takes V3 as the base and reaches o1-class reasoning with pure RL. The key algorithm is GRPO (Group Relative Policy Optimization), which drops PPO's value network to save memory. The R1-Zero report — pure RL, no SFT — describes an "aha moment" where the model starts emitting self-review tokens like "Wait, let me reconsider…", one of the most-cited results of 2025.
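The group-relative part of GRPO fits in a few lines. This is a minimal sketch, assuming one scalar reward per completion and a group of completions sampled from the same prompt; the function name `group_relative_advantages`, the `eps` constant, and the choice of population standard deviation are illustrative assumptions, not taken from the DeepSeek code. Each completion's advantage is its reward standardized against its own group, which is what lets the algorithm skip a learned value network.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group.

    rewards: scalar rewards for G completions sampled from one prompt.
    Returns one advantage per completion; no value network is involved.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group (assumption)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one math prompt; reward 1.0 means "verified correct".
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct answers end up with positive advantages and wrong ones with negative advantages. A group where every answer gets the same reward produces all zeros, so it contributes no gradient signal.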
3. The Qwen Series — Trilingual Strength from China
Qwen2.5 Technical Report (2024-12, arXiv:2412.15115) and Qwen3 Technical Report (2025-Q2) together cover dense sizes from 0.5B up to 72B in Qwen2.5, MoE variants up to 235B total parameters in Qwen3, plus 128K-context, multimodal, math- and code-specialized variants. The Qwen family often beats Llama on CJK (Chinese, Japanese, Korean) workloads, and Qwen2.5-Coder 32B held the top SWE-Bench score among open-weight coding models for some time. In 2026 it is the most common base for Korean and Japanese startups training their own models.
4. Mistral and Mistral Large 2 — Europe Responds
Mistral 7B (2023-10, arXiv:2310.06825) combined sliding-window attention and grouped-query attention to beat Llama 2 13B at 7B size, a milestone. Mistral Large 2 (123B, 2024) shipped open weights under the Mistral Research License, while the smaller releases (Mistral 7B, Mixtral, Mistral Nemo) are Apache 2.0; Mistral Medium 3 (2025) is offered through the API rather than as open weights. Together they anchor the European open-weight position. Mixtral 8x7B and Mixtral 8x22B defined the sparse-MoE standard before DeepSeek; Codestral at 22B is still a common code-specific pick.
5. The Phi Series — "Data Quality Equals Model Quality"
Phi-3 Technical Report (2024-04, arXiv:2404.14219) and Phi-4 (2024-12, arXiv:2412.08905) are the high point of the Microsoft Research SLM (small language model) line. The thesis is simple: train on heavily filtered web data plus synthetic "textbook quality" data and a 3.8B model can rival GPT-3.5. Phi-4 at 14B caught Llama 3 70B on GPQA and MATH, and Phi-4-reasoning showed an o1-mini-class reasoner, evidence that SLMs can reason too.
6. Gemma 3 and Falcon 3 — The Rest of the Open Camp
Gemma 3 Technical Report (2025-Q1) ships 1B / 4B / 12B / 27B and ports some Gemini 2.0 internals (attention variants, distillation) to open weights. 128K context and multimodality come built in.
Falcon 3 (TII, UAE) and Command R+ (Cohere) emphasize English, Arabic, and multilingual RAG rather than CJK. Yi-Lightning (01.AI) and GLM-4-9B (Zhipu) are less known outside China but show up high on Chatbot Arena.
7. Commercial Model Cards — GPT-4, Claude 4.7, Gemini 2.5
For closed models the system card is the source of record, not the paper.
- GPT-4 Technical Report (2023, arXiv:2303.08774) — architecture details withheld but the evaluation methodology and safety procedures set a baseline.
- OpenAI o1 System Card (2024-09) — the first commercial reasoning model, RL plus CoT integrated at training.
- OpenAI o3 / o4-mini System Card (2025): covers o3, which in a high-compute configuration was the first model reported to exceed the average-human baseline on ARC-AGI.
- Anthropic Claude 4 / 4.5 / 4.7 Model Card — successors to Constitutional AI, sycophancy mitigation, citation features, computer use capability descriptions.
- Google Gemini 1.5 / 2.0 / 2.5 Technical Reports (Gemini 1.5: arXiv:2403.05530): 1M-token production context with native multimodality; the 1.5 report also describes research tests up to 10M tokens.
You read commercial cards for evaluation methodology, safety interventions, and limitations, not for benchmark numbers.
8. Mixture-of-Experts — Switch Transformer to DeepSeekMoE
MoE re-emerged at scale with GShard (2020) and Switch Transformer (2021, arXiv:2101.03961), continued through GLaM and ST-MoE, and stepped up again in 2024 with DeepSeekMoE (arXiv:2401.06066). Two ideas matter: fine-grained expert segmentation (more, smaller experts) and shared expert isolation (always-active experts that absorb common knowledge). DeepSeek-V3's 256+1 expert configuration follows directly.
Mixtral of Experts (arXiv:2401.04088) activates top-2 of 8 experts and is the most cited sparse-MoE implementation. OLMoE (Allen AI) was the first MoE to publish full training code and data.
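Top-2 routing is easier to hold in your head as code. The sketch below is a toy, assuming a single token, scalar inputs, and "experts" that are plain Python callables; `top_k_route` and `moe_forward` are hypothetical names, not any library's API. It picks the two highest router logits, applies a softmax over just those two, and mixes only the selected experts' outputs.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their gate weights.

    router_logits: one router score per expert for a single token.
    Returns a list of (expert_index, gate_weight) with weights summing to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax over the selected logits only (subtract the max for stability).
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Toy setup: 8 "experts" that just scale their input by their index.
experts = [lambda x, s=s: s * x for s in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]
routes = top_k_route(logits)
y = moe_forward(10.0, experts, logits)
print(routes, y)
```

Only two of the eight experts run for this token, which is where the gap between total and active parameter counts comes from.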
9. Attention Innovations — MLA, GQA, Sliding Window, Mamba
GQA: Grouped-Query Attention (arXiv:2305.13245): multiple query heads share each KV head, shrinking the KV cache by the grouping factor. Default in Llama 2 70B, Llama 3, Mistral, and nearly every modern model.
MLA: Multi-head Latent Attention (arXiv:2405.04434, DeepSeek-V2 paper): low-rank compression of the KV cache. The DeepSeek-V2 paper reports a 93.3% KV cache reduction relative to DeepSeek 67B.
Sliding Window Attention: used by Longformer (arXiv:2004.05150) and Mistral 7B. Each token attends to a fixed local window; Longformer additionally designates a few global tokens.
Mamba / Mamba-2 (arXiv:2312.00752, arXiv:2405.21060): SSM (state-space model) based. O(N) in sequence length instead of attention's O(N²). Throughput wins on long context. Hybrid stacks (transformer + Mamba blocks) have appeared since 2024: Jamba (AI21), Zamba2 (Zyphra).
RWKV-7 — attempts to match transformers with an RNN; a candidate for mobile and embedded.
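The KV-cache saving from GQA is plain arithmetic. A rough sketch, assuming a 16-bit cache and an illustrative 70B-class shape (80 layers, 64 query heads, head dimension 128); the helper `kv_cache_bytes` and the exact shape are assumptions for the example, not a specific model's published config.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes for keys plus values across all layers (the factor 2 is K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Same model shape, 8K context; only the number of KV heads changes.
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=8192)
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB, ratio: {mha // gqa}x")
# MHA: 20.0 GiB, GQA: 2.5 GiB, ratio: 8x
```

The ratio is simply query heads divided by KV heads, and it applies per sequence, so the saving multiplies with batch size.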
10. The Reasoning Lineage — CoT, ToT, Self-Consistency, GRPO
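Chain-of-Thought Prompting (arXiv:2201.11903, Wei et al. 2022): few-shot exemplars with worked reasoning steps more than triple PaLM 540B's GSM8K accuracy (about 18% to 57%). The zero-shot trigger "Let's think step by step" comes from the follow-up by Kojima et al. (arXiv:2205.11916).
Self-Consistency (arXiv:2203.11171): sample many reasoning paths, majority-vote the final answer. +10-20 percentage points over single-sample on reasoning tasks.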
Tree-of-Thoughts (arXiv:2305.10601) — search the reasoning tree. Effective on Game of 24 and creative writing.
Reflexion (arXiv:2303.11366) — log failures as text memory for the next attempt.
OpenAI o1 (blog, 2024-09) and DeepSeek-R1 GRPO — surface long CoT through RL during training. The reason every 2026 frontier model has a "thinking" mode.
Inference-Time Scaling Laws (arXiv:2408.03314) — at fixed compute, spending more on inference can beat spending more on parameters.
```python
# Inference-time scaling: Best-of-N sampling with a verifier
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tok = AutoTokenizer.from_pretrained(MODEL_ID)

def best_of_n(prompt, n=16, verifier=None):
    """Sample n completions and return the one the verifier scores highest."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    prompt_len = inputs["input_ids"].shape[1]
    candidates = []
    for _ in range(n):
        with torch.no_grad():
            out = model.generate(
                **inputs,
                do_sample=True,
                temperature=0.8,
                max_new_tokens=512,
            )
        # Decode only the newly generated tokens, not the echoed prompt.
        text = tok.decode(out[0][prompt_len:], skip_special_tokens=True)
        # Without a verifier, length is only a placeholder score, not a quality signal.
        score = verifier(text) if verifier else len(text)
        candidates.append((score, text))
    return max(candidates, key=lambda x: x[0])[1]
```
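Self-consistency from the list above swaps the verifier for a vote. A minimal sketch, assuming the final answer has already been parsed out of each sampled reasoning path; `self_consistency` and the replayed `fake_samples` are stand-ins for illustration, with the fake sampler taking the place of a real LLM call.

```python
from collections import Counter

def self_consistency(sample_answer, n=10):
    """Sample n final answers and return the most common one with its vote share."""
    answers = [sample_answer() for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n

# Stand-in for an LLM: replays a fixed list of parsed final answers.
fake_samples = iter(["18", "18", "20", "18", "17", "18", "18", "20", "18", "18"])
answer, share = self_consistency(lambda: next(fake_samples), n=10)
print(answer, share)
```

The vote share doubles as a cheap confidence signal: a low share is a reasonable trigger for sampling more paths or falling back to a verifier.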
11. The RLHF Lineage — InstructGPT, Constitutional AI, DPO
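InstructGPT (arXiv:2203.02155, Ouyang et al. 2022): the canonical RLHF paper. Three stages: supervised fine-tuning, reward-model training on human comparisons, then PPO against the reward model with a KL penalty toward the SFT policy.
Constitutional AI (arXiv:2212.08073, Anthropic 2022): replaces human harmlessness labels with AI feedback guided by a short human-written constitution, using self-critique and revision. The origin of RLAIF.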
DPO: Direct Preference Optimization (arXiv:2305.18290, Rafailov et al. 2023) — learns from preference pairs directly, no reward model. Eliminates PPO's machinery while matching performance. De facto standard since 2024.
ORPO (arXiv:2403.07691) — merges SFT and preference learning into one loss. Single-stage RLHF.
KTO: Kahneman-Tversky Optimization (arXiv:2402.01306) — trains from single labels (good/bad) instead of pairs. Lower labeling cost.
SimPO (arXiv:2405.14734) — drops the reference-model dependency of DPO. Memory savings.
Quick comparison:
| Algorithm | Reward model | Reference model | Label format |
|---|---|---|---|
| PPO (RLHF) | yes | yes | pair |
| DPO | no | yes | pair |
| ORPO | no | no | pair + SFT |
| KTO | no | yes | single |
| SimPO | no | no | pair |
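The "no reward model, yes reference model" row for DPO is visible directly in its loss. A minimal sketch for a single preference pair, assuming you already have summed sequence log-probabilities under the policy and the frozen reference; the function name, argument names, and the `beta=0.1` default are illustrative choices rather than any library's API.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probs."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) written as softplus(-z); the branch avoids overflow in exp.
    return math.log1p(math.exp(-logits)) if logits > -30 else -logits

# Policy identical to the reference: the loss starts at log(2).
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy raised the chosen answer and lowered the rejected one: the loss falls.
loss_improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(loss_at_init, loss_improved)
```

The reference log-probs are the only thing tying the policy to its starting point. SimPO and ORPO remove exactly those terms, which is why they need no reference model.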
12. Agents — ReAct, Voyager, SWE-Agent, OS-Atlas
ReAct (arXiv:2210.03629) — interleave reasoning and acting. Foundation of almost every LLM agent framework.
Voyager (arXiv:2305.16291) — lifelong-learning agent in Minecraft. Auto-builds a skill library.
SWE-Agent (arXiv:2405.15793): designs an agent-computer interface (ACI) rather than reusing a human IDE. With GPT-4 Turbo it resolved 12.5% of the full SWE-Bench test set and 18.0% of SWE-Bench Lite, well above the earlier retrieval-based baselines.
OS-Atlas (arXiv:2410.23218) — grounding model for GUI agents. Screen capture to coordinates/actions.
Computer Use survey: OSWorld (arXiv:2404.07972, 2024-04) gave the field a proper benchmark, and Anthropic's Claude Computer Use (2024-10) turned computer-use agents into a mainstream product capability.
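```python
# Minimal ReAct loop (sketch): llm is a callable prompt -> text,
# tools is any object with a run(action) method
def react_agent(task, tools, llm, max_steps=10):
    trajectory = [f"Task: {task}"]
    for _ in range(max_steps):
        context = "\n".join(trajectory)
        thought = llm(context + "\nThought:").strip()
        # Condition the action on the thought that was just produced.
        action = llm(context + f"\nThought: {thought}\nAction:").strip()
        if action.startswith("Finish"):
            return action
        observation = tools.run(action)
        trajectory.append(
            f"Thought: {thought}\nAction: {action}\nObservation: {observation}"
        )
    return "Max steps reached"
```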
13. The RAG Lineage — From the Original to GraphRAG
RAG (Retrieval-Augmented Generation) (arXiv:2005.11401, Lewis et al. 2020) — combined retrieval and generation. Standard for open-domain QA.
FiD: Fusion-in-Decoder (arXiv:2007.01282) — fuses passages inside the decoder. Stronger than RAG but with higher decoder context cost.
RETRO (arXiv:2112.04426, DeepMind) — 2T-token data store outside the model; chunk-wise retrieval.
ColBERT / ColBERTv2 (arXiv:2004.12832) — late interaction. Token-level query-document matching, the accuracy standard for dense retrieval.
Self-RAG (arXiv:2310.11511) — the model decides whether to retrieve and emits self-reflection tokens.
GraphRAG (arXiv:2404.16130, Microsoft 2024) — turns documents into a knowledge graph and searches via community summaries. Strong on global queries (summary, trend).
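Contextual Retrieval (Anthropic blog, 2024-09): prepend a short chunk-specific context to each chunk before embedding and BM25 indexing. Anthropic reports a 35% reduction in top-20 retrieval failures from contextual embeddings alone and 49% when combined with contextual BM25.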
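ColBERT's "late interaction" from the list above reduces to one scoring rule, often called MaxSim. A toy sketch, assuming token embeddings are already computed and L2-normalized; the 2-d vectors and document names are made up for illustration, and real systems run this over compressed indexes rather than Python lists.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: each query token keeps its best-matching doc token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy 2-d token embeddings (assumed already L2-normalized).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # has an exact match for each query token
doc_b = [[0.6, 0.8], [0.6, 0.8]]              # only partial matches
docs = {"doc_a": doc_a, "doc_b": doc_b}
ranked = sorted(docs, key=lambda name: maxsim_score(query, docs[name]), reverse=True)
print(ranked)
```

A single-vector retriever would compress each document to one embedding before comparing. Keeping per-token vectors until scoring time is what buys the accuracy, at the cost of a larger index.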
14. FlashAttention 1/2/3 — Rediscovering the Memory Hierarchy
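FlashAttention (arXiv:2205.14135, Dao et al. 2022): tile attention so each block stays inside SRAM and the full N×N score matrix is never written to HBM. Cuts HBM I/O for up to a 7.6x speedup on the attention computation.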
FlashAttention-2 (arXiv:2307.08691) — re-architected work partitioning. 2x faster. Most training stacks migrated.
FlashAttention-3 (arXiv:2407.08608) — exploits Hopper (H100/H200) async wgmma and TMA. 75% MFU on FP16, 1.2 PFLOPS on FP8.
```python
# Calling FlashAttention through PyTorch SDPA
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Shape: (batch, heads, seq_len, head_dim)
q = torch.randn(2, 8, 4096, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 8, 4096, 128, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 8, 4096, 128, device="cuda", dtype=torch.bfloat16)

# SDPA normally picks a backend on its own; this context manager
# restricts it to the FlashAttention kernel (PyTorch 2.3+).
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 4096, 128])
```
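The trick that makes tiling possible is the online softmax: a running max and running denominator let you process one tile of keys at a time and still get the exact result. The sketch below is a pure-Python illustration for a single query row with scalar values, not the CUDA kernel; the function names and the tile size are assumptions chosen for readability.

```python
import math

def attention_row_reference(scores, values):
    """Plain softmax-weighted average over the full score row."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def attention_row_tiled(scores, values, tile=2):
    """Same result, visiting the row one tile at a time with O(1) running state."""
    running_max, denom, acc = float("-inf"), 0.0, 0.0
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        # Rescale what was accumulated under the old max before adding the tile.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc *= scale
        for s, v in zip(s_tile, v_tile):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom

scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
full = attention_row_reference(scores, values)
tiled = attention_row_tiled(scores, values, tile=2)
print(full, tiled)
```

Because the running state is a few scalars per row, a tile never needs the rest of the score matrix, and that is what keeps the working set inside SRAM.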
15. vLLM and SGLang — Serving Infrastructure Standards
vLLM PagedAttention (arXiv:2309.06180, Kwon et al. 2023): manages KV cache like OS paging. KV cache waste drops from the 60-80% measured in earlier systems to under 4%. Throughput 2-4x over FasterTransformer and Orca at the same latency.
SGLang RadixAttention (arXiv:2312.07104): shares KV cache in a radix tree. Up to 5x faster on multi-turn or few-shot workloads with overlapping system prompts.
Mixture-of-Depths (arXiv:2404.02258, DeepMind 2024) — dynamically skips transformer layers per token. Same quality, fewer FLOPS.
Speculative Decoding (arXiv:2211.17192, Leviathan et al. 2022) — a small draft model proposes tokens, the large model verifies. Base 2-3x speedup.
```bash
# vLLM standard serving config: 2026 production pattern
# --ipc=host lets the tensor-parallel workers use the host's shared memory
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/models:/models \
  vllm/vllm-openai:latest \
  --model /models/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --enable-chunked-prefill
```
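Why paging helps is easy to see with toy numbers. The sketch below compares reserving a fixed maximum length per request against handing out fixed-size blocks on demand. The batch of sequence lengths, the 2048-token reservation, and the block size of 16 are all hypothetical values for illustration, and the model ignores everything except slot accounting.

```python
import math

def contiguous_waste(seq_lens, max_len):
    """Fraction of reserved slots left unused when every request reserves max_len."""
    reserved = max_len * len(seq_lens)
    return 1 - sum(seq_lens) / reserved

def paged_waste(seq_lens, block_size=16):
    """Fraction unused when slots are handed out in fixed-size blocks on demand."""
    reserved = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    return 1 - sum(seq_lens) / reserved

# Hypothetical batch: actual sequence lengths vs a 2048-token reservation each.
seq_lens = [100, 350, 1200, 40]
print(f"contiguous: {contiguous_waste(seq_lens, 2048):.1%}")
print(f"paged:      {paged_waste(seq_lens):.1%}")
```

With blocks, each sequence can waste at most one partially filled block, so the waste stays bounded no matter how much the lengths vary.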
16. Long Context — RoPE, YaRN, LongLoRA
RoPE: Rotary Positional Embedding (arXiv:2104.09864) — the Llama-family positional encoding standard.
YaRN (arXiv:2309.00071) — NTK-aware scaling of RoPE. Extends a 4K-trained model to 128K.
LongLoRA (arXiv:2309.12307) — sparse local attention plus LoRA for efficient context extension.
RingAttention (arXiv:2310.01889) — ring-topology KV communication across devices. Enables training at 1M+ context.
Activation Beacon (arXiv:2401.03462) — compress context into beacon tokens. Efficient retrieval.
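Gemini 1.5 Pro and Gemini 2.5 Pro at 1M tokens (with research tests up to 10M reported for Gemini 1.5) are widely assumed to sit on combinations of these techniques, though Google has not published the details.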
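The property every RoPE-scaling method relies on is that attention scores depend only on the relative offset between positions. A minimal sketch of the rotation itself, assuming the common base of 10000 and a toy 4-dimensional head; the vectors are arbitrary, and YaRN's frequency rescaling is deliberately left out.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of vec by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]
# Same offset (3) at different absolute positions gives the same score.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
print(near, far)
```

Context-extension methods change how `theta` grows with `pos`, so that offsets far beyond the training length still map to angles the model has seen.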
17. Code LLMs — StarCoder, DeepSeek Coder, Codestral
StarCoder 2 (arXiv:2402.19173, BigCode 2024) — 619 programming languages, 4T+ tokens. Full weights and training data open.
DeepSeek Coder V2 (arXiv:2406.11931): 236B MoE, 21B active. Matches GPT-4 Turbo on HumanEval and MBPP. The general-purpose DeepSeek-V3 later scaled the MoE design to 671B.
Codestral (Mistral, 2024-05) — 22B, 80 languages, 32K context. Frequent pick for IDE integrations.
Code Llama (arXiv:2308.12950) — Code-specialized Llama 2 variant. Code Llama 70B briefly led open-weight coding.
Qwen2.5-Coder (32B) — Qwen's coding variant. Held #1 open SWE-Bench for a while.
18. Small Models — The SLM Renaissance
One of the bigger 2024-2026 shifts: "small can punch above its weight."
- Phi-3.5 Mini (3.8B) — strong general model that runs on phones.
- Gemma 2 2B / Gemma 3 1B: edge-friendly 1B-2B class.
- Qwen2.5 3B / 7B — multilingual SLM standards.
- Mistral 7B / Mistral Nemo 12B — classic-size standards.
- SmolLM2 (arXiv:2502.02737): 135M, 360M, and 1.7B; the 1.7B model was trained on about 11T tokens. SmolLM-Corpus data catalog shipped alongside.
- TinyLlama (arXiv:2401.02385) — 1.1B trained on 3T tokens.
In 2026 most mobile and embedded LLMs derive from this set.
19. Evaluation — From MMLU and HumanEval to SWE-Bench and OSWorld
Traditional benchmarks:
- MMLU (arXiv:2009.03300) — 57-domain multiple choice.
- GSM8K (arXiv:2110.14168) — grade-school math.
- MATH (arXiv:2103.03874) — competition math.
- HumanEval (arXiv:2107.03374) — code completion.
- BIG-Bench Hard (arXiv:2210.09261).
2024-2026 next generation:
- GPQA (arXiv:2311.12022) — PhD-level STEM.
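- MMLU-Pro (arXiv:2406.01574): a harder MMLU with ten answer options instead of four and more reasoning-heavy questions.
- ARC-AGI (Chollet): general-intelligence test. o3 was the first model reported to exceed the average-human baseline, in a high-compute configuration.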
- SWE-Bench (arXiv:2310.06770) and SWE-Bench Verified — real GitHub issue resolution.
- OSWorld (arXiv:2404.07972) — computer-use agents.
- MMMU (arXiv:2311.16502) — multimodal multiple choice.
- LMSYS Chatbot Arena (arXiv:2403.04132): human pairwise votes, Elo-style ratings.
In 2026 GSM8K and HumanEval are saturated on frontier models; the meaningful signal moved to SWE-Bench, OSWorld, GPQA, and ARC-AGI.
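How pairwise votes become a leaderboard number is worth seeing once. The sketch below is the textbook online Elo update, with hypothetical model names and a conventional K-factor of 32; treat it as the simplest form of the idea, since the actual Arena leaderboard fits ratings with its own statistical procedure.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, a_wins, k=32):
    """Return updated (r_a, r_b) after one pairwise vote."""
    delta = k * ((1.0 if a_wins else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

# Four human votes between two hypothetical models, both starting at 1000.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
for winner in ["model_x", "model_x", "model_y", "model_x"]:
    ratings["model_x"], ratings["model_y"] = elo_update(
        ratings["model_x"], ratings["model_y"], a_wins=(winner == "model_x")
    )
print(ratings)
```

An upset moves ratings more than an expected win, which is why a handful of votes against a top model shifts the board faster than many votes confirming it.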
20. Main Model Comparison Table
| Model | Released | Size | MMLU | HumanEval | GSM8K | SWE-Bench |
|---|---|---|---|---|---|---|
| Llama 3.1 70B | 2024-07 | 70B | 86.0 | 80.5 | 95.1 | 31.2 |
| Llama 3.3 70B | 2024-12 | 70B | 86.9 | 88.4 | 96.5 | 41.4 |
| DeepSeek-V3 | 2024-12 | 671B MoE | 88.5 | 89.0 | 89.3 | 42.0 |
| DeepSeek-R1 | 2025-01 | 671B MoE | 91.2 | 96.3 | 97.3 | 49.2 |
| Qwen2.5-72B | 2024-09 | 72B | 86.1 | 86.6 | 95.8 | 36.0 |
| Mistral Large 2 | 2024-07 | 123B | 84.0 | 92.0 | 93.0 | 32.0 |
| Phi-4 | 2024-12 | 14B | 84.8 | 82.6 | 80.4 | - |
| Gemma 3 27B | 2025-Q1 | 27B | 81.0 | 79.8 | 89.2 | 28.5 |
| GPT-4o | 2024-05 | ? | 88.7 | 90.2 | 95.8 | 33.2 |
| Claude 4.7 | 2026 | ? | 90.1 | 96.3 | 96.4 | 65+ |
| Gemini 2.5 Pro | 2025 | ? | 89.8 | 92.0 | 95.4 | 51.0 |
Numbers from each model card or the LMSYS / Open LLM Leaderboard averages. Don't read the table as a ranking — read it for "which axes saturate and which still have headroom each generation."
21. Safety and Alignment — Constitutional AI, Sycophancy, Refusal
Constitutional AI (arXiv:2212.08073) opened the door to reducing human labels in RLHF via model self-critique.
Discovering Language Model Behaviors with Model-Written Evaluations (arXiv:2212.09251) — measure subtle alignment failures like sycophancy using the model itself.
Universal and Transferable Adversarial Attacks on Aligned Language Models (arXiv:2307.15043, the GCG attack) — adversarial suffixes can break alignment in a systematic way.
Jailbreak Survey (arXiv:2402.13457) — taxonomy of jailbreaks through 2024.
Sleeper Agents (arXiv:2401.05566, Anthropic) — training-time backdoors survive standard safety training. An important paper on the limits of alignment.
Tamper-Resistant Safeguards (arXiv:2408.00761) — attempts to make open-weight safety robust to additional fine-tuning.
22. Korean Models — HyperCLOVA X, EXAONE 3.5, Kanana
HyperCLOVA X Technical Report (arXiv:2404.01954, Naver 2024) — Korean-English bilingual plus Korean culture, law, and medical evaluation sets (KoBigBench, KMMLU). The de facto baseline report for Korean LLMs.
EXAONE 3.5 (LG AI Research, 2024-12) — 2.4B / 7.8B / 32B. English-Korean bilingual, 32K context. Released under the EXAONE AI Model License rather than Apache 2.0, but research use is permitted.
Kanana (Kakao, 2025) — 2B / 8B / 32B. Korean and English. Internal LLM backbone for KakaoTalk.
KORAi / KORani / KoGPT / Polyglot-Ko — earlier Korean models. From 2025 the three above are the practical majors.
KMMLU (arXiv:2402.11548) — Korean MMLU. The default Korean-LLM evaluation.
23. Japanese Models — Sakana, Stockmark, Swallow, PLaMo
Sakana AI Evolutionary Optimization of Model Merging Recipes (arXiv:2403.13187): uses evolutionary algorithms to search automatically for model-merging recipes across languages and domains. EvoLLM-JP marked a new direction for Japanese LLMs.
Stockmark-100b (Stockmark, 2024) — 100B Japanese-English bilingual model trained on a Japanese business corpus.
Swallow (Tokyo Tech, arXiv:2404.17790) — continual pretraining of Llama 2/3 on Japanese corpora.
PLaMo 2 / 100B (Preferred Networks) — Japanese, English, code. PFN's own training corpus.
NEC cotomi — Japanese business-domain LLM. 130B and 7B variants.
Rakuten AI 7B, Karasu, Stable LM Japanese and other 7B-class Japanese models are plentiful.
JGLUE / Japanese MT-Bench — Japanese evaluation standards.
24. Data — Dolma, RedPajama, FineWeb
The three open-data majors.
- Dolma (arXiv:2402.00159, AI2) — 3T tokens. Used for OLMo training.
- RedPajama-Data-v2 (Together AI, 2023-10) — 30T tokens. Multilingual plus English.
- FineWeb (arXiv:2406.17557, HuggingFace) — 15T tokens plus FineWeb-Edu 1.3T variant.
The Pile (arXiv:2101.00027, EleutherAI) — the 800GB starting point of open LLMs in 2021.
Common Crawl and the cleanup pipelines on top of it (CCNet, DataComp-LM, TxT360, Nemotron-CC) are the 2026 standards for open-data curation.
25. Multimodal — LLaVA, CogVLM, Qwen-VL, Pixtral
LLaVA (arXiv:2304.08485, 2023) — Vicuna plus a CLIP visual encoder plus a projection. The starting point of open multimodal.
LLaVA-1.5 / LLaVA-NeXT — better resolution handling and multi-turn.
Qwen-VL / Qwen2-VL (arXiv:2308.12966, arXiv:2409.12191) — arbitrary resolution and multilingual OCR. Qwen2.5-VL adds video.
Pixtral 12B (Mistral, 2024-09) — Pixtral's vision encoder handles arbitrary-resolution patches.
Idefics 3 (HuggingFace) — open data plus open weights multimodal.
Molmo (AI2, arXiv:2409.17146) — pointing as a training task. Strong fit for agent stacks.
26. Reading Order — A Curated 30 for 2026 Engineers
If you can only read 30, do them in this order:
- Llama 3 Technical Report — the full picture of modern LLM training.
- DeepSeek-V3 Technical Report — peak cost-efficient training.
- DeepSeek-R1 — RL-based reasoning.
- Mixtral of Experts — the MoE standard.
- DeepSeekMoE — fine-grained MoE.
- GQA and MLA — two axes of attention efficiency.
- FlashAttention-2 — the training-speed standard.
- vLLM PagedAttention — the serving standard.
- SGLang RadixAttention — KV cache sharing.
- CoT Prompting — the reasoning starting point.
- DPO — the post-training standard.
- Constitutional AI — origin of RLAIF.
- ReAct — the agent starting point.
- SWE-Agent — code-agent standard.
- OSWorld — computer-use evaluation.
- RAG (the original) — retrieval combination.
- ColBERTv2 — dense retrieval accuracy.
- GraphRAG — global RAG.
- Self-RAG — self-retrieval.
- YaRN — RoPE scaling.
- RingAttention — long-context training.
- Speculative Decoding — decoding acceleration.
- Phi-3 / Phi-4 — SLM renaissance.
- SmolLM2 — open SLM data.
- MMLU and GPQA — the evaluation baselines.
- SWE-Bench Verified — code evaluation.
- LMSYS Chatbot Arena — human preference.
- Sleeper Agents — limits of alignment.
- HyperCLOVA X — Korean LLM baseline.
- Sakana EvoLLM — model merging.
One paper a week for 30 weeks, or 30 days flat, gets the whole 2026 LLM landscape into your head.
References
- arxiv.org — https://arxiv.org/
- Llama 3 Technical Report — https://arxiv.org/abs/2407.21783
- DeepSeek-V3 Technical Report — https://arxiv.org/abs/2412.19437
- DeepSeek-R1 — https://arxiv.org/abs/2501.12948
- Qwen2.5 Technical Report — https://arxiv.org/abs/2412.15115
- Mistral 7B — https://arxiv.org/abs/2310.06825
- Mixtral of Experts — https://arxiv.org/abs/2401.04088
- Phi-3 Technical Report — https://arxiv.org/abs/2404.14219
- Phi-4 — https://arxiv.org/abs/2412.08905
- Gemini 1.5 — https://arxiv.org/abs/2403.05530
- Switch Transformer — https://arxiv.org/abs/2101.03961
- DeepSeekMoE — https://arxiv.org/abs/2401.06066
- GQA — https://arxiv.org/abs/2305.13245
- MLA / DeepSeek-V2 — https://arxiv.org/abs/2405.04434
- Mamba — https://arxiv.org/abs/2312.00752
- Mamba-2 — https://arxiv.org/abs/2405.21060
- Chain-of-Thought — https://arxiv.org/abs/2201.11903
- Self-Consistency — https://arxiv.org/abs/2203.11171
- Tree-of-Thoughts — https://arxiv.org/abs/2305.10601
- Inference-Time Scaling — https://arxiv.org/abs/2408.03314
- InstructGPT — https://arxiv.org/abs/2203.02155
- Constitutional AI — https://arxiv.org/abs/2212.08073
- DPO — https://arxiv.org/abs/2305.18290
- ORPO — https://arxiv.org/abs/2403.07691
- KTO — https://arxiv.org/abs/2402.01306
- SimPO — https://arxiv.org/abs/2405.14734
- ReAct — https://arxiv.org/abs/2210.03629
- Voyager — https://arxiv.org/abs/2305.16291
- SWE-Agent — https://arxiv.org/abs/2405.15793
- OS-Atlas — https://arxiv.org/abs/2410.23218
- OSWorld — https://arxiv.org/abs/2404.07972
- RAG — https://arxiv.org/abs/2005.11401
- FiD — https://arxiv.org/abs/2007.01282
- RETRO — https://arxiv.org/abs/2112.04426
- ColBERT — https://arxiv.org/abs/2004.12832
- Self-RAG — https://arxiv.org/abs/2310.11511
- GraphRAG — https://arxiv.org/abs/2404.16130
- FlashAttention — https://arxiv.org/abs/2205.14135
- FlashAttention-2 — https://arxiv.org/abs/2307.08691
- FlashAttention-3 — https://arxiv.org/abs/2407.08608
- vLLM PagedAttention — https://arxiv.org/abs/2309.06180
- SGLang — https://arxiv.org/abs/2312.07104
- Speculative Decoding — https://arxiv.org/abs/2211.17192
- Mixture-of-Depths — https://arxiv.org/abs/2404.02258
- RoPE — https://arxiv.org/abs/2104.09864
- YaRN — https://arxiv.org/abs/2309.00071
- LongLoRA — https://arxiv.org/abs/2309.12307
- RingAttention — https://arxiv.org/abs/2310.01889
- Activation Beacon — https://arxiv.org/abs/2401.03462
- StarCoder 2 — https://arxiv.org/abs/2402.19173
- DeepSeek Coder V2 — https://arxiv.org/abs/2406.11931
- Code Llama — https://arxiv.org/abs/2308.12950
- MMLU — https://arxiv.org/abs/2009.03300
- GSM8K — https://arxiv.org/abs/2110.14168
- MATH — https://arxiv.org/abs/2103.03874
- HumanEval — https://arxiv.org/abs/2107.03374
- GPQA — https://arxiv.org/abs/2311.12022
- SWE-Bench — https://arxiv.org/abs/2310.06770
- MMMU — https://arxiv.org/abs/2311.16502
- LMSYS Chatbot Arena — https://arxiv.org/abs/2403.04132
- HyperCLOVA X — https://arxiv.org/abs/2404.01954
- KMMLU — https://arxiv.org/abs/2402.11548
- Sakana EvoLLM — https://arxiv.org/abs/2403.13187
- Swallow — https://arxiv.org/abs/2404.17790
- Sleeper Agents — https://arxiv.org/abs/2401.05566
- HuggingFace — https://huggingface.co/
- Meta AI Research — https://ai.meta.com/research/
- DeepSeek — https://www.deepseek.com/
- Qwen — https://qwenlm.github.io/
- Mistral AI — https://mistral.ai/news/
- OpenAI Research — https://openai.com/research/
- Anthropic Research — https://www.anthropic.com/research
- Google DeepMind Research — https://deepmind.google/research/
- vLLM — https://github.com/vllm-project/vllm
- SGLang — https://github.com/sgl-project/sglang