Top LLM Papers 2024-2026 - Llama, DeepSeek, Qwen, Mistral, Phi, RLHF, DPO, CoT, RAG, FlashAttention, vLLM Reading List

Prologue — Surviving the 2026 LLM Paper Firehose

Between January 2024 and May 2026, arXiv cs.CL and cs.LG averaged over 1,200 new submissions per week. Filter down to LLM-specific work and you still get ~300 a week, ~15,000 a year. No single person can read it all.

The question a working engineer asks in 2026 is therefore simple: "which 30 papers actually help the system I'm building today?"

This post curates that 30 plus a margin. Three criteria:

  • Reproducible — code, weights, or enough detail to rebuild.
  • Cited in the field — referenced in model cards, benchmark reports, production blog posts.
  • Durable — the core insight survives the next model release six months out.

One-line summary: read in this order — foundation model reports → MoE / attention innovations → RLHF and DPO family → CoT and reasoning → agents and retrieval → FlashAttention and serving → evaluation and safety. One week and the whole 2026 LLM landscape is in your head.


1. Llama 3 — the New Open-Weight Baseline

Llama 3 / Llama 3.3 Technical Report (2024-07, arXiv:2407.21783)

Meta released Llama 3 across 8B, 70B, and 405B and effectively reset the open-weight baseline. The 92-page technical report documents the data curation pipeline (15T tokens), scaling law re-validation, post-training recipe (SFT + DPO + Rejection Sampling), and infrastructure (a 16K H100 cluster with 419 interruptions; the most common failures were GPU, then memory, then NIC). A single report is the de facto textbook on how a modern LLM is built. The 8B variant is still, in 2026, the most common fine-tuning base.

Llama 3.3 70B kept the architecture and only strengthened post-training, reaching GPT-4o-level instruction following. With Llama 4 shipping a multimodal MoE in mid-2025, "Llama equals open LLM standard" is now the working assumption.


2. DeepSeek-V3 and R1 — Peak MoE and Reasoning RL

DeepSeek-V3 Technical Report (2024-12, arXiv:2412.19437)

A 671B-parameter MoE trained on 14.8T tokens, reportedly for ~$5.58M of H800 time. That headline number shook the industry. The technical contributions worth knowing: MLA (Multi-head Latent Attention) compresses KV cache by ~10x; DeepSeekMoE uses 256 routed experts plus 1 shared expert; auxiliary-loss-free load balancing, FP8 training, and DualPipe pipeline parallelism are now standard references for follow-on open models.

DeepSeek-R1 (2025-01, arXiv:2501.12948)

R1 takes V3 as the base and reaches o1-class reasoning with pure RL. The key algorithm is GRPO (Group Relative Policy Optimization), which drops PPO's value network to save memory. The R1-Zero report — pure RL, no SFT — describes an "aha moment" where the model starts emitting self-review tokens like "Wait, let me reconsider…", one of the most-cited results of 2025.
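The idea behind dropping the value network can be sketched in a few lines: sample several answers for the same prompt, then score each answer against its own group instead of against a learned baseline. This is a minimal illustration, not DeepSeek's implementation; the function name, the use of population standard deviation, and the `eps` term are assumptions made for the example.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps guards against a zero-variance group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt, scored 1.0 if correct else 0.0 (made-up rewards).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct answers get a positive advantage and incorrect ones a negative advantage, with no critic model held in memory.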


3. The Qwen Series — Trilingual Strength from China

Qwen2.5 Technical Report (2024-12, arXiv:2412.15115) and Qwen3 Technical Report (2025-Q2) cover dense sizes from 0.5B up to 72B (Qwen3 adds MoE variants up to 235B total parameters), plus 128K-context, multimodal, math- and code-specialized variants. The Qwen family often beats Llama on CJK (Chinese, Japanese, Korean) workloads, and Qwen2.5-Coder 32B held the top SWE-Bench score among open-weight coding models for some time. In 2026 it is the most common base for Korean and Japanese startups training their own models.


4. Mistral and Mistral Large 2 — Europe Responds

Mistral 7B (2023-10, arXiv:2310.06825) combined sliding-window attention and grouped-query attention to beat Llama 2 13B at 7B size — a milestone. In 2024 Mistral Large 2 (123B) and in 2025 Mistral Medium 3 shipped under Apache 2.0 or Mistral Research License, anchoring the European open-weight position. Mixtral 8x7B and Mixtral 8x22B defined the sparse-MoE standard before DeepSeek; Codestral at 22B is still a common code-specific pick.


5. The Phi Series — "Data Quality Equals Model Quality"

Phi-3 Technical Report (2024-04, arXiv:2404.14219) and Phi-4 (2024-12, arXiv:2412.08905) are the high point of the Microsoft Research SLM (small language model) line. The thesis is simple: train only on "textbook quality data" and a 3.8B model can beat GPT-3.5. Phi-4 at 14B caught Llama 3 70B on GPQA and MATH, and Phi-4-reasoning showed an o1-mini-class reasoner — evidence that SLMs can reason too.


6. Gemma 3 and Falcon 3 — The Rest of the Open Camp

Gemma 3 Technical Report (2025-Q1) ships 1B / 4B / 12B / 27B and ports some Gemini 2.0 internals (attention variants, distillation) to open weights. 128K context and multimodality come built in.

Falcon 3 (TII, UAE) and Command R+ (Cohere) emphasize English, Arabic, and multilingual RAG rather than CJK. Yi-Lightning (01.AI) and GLM-4-9B (Zhipu) are less known outside China but show up high on Chatbot Arena.


7. Commercial Model Cards — GPT-4, Claude 4.7, Gemini 2.5

For closed models the system card is the source of record, not the paper.

  • GPT-4 Technical Report (2023, arXiv:2303.08774) — architecture details withheld but the evaluation methodology and safety procedures set a baseline.
  • OpenAI o1 System Card (2024-09) — the first commercial reasoning model, RL plus CoT integrated at training.
  • OpenAI o3 / o4-mini System Card (2025) - o3 was reported as the first model to clear average human on ARC-AGI.
  • Anthropic Claude 4 / 4.5 / 4.7 Model Card — successors to Constitutional AI, sycophancy mitigation, citation features, computer use capability descriptions.
  • Google Gemini 1.5 / 2.0 / 2.5 Technical Report (arXiv:2403.05530) — 1M to 10M token context with native multimodality.

You read commercial cards for evaluation methodology, safety interventions, and limitations, not for benchmark numbers.


8. Mixture-of-Experts — Switch Transformer to DeepSeekMoE

MoE re-emerged with GShard (2020) and Switch Transformer (arXiv:2101.03961, 2021), continued through GLaM and ST-MoE, and stepped up again in 2024 with DeepSeekMoE (arXiv:2401.06066). Two ideas matter: fine-grained expert segmentation (more, smaller experts) and shared expert isolation (separate experts for common knowledge). DeepSeek-V3's 256+1 expert configuration follows directly.

Mixtral of Experts (arXiv:2401.04088) activates top-2 of 8 experts and is the most cited sparse-MoE implementation. OLMoE (Allen AI) was the first MoE to publish full training code and data.
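To make "activates top-2 of 8 experts" concrete, here is a minimal routing sketch for a single token. It assumes the gate weights are a softmax over the selected logits only; the toy experts and the router logits are made up for illustration and stand in for learned layers.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])  # renormalize over the chosen experts only
    return list(zip(top, weights))

def moe_layer(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; the other experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Eight toy "experts": expert i simply multiplies its input by i.
experts = [lambda x, i=i: i * x for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 1.0, 0.0, -0.5, 0.2]
routes = top_k_route(logits)
output = moe_layer(1.0, logits, experts)
```

Only two of the eight experts execute per token, which is where the compute savings of sparse MoE come from.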


9. Attention Innovations — MLA, GQA, Sliding Window, Mamba

GQA: Grouped-Query Attention (arXiv:2305.13245) — multiple query heads share KV heads. Default in Llama 2/3, Mistral, and nearly every modern model.

MLA: Multi-head Latent Attention (arXiv:2405.04434, DeepSeek-V2 paper) - low-rank compression of the KV cache. The DeepSeek-V2 paper reports a 93.3% KV cache reduction relative to DeepSeek 67B.

Sliding Window Attention - used by Longformer (arXiv:2004.05150) and Mistral 7B. Each token attends to a local window; Longformer additionally adds a few global tokens.

Mamba / Mamba-2 (arXiv:2312.00752, arXiv:2405.21060) - SSM (state-space model) based. O(N) instead of O(N²) attention. Throughput wins on long context. Hybrid stacks (transformer + Mamba blocks) have appeared since 2024: Jamba (AI21), Zamba2 (Zyphra).

RWKV-7 — attempts to match transformers with an RNN; a candidate for mobile and embedded.
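The memory argument for GQA is plain arithmetic, sketched below. The shape (32 layers, 32 query heads, 8 KV heads, head dimension 128, bf16 values) is an assumed Llama-3-8B-like configuration chosen for illustration, and the function name is hypothetical.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values stored for one token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# Assumed Llama-3-8B-like shape: 32 layers, 32 query heads, head_dim 128, bf16.
mha = kv_cache_bytes_per_token(32, n_kv_heads=32, head_dim=128)  # one KV head per query head
gqa = kv_cache_bytes_per_token(32, n_kv_heads=8, head_dim=128)   # 4 query heads share each KV head

context = 8192
print(f"MHA: {mha * context / 2**30:.2f} GiB, GQA: {gqa * context / 2**30:.2f} GiB")
# prints "MHA: 4.00 GiB, GQA: 1.00 GiB" for a single 8K-token sequence
```

MLA goes further by caching a compressed latent instead of full per-head keys and values, which this simple count does not model.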


10. The Reasoning Lineage — CoT, ToT, Self-Consistency, GRPO

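Chain-of-Thought Prompting (arXiv:2201.11903, Wei et al. 2022) - few-shot exemplars with worked reasoning steps more than double GSM8K accuracy on the largest models. The zero-shot "Let's think step by step" prompt comes from Kojima et al. (arXiv:2205.11916).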

Self-Consistency (arXiv:2203.11171) — sample many, majority-vote. +10-20% over single-sample on reasoning tasks.

Tree-of-Thoughts (arXiv:2305.10601) — search the reasoning tree. Effective on Game of 24 and creative writing.

Reflexion (arXiv:2303.11366) — log failures as text memory for the next attempt.

OpenAI o1 (blog, 2024-09) and DeepSeek-R1 GRPO — surface long CoT through RL during training. The reason every 2026 frontier model has a "thinking" mode.

Inference-Time Scaling Laws (arXiv:2408.03314) — at fixed compute, spending more on inference can beat spending more on parameters.

# Inference-time scaling: Best-of-N with a verifier
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained(MODEL_ID)

def best_of_n(prompt, n=16, verifier=None):
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    prompt_len = inputs["input_ids"].shape[1]
    candidates = []
    for _ in range(n):
        with torch.no_grad():
            out = model.generate(
                **inputs,
                do_sample=True,
                temperature=0.8,
                max_new_tokens=512,
            )
        # Decode only the newly generated tokens, not the echoed prompt
        text = tok.decode(out[0][prompt_len:], skip_special_tokens=True)
        # Without a verifier, length is only a placeholder score, not a quality signal
        score = verifier(text) if verifier else len(text)
        candidates.append((score, text))
    # Return the text of the highest-scoring candidate
    return max(candidates, key=lambda x: x[0])[1]
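Self-consistency, listed earlier in this section, needs no verifier at all: sample several reasoning chains, extract each final answer, and keep the most common one. The sketch below covers only the voting step; the function name is hypothetical and the sampled answers are made-up data.

```python
from collections import Counter

def self_consistency(answers):
    """Return the most frequent final answer and its share of the votes."""
    counts = Counter(a.strip() for a in answers if a and a.strip())
    if not counts:
        return None, 0.0
    winner, votes = counts.most_common(1)[0]
    return winner, votes / sum(counts.values())

# Final answers extracted from five sampled reasoning chains (made-up data).
sampled = ["18", "18", "20", "18 ", "17"]
answer, agreement = self_consistency(sampled)
print(answer, agreement)
```

The agreement ratio doubles as a rough confidence signal: a low share suggests the samples disagree and the question may deserve more samples or a verifier.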

11. The RLHF Lineage — InstructGPT, Constitutional AI, DPO

InstructGPT (arXiv:2203.02155, Ouyang et al. 2022) - the canonical RLHF paper. Three stages: SFT, reward-model training, then PPO against the reward model with a KL penalty.

Constitutional AI (arXiv:2212.08073, Anthropic 2022) - replaces human harmlessness labels with AI feedback guided by a short written constitution; the model critiques and revises its own outputs. The origin of RLAIF.

DPO: Direct Preference Optimization (arXiv:2305.18290, Rafailov et al. 2023) — learns from preference pairs directly, no reward model. Eliminates PPO's machinery while matching performance. De facto standard since 2024.

ORPO (arXiv:2403.07691) — merges SFT and preference learning into one loss. Single-stage RLHF.

KTO: Kahneman-Tversky Optimization (arXiv:2402.01306) — trains from single labels (good/bad) instead of pairs. Lower labeling cost.

SimPO (arXiv:2405.14734) — drops the reference-model dependency of DPO. Memory savings.

Quick comparison:

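| Algorithm | Reward model | Reference model | Label format |
| --- | --- | --- | --- |
| PPO (RLHF) | yes | yes | pair |
| DPO | no | yes | pair |
| ORPO | no | no | pair + SFT |
| KTO | no | yes | single |
| SimPO | no | no | pair |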
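The DPO row is easiest to understand from its loss on a single preference pair. The sketch below assumes the four inputs are summed sequence log-probabilities under the policy and the frozen reference model; the function name and the example numbers are illustrative.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * margin) for one preference pair.

    margin = (policy vs. reference log-ratio on the chosen answer)
           - (policy vs. reference log-ratio on the rejected answer)
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    z = beta * margin
    # Numerically stable -log(sigmoid(z)), i.e. softplus(-z)
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Made-up log-probs: the policy favors the chosen answer more than the reference does.
good = dpo_loss(pi_chosen=-12.0, pi_rejected=-15.0, ref_chosen=-13.0, ref_rejected=-14.0)
bad = dpo_loss(pi_chosen=-15.0, pi_rejected=-12.0, ref_chosen=-14.0, ref_rejected=-13.0)
```

No reward model appears anywhere: the reference model's log-probabilities play the role of the KL anchor. Roughly speaking, SimPO removes the reference terms as well and works with length-normalized policy log-probabilities.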

12. Agents — ReAct, Voyager, SWE-Agent, OS-Atlas

ReAct (arXiv:2210.03629) — interleave reasoning and acting. Foundation of almost every LLM agent framework.

Voyager (arXiv:2305.16291) — lifelong-learning agent in Minecraft. Auto-builds a skill library.

SWE-Agent (arXiv:2405.15793) - designs an agent-computer interface (ACI) rather than reusing a human IDE. With GPT-4 Turbo it resolved 12.5% of the full SWE-Bench (18.0% on SWE-Bench Lite), up from 3.8% for the best prior non-interactive baseline.

OS-Atlas (arXiv:2410.23218) — grounding model for GUI agents. Screen capture to coordinates/actions.

Computer Use - OSWorld (arXiv:2404.07972, 2024-04) supplied the benchmark, and it became the standard yardstick once Anthropic's Claude Computer Use launched in 2024-10.

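# Minimal ReAct loop (llm is any callable: prompt string -> completion string)
def react_agent(task, tools, llm, max_steps=10):
    trajectory = [f"Task: {task}"]
    for _ in range(max_steps):
        # Ask for a thought first, then condition the action on that thought
        thought = llm("\n".join(trajectory + ["Thought:"])).strip()
        trajectory.append(f"Thought: {thought}")
        action = llm("\n".join(trajectory + ["Action:"])).strip()
        trajectory.append(f"Action: {action}")
        if action.startswith("Finish"):
            return action
        # tools.run is assumed to parse and execute the action string
        observation = tools.run(action)
        trajectory.append(f"Observation: {observation}")
    return "Max steps reached"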

13. The RAG Lineage — From the Original to GraphRAG

RAG (Retrieval-Augmented Generation) (arXiv:2005.11401, Lewis et al. 2020) — combined retrieval and generation. Standard for open-domain QA.

FiD: Fusion-in-Decoder (arXiv:2007.01282) — fuses passages inside the decoder. Stronger than RAG but with higher decoder context cost.

RETRO (arXiv:2112.04426, DeepMind) — 2T-token data store outside the model; chunk-wise retrieval.

ColBERT / ColBERTv2 (arXiv:2004.12832) — late interaction. Token-level query-document matching, the accuracy standard for dense retrieval.

Self-RAG (arXiv:2310.11511) — the model decides whether to retrieve and emits self-reflection tokens.

GraphRAG (arXiv:2404.16130, Microsoft 2024) — turns documents into a knowledge graph and searches via community summaries. Strong on global queries (summary, trend).

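Contextual Retrieval (Anthropic blog, 2024-09) - prepend a short chunk-specific context to each chunk before embedding. Anthropic reports 35% fewer retrieval failures with contextual embeddings alone and 49% fewer when combined with contextual BM25.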
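ColBERT's "late interaction" is the least intuitive idea in this section, so here is a minimal sketch of its MaxSim scoring rule: every query token keeps the score of its best-matching document token, and those maxima are summed. The 2-d embeddings are made up; real ColBERT vectors are learned and normalized, and the helper names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: each query token keeps its best-matching document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy 2-d token embeddings (made up for illustration).
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]],
    "doc_b": [[0.7, 0.7], [-1.0, 0.0]],
}
ranked = sorted(docs, key=lambda name: maxsim_score(query, docs[name]), reverse=True)
print(ranked)
```

Because document token vectors can be precomputed and indexed, only the query has to be encoded at search time, unlike a full cross-encoder.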


14. FlashAttention 1/2/3 — Rediscovering the Memory Hierarchy

FlashAttention (arXiv:2205.14135, Dao et al. 2022) - tile attention to stay inside SRAM. Cuts HBM I/O for up to a 7.6x speedup on the attention computation.

FlashAttention-2 (arXiv:2307.08691) — re-architected work partitioning. 2x faster. Most training stacks migrated.

FlashAttention-3 (arXiv:2407.08608) — exploits Hopper (H100/H200) async wgmma and TMA. 75% MFU on FP16, 1.2 PFLOPS on FP8.

# Calling FlashAttention from PyTorch via SDPA
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Shape: (batch, heads, seq_len, head_dim); the flash backend needs fp16/bf16 on CUDA
q = torch.randn(2, 8, 4096, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 8, 4096, 128, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 8, 4096, 128, device="cuda", dtype=torch.bfloat16)

# SDPA normally picks a backend on its own. This context manager (PyTorch 2.3+)
# restricts it to FlashAttention, so the call fails instead of silently falling back.
# It replaces the deprecated torch.backends.cuda.sdp_kernel.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 4096, 128])
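What the kernel does internally can be illustrated without a GPU. The sketch below shows the online-softmax trick behind tiling for a single query vector: keys and values are visited block by block while a running maximum, normalizer, and output are rescaled, so the full score row is never materialized. It is a pure-Python illustration, not the CUDA kernel; the 1/sqrt(d) scaling and causal masking are left out, and all names and data are made up.

```python
import math

def naive_attention(q, keys, values):
    """Reference: softmax over all scores at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def tiled_attention(q, keys, values, block=2):
    """One pass over key/value blocks with a running max, normalizer, and output."""
    running_max, normalizer = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk, v_blk = keys[start:start + block], values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # rescale everything accumulated so far
        normalizer *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            normalizer += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / normalizer for a in acc]

q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
reference = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, block=2)
```

Both functions return the same result up to floating-point error; the tiled version simply never needs more than one block of scores in fast memory at a time.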

15. vLLM and SGLang — Serving Infrastructure Standards

vLLM PagedAttention (arXiv:2309.06180, Kwon et al. 2023) - manages KV cache like OS paging. KV cache memory waste drops from roughly 60-80% to under 4%. The paper reports 2-4x throughput over FasterTransformer and Orca at the same latency.

SGLang RadixAttention (arXiv:2312.07104) — shares KV cache in a radix tree. 5x faster on multi-turn or few-shot workloads with overlapping system prompts.

Mixture-of-Depths (arXiv:2404.02258, DeepMind 2024) — dynamically skips transformer layers per token. Same quality, fewer FLOPS.

Speculative Decoding (arXiv:2211.17192, Leviathan et al. 2022) — a small draft model proposes tokens, the large model verifies. Base 2-3x speedup.
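The draft-then-verify loop can be sketched with toy next-token functions. This is the greedy simplification, where a draft token is accepted only if it equals the target's own greedy choice; the paper's method uses a probabilistic accept/reject rule over the two distributions, which is not reproduced here. The "models" are made-up arithmetic rules, and in a real system the target scores all drafted positions in a single forward pass.

```python
def speculative_step(context, draft_next, target_next, k=4):
    """Greedy speculative step: accept draft tokens while the target agrees."""
    proposal, ctx = [], list(context)
    for _ in range(k):                      # cheap model drafts k tokens
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(context)
    for tok in proposal:                    # expensive model verifies them in order
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)       # first mismatch: keep the target's token and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))       # all drafts accepted: one bonus token
    return accepted

# Toy models over digits: the target counts upward, the draft errs after a 3.
target_next = lambda ctx: (ctx[-1] + 1) % 10
draft_next = lambda ctx: 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

partial = speculative_step([1], draft_next, target_next, k=4)
full = speculative_step([5], draft_next, target_next, k=2)
```

The output always matches what the target model alone would have produced greedily; the speedup comes from emitting several tokens per target pass whenever the draft is right.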

# vLLM standard serving config, 2026 production pattern
# --ipc=host gives the container the host shared memory that tensor-parallel workers rely on
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/models:/models \
  vllm/vllm-openai:latest \
  --model /models/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --enable-chunked-prefill
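The paging analogy behind PagedAttention can be made concrete with a toy block table. This is an illustrative sketch only, not vLLM's data structures: the class and method names are hypothetical, and real systems add block sharing, copy-on-write, and preemption.

```python
class PagedKVCache:
    """Toy block table: maps each sequence to a list of fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> [physical block ids]
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # last block is full, or none allocated yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real scheduler would preempt a request")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self):
        """Allocated but unused token slots: at most block_size - 1 per sequence."""
        return sum(len(t) * self.block_size - self.lengths[s]
                   for s, t in self.block_tables.items())

cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):
    cache.append_token("req-a")
for _ in range(5):
    cache.append_token("req-b")
print(cache.block_tables, cache.wasted_slots())
```

Because memory is claimed one block at a time instead of reserving the maximum sequence length up front, waste is limited to the partially filled last block of each request.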

16. Long Context — RoPE, YaRN, LongLoRA

RoPE: Rotary Positional Embedding (arXiv:2104.09864) — the Llama-family positional encoding standard.

YaRN (arXiv:2309.00071) — NTK-aware scaling of RoPE. Extends a 4K-trained model to 128K.

LongLoRA (arXiv:2309.12307) — sparse local attention plus LoRA for efficient context extension.

RingAttention (arXiv:2310.01889) — ring-topology KV communication across devices. Enables training at 1M+ context.

Activation Beacon (arXiv:2401.03462) — compress context into beacon tokens. Efficient retrieval.

Gemini 1.5 Pro at 1M tokens and Gemini 2.5 at 10M sit on combinations of these techniques.
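The property that makes RoPE extensible is that attention scores depend only on relative position. The sketch below rotates adjacent pairs of dimensions, following the original RoPE formulation (some libraries pair dimensions differently), and adds a `scale` argument that implements plain position interpolation as a stand-in for context extension. YaRN's per-frequency scaling is more involved and is not reproduced; all names and vectors are illustrative.

```python
import math

def rope(vec, position, base=10000.0, scale=1.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles.

    scale > 1 divides the position (plain position interpolation, an assumption
    of this sketch), squeezing long positions back into the trained range.
    """
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = (position / scale) * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same relative offset (3), very different absolute positions.
near = dot(rope(q, 10), rope(k, 7))
shifted = dot(rope(q, 1010), rope(k, 1007))

# With scale=4, position 4096 is rotated exactly like position 1024.
interpolated = rope(q, 4096, scale=4.0)
```

The two scores agree up to floating-point error, which is why rescaling positions or frequencies can stretch a model's usable context without retraining from scratch.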


17. Code LLMs — StarCoder, DeepSeek Coder, Codestral

StarCoder 2 (arXiv:2402.19173, BigCode 2024) — 619 programming languages, 4T+ tokens. Full weights and training data open.

DeepSeek Coder V2 (arXiv:2406.11931) — 236B MoE, 21B active. Matches GPT-4 Turbo on HumanEval and MBPP. V3 scales to 671B MoE.

Codestral (Mistral, 2024-05) — 22B, 80 languages, 32K context. Frequent pick for IDE integrations.

Code Llama (arXiv:2308.12950) — Code-specialized Llama 2 variant. Code Llama 70B briefly led open-weight coding.

Qwen2.5-Coder (32B) — Qwen's coding variant. Held #1 open SWE-Bench for a while.


18. Small Models — The SLM Renaissance

One of the bigger 2024-2026 shifts: "small can punch above its weight."

  • Phi-3.5 Mini (3.8B) — strong general model that runs on phones.
  • Gemma 2B / 3 1B — edge-friendly 1B-class.
  • Qwen2.5 3B / 7B — multilingual SLM standards.
  • Mistral 7B / Mistral Nemo 12B — classic-size standards.
  • SmolLM2 (arXiv:2502.02737) — 360M and 1.7B trained on 11T tokens. SmolLM-Corpus data catalog shipped alongside.
  • TinyLlama (arXiv:2401.02385) — 1.1B trained on 3T tokens.

In 2026 most mobile and embedded LLMs derive from this set.


19. Evaluation — From MMLU and HumanEval to SWE-Bench and OSWorld

Traditional benchmarks: MMLU, HumanEval, GSM8K.

2024-2026 next generation: SWE-Bench, OSWorld, GPQA, ARC-AGI.

In 2026 GSM8K and HumanEval are saturated on frontier models; the meaningful signal moved to SWE-Bench, OSWorld, GPQA, and ARC-AGI.
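Code benchmarks in the HumanEval style are usually reported as pass@k. A common way to compute it is the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples were generated per problem and c of them passed the tests. The sketch below assumes that estimator; the function name is illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n passes, given c passing."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so every draw of k contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples per problem, 3 of them pass the unit tests.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
```

Averaging this value over all problems gives the benchmark score, and the gap between pass@1 and pass@k shows how much a model gains from resampling.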


20. Main Model Comparison Table

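| Model | Released | Size | MMLU | HumanEval | GSM8K | SWE-Bench |
| --- | --- | --- | --- | --- | --- | --- |
| Llama 3.1 70B | 2024-07 | 70B | 86.0 | 80.5 | 95.1 | 31.2 |
| Llama 3.3 70B | 2024-12 | 70B | 86.9 | 88.4 | 96.5 | 41.4 |
| DeepSeek-V3 | 2024-12 | 671B MoE | 88.5 | 89.0 | 89.3 | 42.0 |
| DeepSeek-R1 | 2025-01 | 671B MoE | 91.2 | 96.3 | 97.3 | 49.2 |
| Qwen2.5-72B | 2024-09 | 72B | 86.1 | 86.6 | 95.8 | 36.0 |
| Mistral Large 2 | 2024-07 | 123B | 84.0 | 92.0 | 93.0 | 32.0 |
| Phi-4 | 2024-12 | 14B | 84.8 | 82.6 | 80.4 | - |
| Gemma 3 27B | 2025-Q1 | 27B | 81.0 | 79.8 | 89.2 | 28.5 |
| GPT-4o | 2024-05 | ? | 88.7 | 90.2 | 95.8 | 33.2 |
| Claude 4.7 | 2026 | ? | 90.1 | 96.3 | 96.4 | 65+ |
| Gemini 2.5 Pro | 2025 | ? | 89.8 | 92.0 | 95.4 | 51.0 |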

Numbers from each model card or the LMSYS / Open LLM Leaderboard averages. Don't read the table as a ranking — read it for "which axes saturate and which still have headroom each generation."


21. Safety and Alignment — Constitutional AI, Sycophancy, Refusal

Constitutional AI (arXiv:2212.08073) opened the door to reducing human labels in RLHF via model self-critique.

Discovering Language Model Behaviors with Model-Written Evaluations (arXiv:2212.09251) — measure subtle alignment failures like sycophancy using the model itself.

Universal and Transferable Adversarial Attacks on Aligned Language Models (arXiv:2307.15043, the GCG attack) — adversarial suffixes can break alignment in a systematic way.

Jailbreak Survey (arXiv:2402.13457) — taxonomy of jailbreaks through 2024.

Sleeper Agents (arXiv:2401.05566, Anthropic) — training-time backdoors survive standard safety training. An important paper on the limits of alignment.

Tamper-Resistant Safeguards (arXiv:2408.00761) — attempts to make open-weight safety robust to additional fine-tuning.


22. Korean Models — HyperCLOVA X, EXAONE 3.5, Kanana

HyperCLOVA X Technical Report (arXiv:2404.01954, Naver 2024) — Korean-English bilingual plus Korean culture, law, and medical evaluation sets (KoBigBench, KMMLU). The de facto baseline report for Korean LLMs.

EXAONE 3.5 (LG AI Research, 2024-12) — 2.4B / 7.8B / 32B. English-Korean bilingual, 32K context. Released under the EXAONE AI Model License rather than Apache 2.0, but research use is permitted.

Kanana (Kakao, 2025) — 2B / 8B / 32B. Korean and English. Internal LLM backbone for KakaoTalk.

KORAi / KORani / KoGPT / Polyglot-Ko — earlier Korean models. From 2025 the three above are the practical majors.

KMMLU (arXiv:2402.11548) — Korean MMLU. The default Korean-LLM evaluation.


23. Japanese Models — Sakana, Stockmark, Swallow, PLaMo

Sakana AI Evolutionary Optimization of Model Merging Recipes (arXiv:2403.13187) — evolution-algorithm-driven automatic multilingual model merging. EvoLLM-JP marked a new direction for Japanese LLMs.

Stockmark-100b (Stockmark, 2024) — 100B Japanese-English bilingual model trained on a Japanese business corpus.

Swallow (Tokyo Tech, arXiv:2404.17790) — continual pretraining of Llama 2/3 on Japanese corpora.

PLaMo 2 / 100B (Preferred Networks) — Japanese, English, code. PFN's own training corpus.

NEC cotomi — Japanese business-domain LLM. 130B and 7B variants.

Rakuten AI 7B, Karasu, Stable LM Japanese and other 7B-class Japanese models are plentiful.

JGLUE / Japanese MT-Bench — Japanese evaluation standards.


24. Data — Dolma, RedPajama, FineWeb

The three open-data majors.

  • Dolma (arXiv:2402.00159, AI2) — 3T tokens. Used for OLMo training.
  • RedPajama-Data-v2 (Together AI, 2023-10) — 30T tokens. Multilingual plus English.
  • FineWeb (arXiv:2406.17557, HuggingFace) — 15T tokens plus FineWeb-Edu 1.3T variant.

The Pile (arXiv:2101.00027, EleutherAI) — the 800GB starting point of open LLMs in 2021.

Common Crawl and the cleanup pipelines on top of it (CCNet, DataComp-LM, TxT360, Nemotron-CC) are the 2026 standards for open-data rationalization.


25. Multimodal — LLaVA, CogVLM, Qwen-VL, Pixtral

LLaVA (arXiv:2304.08485, 2023) — Vicuna plus a CLIP visual encoder plus a projection. The starting point of open multimodal.

LLaVA-1.5 / LLaVA-NeXT — better resolution handling and multi-turn.

Qwen-VL / Qwen2-VL (arXiv:2308.12966, arXiv:2409.12191) — arbitrary resolution and multilingual OCR. Qwen2.5-VL adds video.

Pixtral 12B (Mistral, 2024-09) — Pixtral's vision encoder handles arbitrary-resolution patches.

Idefics 3 (HuggingFace) — open data plus open weights multimodal.

Molmo (AI2, arXiv:2409.17146) — pointing as a training task. Strong fit for agent stacks.


26. Reading Order — A Curated 30 for 2026 Engineers

If you can only read 30, do them in this order:

  1. Llama 3 Technical Report — the full picture of modern LLM training.
  2. DeepSeek-V3 Technical Report — peak cost-efficient training.
  3. DeepSeek-R1 — RL-based reasoning.
  4. Mixtral of Experts — the MoE standard.
  5. DeepSeekMoE — fine-grained MoE.
  6. GQA and MLA — two axes of attention efficiency.
  7. FlashAttention-2 — the training-speed standard.
  8. vLLM PagedAttention — the serving standard.
  9. SGLang RadixAttention — KV cache sharing.
  10. CoT Prompting — the reasoning starting point.
  11. DPO — the post-training standard.
  12. Constitutional AI — origin of RLAIF.
  13. ReAct — the agent starting point.
  14. SWE-Agent — code-agent standard.
  15. OSWorld — computer-use evaluation.
  16. RAG (the original) — retrieval combination.
  17. ColBERTv2 — dense retrieval accuracy.
  18. GraphRAG — global RAG.
  19. Self-RAG — self-retrieval.
  20. YaRN — RoPE scaling.
  21. RingAttention — long-context training.
  22. Speculative Decoding — decoding acceleration.
  23. Phi-3 / Phi-4 — SLM renaissance.
  24. SmolLM2 — open SLM data.
  25. MMLU and GPQA — the evaluation baselines.
  26. SWE-Bench Verified — code evaluation.
  27. LMSYS Chatbot Arena — human preference.
  28. Sleeper Agents — limits of alignment.
  29. HyperCLOVA X — Korean LLM baseline.
  30. Sakana EvoLLM — model merging.

One paper a week for 30 weeks, or 30 days flat, gets the whole 2026 LLM landscape into your head.


References