필사 모드: Top LLM Papers 2024-2026 - Llama, DeepSeek, Qwen, Mistral, Phi, RLHF, DPO, CoT, RAG, FlashAttention, vLLM Reading List
EnglishPrologue — Surviving the 2026 LLM Paper Firehose
Between January 2024 and May 2026, arXiv `cs.CL` and `cs.LG` averaged over 1,200 new submissions per week. Filter down to LLM-specific work and you still get ~300 a week, ~15,000 a year. No single person can read it all.
The question a working engineer asks in 2026 is therefore simple: **"which 30 papers actually help the system I'm building today?"**
This post curates that 30 plus a margin. Three criteria:
- **Reproducible** — code, weights, or enough detail to rebuild.
- **Cited in the field** — referenced in model cards, benchmark reports, production blog posts.
- **Durable** — the core insight survives the next model release six months out.
> One-line summary: read in this order — **foundation model reports → MoE / attention innovations → RLHF and DPO family → CoT and reasoning → agents and retrieval → FlashAttention and serving → evaluation and safety**. One week and the whole 2026 LLM landscape is in your head.
1. Llama 3 — the New Open-Weight Baseline
**Llama 3 / Llama 3.3 Technical Report** (2024-07, [arXiv:2407.21783](https://arxiv.org/abs/2407.21783))
Meta released Llama 3 across 8B, 70B, and 405B and effectively reset the open-weight baseline. The 92-page technical report documents the **data curation pipeline** (15T tokens), **scaling law re-validation**, **post-training recipe** (SFT + DPO + Rejection Sampling), and **infrastructure** (a 16K H100 cluster with 419 interruptions; the most common failures were GPU, then memory, then NIC). A single report is the de facto textbook on how a modern LLM is built. The 8B variant is still, in 2026, the most common fine-tuning base.
Llama 3.3 70B kept the architecture and only strengthened post-training, reaching GPT-4o-level instruction following. With Llama 4 shipping a multimodal MoE in mid-2025, "Llama equals open LLM standard" is now the working assumption.
2. DeepSeek-V3 and R1 — Peak MoE and Reasoning RL
**DeepSeek-V3 Technical Report** (2024-12, [arXiv:2412.19437](https://arxiv.org/abs/2412.19437))
A 671B-parameter MoE trained on 14.8T tokens, reportedly for ~$5.58M of H800 time. That headline number shook the industry. The technical contributions worth knowing: **MLA (Multi-head Latent Attention)** compresses KV cache by ~10x; **DeepSeekMoE** uses 256 routed experts plus 1 shared expert; **auxiliary-loss-free load balancing**, **FP8 training**, and **DualPipe pipeline parallelism** are now standard references for follow-on open models.
**DeepSeek-R1** (2025-01, [arXiv:2501.12948](https://arxiv.org/abs/2501.12948))
R1 takes V3 as the base and reaches o1-class reasoning **with pure RL**. The key algorithm is **GRPO (Group Relative Policy Optimization)**, which drops PPO's value network to save memory. The R1-Zero report — pure RL, no SFT — describes an "aha moment" where the model starts emitting self-review tokens like "Wait, let me reconsider…", one of the most-cited results of 2025.
3. The Qwen Series — Trilingual Strength from China
**Qwen2.5 Technical Report** (2024-12, [arXiv:2412.15115](https://arxiv.org/abs/2412.15115)) and **Qwen3 Technical Report** (2025-Q2) cover sizes from 0.5B up to 72B, plus 128K-context, multimodal, math- and code-specialized variants. The Qwen family often beats Llama on **CJK (Chinese, Japanese, Korean)** workloads, and Qwen2.5-Coder 32B held the top SWE-Bench score among open-weight coding models for some time. In 2026 it is the most common base for Korean and Japanese startups training their own models.
4. Mistral and Mistral Large 2 — Europe Responds
**Mistral 7B** (2023-10, [arXiv:2310.06825](https://arxiv.org/abs/2310.06825)) combined sliding-window attention and grouped-query attention to beat Llama 2 13B at 7B size — a milestone. In 2024 **Mistral Large 2** (123B) and in 2025 **Mistral Medium 3** shipped under Apache 2.0 or Mistral Research License, anchoring the European open-weight position. **Mixtral 8x7B** and **Mixtral 8x22B** defined the sparse-MoE standard before DeepSeek; **Codestral** at 22B is still a common code-specific pick.
5. The Phi Series — "Data Quality Equals Model Quality"
**Phi-3 Technical Report** (2024-04, [arXiv:2404.14219](https://arxiv.org/abs/2404.14219)) and **Phi-4** (2024-12, [arXiv:2412.08905](https://arxiv.org/abs/2412.08905)) are the high point of the Microsoft Research SLM (small language model) line. The thesis is simple: **train only on "textbook quality data"** and a 3.8B model can beat GPT-3.5. Phi-4 at 14B caught Llama 3 70B on GPQA and MATH, and **Phi-4-reasoning** showed an o1-mini-class reasoner — evidence that SLMs can reason too.
6. Gemma 3 and Falcon 3 — The Rest of the Open Camp
**Gemma 3 Technical Report** (2025-Q1) ships 1B / 4B / 12B / 27B and ports some Gemini 2.0 internals (attention variants, distillation) to open weights. 128K context and multimodality come built in.
**Falcon 3** (TII, UAE) and **Command R+** (Cohere) emphasize English, Arabic, and multilingual RAG rather than CJK. **Yi-Lightning** (01.AI) and **GLM-4-9B** (Zhipu) are less known outside China but show up high on Chatbot Arena.
7. Commercial Model Cards — GPT-4, Claude 4.7, Gemini 2.5
For closed models the **system card** is the source of record, not the paper.
- **GPT-4 Technical Report** (2023, [arXiv:2303.08774](https://arxiv.org/abs/2303.08774)) — architecture details withheld but the evaluation methodology and safety procedures set a baseline.
- **OpenAI o1 System Card** (2024-09) — the first commercial reasoning model, RL plus CoT integrated at training.
- **OpenAI o3 / o4 System Card** (2025) — the first model to clear average human on ARC-AGI.
- **Anthropic Claude 4 / 4.5 / 4.7 Model Card** — successors to Constitutional AI, sycophancy mitigation, citation features, computer use capability descriptions.
- **Google Gemini 1.5 / 2.0 / 2.5 Technical Report** ([arXiv:2403.05530](https://arxiv.org/abs/2403.05530)) — 1M to 10M token context with native multimodality.
You read commercial cards for **evaluation methodology, safety interventions, and limitations**, not for benchmark numbers.
8. Mixture-of-Experts — Switch Transformer to DeepSeekMoE
MoE re-emerged in 2021 with **Switch Transformer** ([arXiv:2101.03961](https://arxiv.org/abs/2101.03961)), continued through **GShard**, **GLaM**, and **ST-MoE**, and stepped up again in 2024 with **DeepSeekMoE** ([arXiv:2401.06066](https://arxiv.org/abs/2401.06066)). Two ideas matter: **fine-grained expert segmentation** (more, smaller experts) and **shared expert isolation** (separate experts for common knowledge). DeepSeek-V3's 256+1 expert configuration follows directly.
**Mixtral of Experts** ([arXiv:2401.04088](https://arxiv.org/abs/2401.04088)) activates top-2 of 8 experts and is the most cited sparse-MoE implementation. **OLMoE** (Allen AI) was the first MoE to publish full training code and data.
9. Attention Innovations — MLA, GQA, Sliding Window, Mamba
**GQA: Grouped-Query Attention** ([arXiv:2305.13245](https://arxiv.org/abs/2305.13245)) — multiple query heads share KV heads. Default in Llama 2/3, Mistral, and nearly every modern model.
**MLA: Multi-head Latent Attention** ([arXiv:2405.04434](https://arxiv.org/abs/2405.04434), DeepSeek-V2 paper) — low-rank compression of KV cache. ~80% memory savings at the same context.
**Sliding Window Attention** — used by Longformer ([arXiv:2004.05150](https://arxiv.org/abs/2004.05150)) and Mistral 7B. Local window plus global tokens.
**Mamba / Mamba-2** ([arXiv:2312.00752](https://arxiv.org/abs/2312.00752), [arXiv:2405.21060](https://arxiv.org/abs/2405.21060)) — SSM (state-space model) based. O(N) instead of O(N²) attention. Throughput wins on long context. Hybrid stacks (transformer + Mamba blocks) appeared experimentally in 2025-2026 — **Jamba** (AI21), **Zamba2** (Zyphra).
**RWKV-7** — attempts to match transformers with an RNN; a candidate for mobile and embedded.
10. The Reasoning Lineage — CoT, ToT, Self-Consistency, GRPO
**Chain-of-Thought Prompting** ([arXiv:2201.11903](https://arxiv.org/abs/2201.11903), Wei et al. 2022) — "Let's think step by step" doubles GSM8K accuracy.
**Self-Consistency** ([arXiv:2203.11171](https://arxiv.org/abs/2203.11171)) — sample many, majority-vote. +10-20% over single-sample on reasoning tasks.
**Tree-of-Thoughts** ([arXiv:2305.10601](https://arxiv.org/abs/2305.10601)) — search the reasoning tree. Effective on Game of 24 and creative writing.
**Reflexion** ([arXiv:2303.11366](https://arxiv.org/abs/2303.11366)) — log failures as text memory for the next attempt.
**OpenAI o1** (blog, 2024-09) and **DeepSeek-R1 GRPO** — surface long CoT through RL during training. The reason every 2026 frontier model has a "thinking" mode.
**Inference-Time Scaling Laws** ([arXiv:2408.03314](https://arxiv.org/abs/2408.03314)) — at fixed compute, spending more on inference can beat spending more on parameters.
Inference-time scaling — Best-of-N with a verifier
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
def best_of_n(prompt, n=16, verifier=None):
inputs = tok(prompt, return_tensors="pt")
candidates = []
for _ in range(n):
out = model.generate(
**inputs,
do_sample=True,
temperature=0.8,
max_new_tokens=512,
)
text = tok.decode(out[0], skip_special_tokens=True)
score = verifier(text) if verifier else len(text)
candidates.append((score, text))
return max(candidates, key=lambda x: x[0])[1]
11. The RLHF Lineage — InstructGPT, Constitutional AI, DPO
**InstructGPT** ([arXiv:2203.02155](https://arxiv.org/abs/2203.02155), Ouyang et al. 2022) — the canonical RLHF paper. PPO + reward model + KL penalty in three stages.
**Constitutional AI** ([arXiv:2212.08073](https://arxiv.org/abs/2212.08073), Anthropic 2022) — replaces human preferences with an **AI-authored constitution** for self-critique. The origin of RLAIF.
**DPO: Direct Preference Optimization** ([arXiv:2305.18290](https://arxiv.org/abs/2305.18290), Rafailov et al. 2023) — learns from preference pairs directly, no reward model. Eliminates PPO's machinery while matching performance. De facto standard since 2024.
**ORPO** ([arXiv:2403.07691](https://arxiv.org/abs/2403.07691)) — merges SFT and preference learning into one loss. Single-stage RLHF.
**KTO: Kahneman-Tversky Optimization** ([arXiv:2402.01306](https://arxiv.org/abs/2402.01306)) — trains from single labels (good/bad) instead of pairs. Lower labeling cost.
**SimPO** ([arXiv:2405.14734](https://arxiv.org/abs/2405.14734)) — drops the reference-model dependency of DPO. Memory savings.
Quick comparison:
| Algorithm | Reward model | Reference model | Label format |
| --- | --- | --- | --- |
| PPO (RLHF) | yes | yes | pair |
| DPO | no | yes | pair |
| ORPO | no | no | pair + SFT |
| KTO | no | yes | single |
| SimPO | no | no | pair |
12. Agents — ReAct, Voyager, SWE-Agent, OS-Atlas
**ReAct** ([arXiv:2210.03629](https://arxiv.org/abs/2210.03629)) — interleave reasoning and acting. Foundation of almost every LLM agent framework.
**Voyager** ([arXiv:2305.16291](https://arxiv.org/abs/2305.16291)) — lifelong-learning agent in Minecraft. Auto-builds a skill library.
**SWE-Agent** ([arXiv:2405.15793](https://arxiv.org/abs/2405.15793)) — designs an **agent-computer interface (ACI)** rather than reusing a human IDE. Pushed GPT-4 on SWE-Bench from 12.5% to 18.0%.
**OS-Atlas** ([arXiv:2410.23218](https://arxiv.org/abs/2410.23218)) — grounding model for GUI agents. Screen capture to coordinates/actions.
**Computer Use survey** — after Anthropic's Claude Computer Use (2024-10) the field got a proper benchmark in **OSWorld** ([arXiv:2404.07972](https://arxiv.org/abs/2404.07972)).
Minimal ReAct pseudo-code
def react_agent(task, tools, llm, max_steps=10):
trajectory = [f"Task: {task}"]
for step in range(max_steps):
thought = llm(trajectory + ["Thought:"])
action = llm(trajectory + ["Action:"])
if action.startswith("Finish"):
return action
observation = tools.run(action)
trajectory.append(f"Thought: {thought}\nAction: {action}\nObservation: {observation}")
return "Max steps reached"
13. The RAG Lineage — From the Original to GraphRAG
**RAG (Retrieval-Augmented Generation)** ([arXiv:2005.11401](https://arxiv.org/abs/2005.11401), Lewis et al. 2020) — combined retrieval and generation. Standard for open-domain QA.
**FiD: Fusion-in-Decoder** ([arXiv:2007.01282](https://arxiv.org/abs/2007.01282)) — fuses passages inside the decoder. Stronger than RAG but with higher decoder context cost.
**RETRO** ([arXiv:2112.04426](https://arxiv.org/abs/2112.04426), DeepMind) — 2T-token data store outside the model; chunk-wise retrieval.
**ColBERT / ColBERTv2** ([arXiv:2004.12832](https://arxiv.org/abs/2004.12832)) — late interaction. Token-level query-document matching, the accuracy standard for dense retrieval.
**Self-RAG** ([arXiv:2310.11511](https://arxiv.org/abs/2310.11511)) — the model decides whether to retrieve and emits self-reflection tokens.
**GraphRAG** ([arXiv:2404.16130](https://arxiv.org/abs/2404.16130), Microsoft 2024) — turns documents into a knowledge graph and searches via community summaries. Strong on global queries (summary, trend).
**Contextual Retrieval** (Anthropic blog, 2024-09) — prepend a context prefix to each chunk before embedding. Cuts retrieval failure from 49% to 35%.
14. FlashAttention 1/2/3 — Rediscovering the Memory Hierarchy
**FlashAttention** ([arXiv:2205.14135](https://arxiv.org/abs/2205.14135), Dao et al. 2022) — tile attention to stay inside SRAM. Cuts HBM I/O for a 7.6x speedup.
**FlashAttention-2** ([arXiv:2307.08691](https://arxiv.org/abs/2307.08691)) — re-architected work partitioning. 2x faster. Most training stacks migrated.
**FlashAttention-3** ([arXiv:2407.08608](https://arxiv.org/abs/2407.08608)) — exploits Hopper (H100/H200) async wgmma and TMA. 75% MFU on FP16, 1.2 PFLOPS on FP8.
Calling FlashAttention from torch — 2026 standard
q = torch.randn(2, 8, 4096, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 8, 4096, 128, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 8, 4096, 128, device="cuda", dtype=torch.bfloat16)
PyTorch 2.x SDPA picks the FlashAttention backend automatically
with torch.backends.cuda.sdp_kernel(
enable_flash=True, enable_math=False, enable_mem_efficient=False
):
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape) # [2, 8, 4096, 128]
15. vLLM and SGLang — Serving Infrastructure Standards
**vLLM PagedAttention** ([arXiv:2309.06180](https://arxiv.org/abs/2309.06180), Kwon et al. 2023) — manages KV cache like OS paging. Memory fragmentation drops from ~90% to ~4%. Throughput 2-4x over HuggingFace TGI or NVIDIA Triton.
**SGLang RadixAttention** ([arXiv:2312.07104](https://arxiv.org/abs/2312.07104)) — shares KV cache in a radix tree. 5x faster on multi-turn or few-shot workloads with overlapping system prompts.
**Mixture-of-Depths** ([arXiv:2404.02258](https://arxiv.org/abs/2404.02258), DeepMind 2024) — dynamically skips transformer layers per token. Same quality, fewer FLOPS.
**Speculative Decoding** ([arXiv:2211.17192](https://arxiv.org/abs/2211.17192), Leviathan et al. 2022) — a small draft model proposes tokens, the large model verifies. Base 2-3x speedup.
vLLM standard serving config — 2026 production pattern
docker run --gpus all -p 8000:8000 \
-v ~/models:/models \
vllm/vllm-openai:latest \
--model /models/Llama-3.3-70B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching \
--enable-chunked-prefill
16. Long Context — RoPE, YaRN, LongLoRA
**RoPE: Rotary Positional Embedding** ([arXiv:2104.09864](https://arxiv.org/abs/2104.09864)) — the Llama-family positional encoding standard.
**YaRN** ([arXiv:2309.00071](https://arxiv.org/abs/2309.00071)) — NTK-aware scaling of RoPE. Extends a 4K-trained model to 128K.
**LongLoRA** ([arXiv:2309.12307](https://arxiv.org/abs/2309.12307)) — sparse local attention plus LoRA for efficient context extension.
**RingAttention** ([arXiv:2310.01889](https://arxiv.org/abs/2310.01889)) — ring-topology KV communication across devices. Enables training at 1M+ context.
**Activation Beacon** ([arXiv:2401.03462](https://arxiv.org/abs/2401.03462)) — compress context into beacon tokens. Efficient retrieval.
Gemini 1.5 Pro at 1M tokens and Gemini 2.5 at 10M sit on combinations of these techniques.
17. Code LLMs — StarCoder, DeepSeek Coder, Codestral
**StarCoder 2** ([arXiv:2402.19173](https://arxiv.org/abs/2402.19173), BigCode 2024) — 619 programming languages, 4T+ tokens. Full weights and training data open.
**DeepSeek Coder V2** ([arXiv:2406.11931](https://arxiv.org/abs/2406.11931)) — 236B MoE, 21B active. Matches GPT-4 Turbo on HumanEval and MBPP. V3 scales to 671B MoE.
**Codestral** (Mistral, 2024-05) — 22B, 80 languages, 32K context. Frequent pick for IDE integrations.
**Code Llama** ([arXiv:2308.12950](https://arxiv.org/abs/2308.12950)) — Code-specialized Llama 2 variant. Code Llama 70B briefly led open-weight coding.
**Qwen2.5-Coder** (32B) — Qwen's coding variant. Held #1 open SWE-Bench for a while.
18. Small Models — The SLM Renaissance
One of the bigger 2024-2026 shifts: **"small can punch above its weight."**
- **Phi-3.5 Mini** (3.8B) — strong general model that runs on phones.
- **Gemma 2B / 3 1B** — edge-friendly 1B-class.
- **Qwen2.5 3B / 7B** — multilingual SLM standards.
- **Mistral 7B / Mistral Nemo 12B** — classic-size standards.
- **SmolLM2** ([arXiv:2502.02737](https://arxiv.org/abs/2502.02737)) — 360M and 1.7B trained on 11T tokens. SmolLM-Corpus data catalog shipped alongside.
- **TinyLlama** ([arXiv:2401.02385](https://arxiv.org/abs/2401.02385)) — 1.1B trained on 3T tokens.
In 2026 most mobile and embedded LLMs derive from this set.
19. Evaluation — From MMLU and HumanEval to SWE-Bench and OSWorld
Traditional benchmarks:
- **MMLU** ([arXiv:2009.03300](https://arxiv.org/abs/2009.03300)) — 57-domain multiple choice.
- **GSM8K** ([arXiv:2110.14168](https://arxiv.org/abs/2110.14168)) — grade-school math.
- **MATH** ([arXiv:2103.03874](https://arxiv.org/abs/2103.03874)) — competition math.
- **HumanEval** ([arXiv:2107.03374](https://arxiv.org/abs/2107.03374)) — code completion.
- **BIG-Bench Hard** ([arXiv:2210.09261](https://arxiv.org/abs/2210.09261)).
2024-2026 next generation:
- **GPQA** ([arXiv:2311.12022](https://arxiv.org/abs/2311.12022)) — PhD-level STEM.
- **MMLU-Pro** ([arXiv:2406.01574](https://arxiv.org/abs/2406.01574)) — shuffled answers and harder MMLU.
- **ARC-AGI** (Chollet) — general-intelligence test. o3 first to clear average human.
- **SWE-Bench** ([arXiv:2310.06770](https://arxiv.org/abs/2310.06770)) and **SWE-Bench Verified** — real GitHub issue resolution.
- **OSWorld** ([arXiv:2404.07972](https://arxiv.org/abs/2404.07972)) — computer-use agents.
- **MMMU** ([arXiv:2311.16502](https://arxiv.org/abs/2311.16502)) — multimodal multiple choice.
- **LMSYS Chatbot Arena** ([arXiv:2403.04132](https://arxiv.org/abs/2403.04132)) — human pairwise vote, ELO.
In 2026 GSM8K and HumanEval are saturated on frontier models; the meaningful signal moved to SWE-Bench, OSWorld, GPQA, and ARC-AGI.
20. Main Model Comparison Table
| Model | Released | Size | MMLU | HumanEval | GSM8K | SWE-Bench |
| --- | --- | --- | --- | --- | --- | --- |
| Llama 3.1 70B | 2024-07 | 70B | 86.0 | 80.5 | 95.1 | 31.2 |
| Llama 3.3 70B | 2024-12 | 70B | 86.9 | 88.4 | 96.5 | 41.4 |
| DeepSeek-V3 | 2024-12 | 671B MoE | 88.5 | 89.0 | 89.3 | 42.0 |
| DeepSeek-R1 | 2025-01 | 671B MoE | 91.2 | 96.3 | 97.3 | 49.2 |
| Qwen2.5-72B | 2024-09 | 72B | 86.1 | 86.6 | 95.8 | 36.0 |
| Mistral Large 2 | 2024-07 | 123B | 84.0 | 92.0 | 93.0 | 32.0 |
| Phi-4 | 2024-12 | 14B | 84.8 | 82.6 | 80.4 | - |
| Gemma 3 27B | 2025-Q1 | 27B | 81.0 | 79.8 | 89.2 | 28.5 |
| GPT-4o | 2024-05 | ? | 88.7 | 90.2 | 95.8 | 33.2 |
| Claude 4.7 | 2026 | ? | 90.1 | 96.3 | 96.4 | 65+ |
| Gemini 2.5 Pro | 2025 | ? | 89.8 | 92.0 | 95.4 | 51.0 |
Numbers from each model card or the LMSYS / Open LLM Leaderboard averages. Don't read the table as a ranking — read it for "which axes saturate and which still have headroom each generation."
21. Safety and Alignment — Constitutional AI, Sycophancy, Refusal
**Constitutional AI** ([arXiv:2212.08073](https://arxiv.org/abs/2212.08073)) opened the door to reducing human labels in RLHF via model self-critique.
**Discovering Language Model Behaviors with Model-Written Evaluations** ([arXiv:2212.09251](https://arxiv.org/abs/2212.09251)) — measure subtle alignment failures like sycophancy using the model itself.
**Universal and Transferable Adversarial Attacks on Aligned Language Models** ([arXiv:2307.15043](https://arxiv.org/abs/2307.15043), the GCG attack) — adversarial suffixes can break alignment in a systematic way.
**Jailbreak Survey** ([arXiv:2402.13457](https://arxiv.org/abs/2402.13457)) — taxonomy of jailbreaks through 2024.
**Sleeper Agents** ([arXiv:2401.05566](https://arxiv.org/abs/2401.05566), Anthropic) — training-time backdoors survive standard safety training. An important paper on the limits of alignment.
**Tamper-Resistant Safeguards** ([arXiv:2408.00761](https://arxiv.org/abs/2408.00761)) — attempts to make open-weight safety robust to additional fine-tuning.
22. Korean Models — HyperCLOVA X, EXAONE 3.5, Kanana
**HyperCLOVA X Technical Report** ([arXiv:2404.01954](https://arxiv.org/abs/2404.01954), Naver 2024) — Korean-English bilingual plus Korean culture, law, and medical evaluation sets (KoBigBench, KMMLU). The de facto baseline report for Korean LLMs.
**EXAONE 3.5** (LG AI Research, 2024-12) — 2.4B / 7.8B / 32B. English-Korean bilingual, 32K context. Released under the EXAONE AI Model License rather than Apache 2.0, but research use is permitted.
**Kanana** (Kakao, 2025) — 2B / 8B / 32B. Korean and English. Internal LLM backbone for KakaoTalk.
**KORAi / KORani / KoGPT / Polyglot-Ko** — earlier Korean models. From 2025 the three above are the practical majors.
**KMMLU** ([arXiv:2402.11548](https://arxiv.org/abs/2402.11548)) — Korean MMLU. The default Korean-LLM evaluation.
23. Japanese Models — Sakana, Stockmark, Swallow, PLaMo
**Sakana AI Evolutionary Optimization of Model Merging Recipes** ([arXiv:2403.13187](https://arxiv.org/abs/2403.13187)) — evolution-algorithm-driven automatic multilingual model merging. EvoLLM-JP marked a new direction for Japanese LLMs.
**Stockmark-100b** (Stockmark, 2024) — 100B Japanese-English bilingual model trained on a Japanese business corpus.
**Swallow** (Tokyo Tech, [arXiv:2404.17790](https://arxiv.org/abs/2404.17790)) — continual pretraining of Llama 2/3 on Japanese corpora.
**PLaMo 2 / 100B** (Preferred Networks) — Japanese, English, code. PFN's own training corpus.
**NEC cotomi** — Japanese business-domain LLM. 130B and 7B variants.
**Rakuten AI 7B**, **Karasu**, **Stable LM Japanese** and other 7B-class Japanese models are plentiful.
**JGLUE / Japanese MT-Bench** — Japanese evaluation standards.
24. Data — Dolma, RedPajama, FineWeb
The three open-data majors.
- **Dolma** ([arXiv:2402.00159](https://arxiv.org/abs/2402.00159), AI2) — 3T tokens. Used for OLMo training.
- **RedPajama-Data-v2** (Together AI, 2023-10) — 30T tokens. Multilingual plus English.
- **FineWeb** ([arXiv:2406.17557](https://arxiv.org/abs/2406.17557), HuggingFace) — 15T tokens plus FineWeb-Edu 1.3T variant.
**The Pile** ([arXiv:2101.00027](https://arxiv.org/abs/2101.00027), EleutherAI) — the 800GB starting point of open LLMs in 2021.
Common Crawl and the cleanup pipelines on top of it (CCNet, DataComp-LM, **TxT360**, **Nemotron-CC**) are the 2026 standards for open-data rationalization.
25. Multimodal — LLaVA, CogVLM, Qwen-VL, Pixtral
**LLaVA** ([arXiv:2304.08485](https://arxiv.org/abs/2304.08485), 2023) — Vicuna plus a CLIP visual encoder plus a projection. The starting point of open multimodal.
**LLaVA-1.5 / LLaVA-NeXT** — better resolution handling and multi-turn.
**Qwen-VL / Qwen2-VL** ([arXiv:2308.12966](https://arxiv.org/abs/2308.12966), [arXiv:2409.12191](https://arxiv.org/abs/2409.12191)) — arbitrary resolution and multilingual OCR. Qwen2.5-VL adds video.
**Pixtral 12B** (Mistral, 2024-09) — Pixtral's vision encoder handles arbitrary-resolution patches.
**Idefics 3** (HuggingFace) — open data plus open weights multimodal.
**Molmo** (AI2, [arXiv:2409.17146](https://arxiv.org/abs/2409.17146)) — pointing as a training task. Strong fit for agent stacks.
26. Reading Order — A Curated 30 for 2026 Engineers
If you can only read 30, do them in this order:
1. Llama 3 Technical Report — the full picture of modern LLM training.
2. DeepSeek-V3 Technical Report — peak cost-efficient training.
3. DeepSeek-R1 — RL-based reasoning.
4. Mixtral of Experts — the MoE standard.
5. DeepSeekMoE — fine-grained MoE.
6. GQA and MLA — two axes of attention efficiency.
7. FlashAttention-2 — the training-speed standard.
8. vLLM PagedAttention — the serving standard.
9. SGLang RadixAttention — KV cache sharing.
10. CoT Prompting — the reasoning starting point.
11. DPO — the post-training standard.
12. Constitutional AI — origin of RLAIF.
13. ReAct — the agent starting point.
14. SWE-Agent — code-agent standard.
15. OSWorld — computer-use evaluation.
16. RAG (the original) — retrieval combination.
17. ColBERTv2 — dense retrieval accuracy.
18. GraphRAG — global RAG.
19. Self-RAG — self-retrieval.
20. YaRN — RoPE scaling.
21. RingAttention — long-context training.
22. Speculative Decoding — decoding acceleration.
23. Phi-3 / Phi-4 — SLM renaissance.
24. SmolLM2 — open SLM data.
25. MMLU and GPQA — the evaluation baselines.
26. SWE-Bench Verified — code evaluation.
27. LMSYS Chatbot Arena — human preference.
28. Sleeper Agents — limits of alignment.
29. HyperCLOVA X — Korean LLM baseline.
30. Sakana EvoLLM — model merging.
One paper a week for 30 weeks, or 30 days flat, gets the whole 2026 LLM landscape into your head.
References
- arxiv.org — [https://arxiv.org/](https://arxiv.org/)
- Llama 3 Technical Report — [https://arxiv.org/abs/2407.21783](https://arxiv.org/abs/2407.21783)
- DeepSeek-V3 Technical Report — [https://arxiv.org/abs/2412.19437](https://arxiv.org/abs/2412.19437)
- DeepSeek-R1 — [https://arxiv.org/abs/2501.12948](https://arxiv.org/abs/2501.12948)
- Qwen2.5 Technical Report — [https://arxiv.org/abs/2412.15115](https://arxiv.org/abs/2412.15115)
- Mistral 7B — [https://arxiv.org/abs/2310.06825](https://arxiv.org/abs/2310.06825)
- Mixtral of Experts — [https://arxiv.org/abs/2401.04088](https://arxiv.org/abs/2401.04088)
- Phi-3 Technical Report — [https://arxiv.org/abs/2404.14219](https://arxiv.org/abs/2404.14219)
- Phi-4 — [https://arxiv.org/abs/2412.08905](https://arxiv.org/abs/2412.08905)
- Gemini 1.5 — [https://arxiv.org/abs/2403.05530](https://arxiv.org/abs/2403.05530)
- Switch Transformer — [https://arxiv.org/abs/2101.03961](https://arxiv.org/abs/2101.03961)
- DeepSeekMoE — [https://arxiv.org/abs/2401.06066](https://arxiv.org/abs/2401.06066)
- GQA — [https://arxiv.org/abs/2305.13245](https://arxiv.org/abs/2305.13245)
- MLA / DeepSeek-V2 — [https://arxiv.org/abs/2405.04434](https://arxiv.org/abs/2405.04434)
- Mamba — [https://arxiv.org/abs/2312.00752](https://arxiv.org/abs/2312.00752)
- Mamba-2 — [https://arxiv.org/abs/2405.21060](https://arxiv.org/abs/2405.21060)
- Chain-of-Thought — [https://arxiv.org/abs/2201.11903](https://arxiv.org/abs/2201.11903)
- Self-Consistency — [https://arxiv.org/abs/2203.11171](https://arxiv.org/abs/2203.11171)
- Tree-of-Thoughts — [https://arxiv.org/abs/2305.10601](https://arxiv.org/abs/2305.10601)
- Inference-Time Scaling — [https://arxiv.org/abs/2408.03314](https://arxiv.org/abs/2408.03314)
- InstructGPT — [https://arxiv.org/abs/2203.02155](https://arxiv.org/abs/2203.02155)
- Constitutional AI — [https://arxiv.org/abs/2212.08073](https://arxiv.org/abs/2212.08073)
- DPO — [https://arxiv.org/abs/2305.18290](https://arxiv.org/abs/2305.18290)
- ORPO — [https://arxiv.org/abs/2403.07691](https://arxiv.org/abs/2403.07691)
- KTO — [https://arxiv.org/abs/2402.01306](https://arxiv.org/abs/2402.01306)
- SimPO — [https://arxiv.org/abs/2405.14734](https://arxiv.org/abs/2405.14734)
- ReAct — [https://arxiv.org/abs/2210.03629](https://arxiv.org/abs/2210.03629)
- Voyager — [https://arxiv.org/abs/2305.16291](https://arxiv.org/abs/2305.16291)
- SWE-Agent — [https://arxiv.org/abs/2405.15793](https://arxiv.org/abs/2405.15793)
- OS-Atlas — [https://arxiv.org/abs/2410.23218](https://arxiv.org/abs/2410.23218)
- OSWorld — [https://arxiv.org/abs/2404.07972](https://arxiv.org/abs/2404.07972)
- RAG — [https://arxiv.org/abs/2005.11401](https://arxiv.org/abs/2005.11401)
- FiD — [https://arxiv.org/abs/2007.01282](https://arxiv.org/abs/2007.01282)
- RETRO — [https://arxiv.org/abs/2112.04426](https://arxiv.org/abs/2112.04426)
- ColBERT — [https://arxiv.org/abs/2004.12832](https://arxiv.org/abs/2004.12832)
- Self-RAG — [https://arxiv.org/abs/2310.11511](https://arxiv.org/abs/2310.11511)
- GraphRAG — [https://arxiv.org/abs/2404.16130](https://arxiv.org/abs/2404.16130)
- FlashAttention — [https://arxiv.org/abs/2205.14135](https://arxiv.org/abs/2205.14135)
- FlashAttention-2 — [https://arxiv.org/abs/2307.08691](https://arxiv.org/abs/2307.08691)
- FlashAttention-3 — [https://arxiv.org/abs/2407.08608](https://arxiv.org/abs/2407.08608)
- vLLM PagedAttention — [https://arxiv.org/abs/2309.06180](https://arxiv.org/abs/2309.06180)
- SGLang — [https://arxiv.org/abs/2312.07104](https://arxiv.org/abs/2312.07104)
- Speculative Decoding — [https://arxiv.org/abs/2211.17192](https://arxiv.org/abs/2211.17192)
- Mixture-of-Depths — [https://arxiv.org/abs/2404.02258](https://arxiv.org/abs/2404.02258)
- RoPE — [https://arxiv.org/abs/2104.09864](https://arxiv.org/abs/2104.09864)
- YaRN — [https://arxiv.org/abs/2309.00071](https://arxiv.org/abs/2309.00071)
- LongLoRA — [https://arxiv.org/abs/2309.12307](https://arxiv.org/abs/2309.12307)
- RingAttention — [https://arxiv.org/abs/2310.01889](https://arxiv.org/abs/2310.01889)
- Activation Beacon — [https://arxiv.org/abs/2401.03462](https://arxiv.org/abs/2401.03462)
- StarCoder 2 — [https://arxiv.org/abs/2402.19173](https://arxiv.org/abs/2402.19173)
- DeepSeek Coder V2 — [https://arxiv.org/abs/2406.11931](https://arxiv.org/abs/2406.11931)
- Code Llama — [https://arxiv.org/abs/2308.12950](https://arxiv.org/abs/2308.12950)
- MMLU — [https://arxiv.org/abs/2009.03300](https://arxiv.org/abs/2009.03300)
- GSM8K — [https://arxiv.org/abs/2110.14168](https://arxiv.org/abs/2110.14168)
- MATH — [https://arxiv.org/abs/2103.03874](https://arxiv.org/abs/2103.03874)
- HumanEval — [https://arxiv.org/abs/2107.03374](https://arxiv.org/abs/2107.03374)
- GPQA — [https://arxiv.org/abs/2311.12022](https://arxiv.org/abs/2311.12022)
- SWE-Bench — [https://arxiv.org/abs/2310.06770](https://arxiv.org/abs/2310.06770)
- MMMU — [https://arxiv.org/abs/2311.16502](https://arxiv.org/abs/2311.16502)
- LMSYS Chatbot Arena — [https://arxiv.org/abs/2403.04132](https://arxiv.org/abs/2403.04132)
- HyperCLOVA X — [https://arxiv.org/abs/2404.01954](https://arxiv.org/abs/2404.01954)
- KMMLU — [https://arxiv.org/abs/2402.11548](https://arxiv.org/abs/2402.11548)
- Sakana EvoLLM — [https://arxiv.org/abs/2403.13187](https://arxiv.org/abs/2403.13187)
- Swallow — [https://arxiv.org/abs/2404.17790](https://arxiv.org/abs/2404.17790)
- Sleeper Agents — [https://arxiv.org/abs/2401.05566](https://arxiv.org/abs/2401.05566)
- HuggingFace — [https://huggingface.co/](https://huggingface.co/)
- Meta AI Research — [https://ai.meta.com/research/](https://ai.meta.com/research/)
- DeepSeek — [https://www.deepseek.com/](https://www.deepseek.com/)
- Qwen — [https://qwenlm.github.io/](https://qwenlm.github.io/)
- Mistral AI — [https://mistral.ai/news/](https://mistral.ai/news/)
- OpenAI Research — [https://openai.com/research/](https://openai.com/research/)
- Anthropic Research — [https://www.anthropic.com/research](https://www.anthropic.com/research)
- Google DeepMind Research — [https://deepmind.google/research/](https://deepmind.google/research/)
- vLLM — [https://github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)
- SGLang — [https://github.com/sgl-project/sglang](https://github.com/sgl-project/sglang)
현재 단락 (1/300)
Between January 2024 and May 2026, arXiv `cs.CL` and `cs.LG` averaged over 1,200 new submissions per...