Prologue — Surviving the 2026 LLM Paper Firehose

Between January 2024 and May 2026, arXiv `cs.CL` and `cs.LG` averaged over 1,200 new submissions per week. Filter down to LLM-specific work and you still get ~300 a week, ~15,000 a year. No single person can read it all.

The question a working engineer asks in 2026 is therefore simple: **"which 30 papers actually help the system I'm building today?"**

This post curates that 30 plus a margin. Three criteria:

- **Reproducible** — code, weights, or enough detail to rebuild.

- **Cited in the field** — referenced in model cards, benchmark reports, production blog posts.

- **Durable** — the core insight survives the next model release six months out.

> One-line summary: read in this order — **foundation model reports → MoE / attention innovations → RLHF and DPO family → CoT and reasoning → agents and retrieval → FlashAttention and serving → evaluation and safety**. One week and the whole 2026 LLM landscape is in your head.

1. Llama 3 — the New Open-Weight Baseline

**Llama 3 / Llama 3.3 Technical Report** (2024-07, [arXiv:2407.21783](https://arxiv.org/abs/2407.21783))

Meta released Llama 3 across 8B, 70B, and 405B and effectively reset the open-weight baseline. The 92-page technical report documents the **data curation pipeline** (15T tokens), **scaling law re-validation**, **post-training recipe** (SFT + DPO + Rejection Sampling), and **infrastructure** (a 16K H100 cluster that logged 419 unexpected interruptions during a 54-day stretch of pre-training, with faulty GPUs and GPU HBM3 memory as the two leading causes). This one report is the de facto textbook on how a modern LLM is built. The 8B variant is still, in 2026, the most common fine-tuning base.

Llama 3.3 70B kept the architecture and only strengthened post-training, reaching GPT-4o-level instruction following. With Llama 4 shipping a multimodal MoE in April 2025, "Llama equals open LLM standard" is now the working assumption.

2. DeepSeek-V3 and R1 — Peak MoE and Reasoning RL

**DeepSeek-V3 Technical Report** (2024-12, [arXiv:2412.19437](https://arxiv.org/abs/2412.19437))

A 671B-parameter MoE (37B active per token) trained on 14.8T tokens. The report prices the final training run at about $5.58M: 2.788M H800 GPU-hours at an assumed $2 per GPU-hour, excluding earlier research and ablation runs. That headline number shook the industry. The technical contributions worth knowing: **MLA (Multi-head Latent Attention)** shrinks the KV cache by roughly an order of magnitude; **DeepSeekMoE** uses 256 routed experts plus 1 shared expert; **auxiliary-loss-free load balancing**, **FP8 training**, and **DualPipe pipeline parallelism** are now standard references for follow-on open models.

**DeepSeek-R1** (2025-01, [arXiv:2501.12948](https://arxiv.org/abs/2501.12948))

R1 takes V3 as the base and reaches o1-class reasoning **with pure RL**. The key algorithm is **GRPO (Group Relative Policy Optimization)**, which drops PPO's value network to save memory. The R1-Zero report — pure RL, no SFT — describes an "aha moment" where the model starts emitting self-review tokens like "Wait, let me reconsider…", one of the most-cited results of 2025.
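The "group relative" part of GRPO can be sketched in a few lines. The sketch assumes several completions sampled for the same prompt, each with a scalar reward, and standardizes the rewards within that group so no learned value network is needed. The function name, the use of the population standard deviation, and the epsilon are illustrative assumptions, not taken from DeepSeek's code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt; 1.0 means a rule-based checker accepted it.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Accepted answers get a positive advantage and rejected ones a negative advantage, and a group where every answer scores the same contributes no gradient signal.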

3. The Qwen Series — Trilingual Strength from China

**Qwen2.5 Technical Report** (2024-12, [arXiv:2412.15115](https://arxiv.org/abs/2412.15115)) and **Qwen3 Technical Report** (2025-Q2): Qwen2.5 spans dense models from 0.5B to 72B, and Qwen3 extends the line to a 235B MoE, alongside 128K-context, multimodal, math- and code-specialized variants. The Qwen family often beats Llama on **CJK (Chinese, Japanese, Korean)** workloads, and Qwen2.5-Coder 32B held the top SWE-Bench score among open-weight coding models for some time. In 2026 it is the most common base for Korean and Japanese startups training their own models.

4. Mistral and Mistral Large 2 — Europe Responds

**Mistral 7B** (2023-10, [arXiv:2310.06825](https://arxiv.org/abs/2310.06825)) combined sliding-window attention and grouped-query attention to beat Llama 2 13B at 7B size, a milestone. **Mistral Large 2** (123B, 2024) shipped open weights under the Mistral Research License, while **Mistral Medium 3** (2025) is served through Mistral's API and cloud partners rather than as open weights; together they anchor the European position. **Mixtral 8x7B** and **Mixtral 8x22B** (both Apache 2.0) defined the sparse-MoE standard before DeepSeek; **Codestral** at 22B is still a common code-specific pick.

5. The Phi Series — "Data Quality Equals Model Quality"

**Phi-3 Technical Report** (2024-04, [arXiv:2404.14219](https://arxiv.org/abs/2404.14219)) and **Phi-4** (2024-12, [arXiv:2412.08905](https://arxiv.org/abs/2412.08905)) are the high point of the Microsoft Research SLM (small language model) line. The thesis is simple: **train on heavily filtered and synthetic "textbook quality" data** and a 3.8B model can rival GPT-3.5. Phi-4 at 14B caught Llama 3 70B on GPQA and MATH, and **Phi-4-reasoning** showed an o1-mini-class reasoner, which is evidence that SLMs can reason too.

6. Gemma 3 and Falcon 3 — The Rest of the Open Camp

**Gemma 3 Technical Report** (2025-Q1) ships 1B / 4B / 12B / 27B and ports some Gemini 2.0 internals (attention variants, distillation) to open weights. 128K context and multimodality come built in.

**Falcon 3** (TII, UAE) and **Command R+** (Cohere) emphasize English, Arabic, and multilingual RAG rather than CJK. **Yi-Lightning** (01.AI) and **GLM-4-9B** (Zhipu) are less known outside China but show up high on Chatbot Arena.

7. Commercial Model Cards — GPT-4, Claude 4.7, Gemini 2.5

For closed models the **system card** is the source of record, not the paper.

- **GPT-4 Technical Report** (2023, [arXiv:2303.08774](https://arxiv.org/abs/2303.08774)) — architecture details withheld but the evaluation methodology and safety procedures set a baseline.

- **OpenAI o1 System Card** (2024-09) — the first commercial reasoning model, RL plus CoT integrated at training.

- **OpenAI o3 and o4-mini System Card** (2025) — o3 is the first model reported to clear the average-human score on ARC-AGI.

- **Anthropic Claude 4 / 4.5 / 4.7 Model Card** — successors to Constitutional AI, sycophancy mitigation, citation features, computer use capability descriptions.

- **Google Gemini 1.5 / 2.0 / 2.5 Technical Report** ([arXiv:2403.05530](https://arxiv.org/abs/2403.05530)) — 1M to 10M token context with native multimodality.

You read commercial cards for **evaluation methodology, safety interventions, and limitations**, not for benchmark numbers.

8. Mixture-of-Experts — Switch Transformer to DeepSeekMoE

MoE re-emerged at scale with **GShard** (2020) and **Switch Transformer** (2021, [arXiv:2101.03961](https://arxiv.org/abs/2101.03961)), continued through **GLaM** and **ST-MoE**, and stepped up again in 2024 with **DeepSeekMoE** ([arXiv:2401.06066](https://arxiv.org/abs/2401.06066)). Two ideas matter: **fine-grained expert segmentation** (more, smaller experts) and **shared expert isolation** (dedicated always-on experts for common knowledge). DeepSeek-V3's 256+1 expert configuration follows directly.

**Mixtral of Experts** ([arXiv:2401.04088](https://arxiv.org/abs/2401.04088)) activates top-2 of 8 experts and is the most cited sparse-MoE implementation. **OLMoE** (Allen AI) was the first MoE to publish full training code and data.
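To make the routing concrete, here is a minimal sketch of top-k expert selection with an optional shared expert. It follows the Mixtral-style gate (softmax over the selected logits only); DeepSeek's gating and load balancing differ in detail. The helper names and the scalar toy "experts" are hypothetical and chosen only to keep the arithmetic readable.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    weights = softmax([router_logits[i] for i in order])
    return list(zip(order, weights))

def moe_layer(x, router_logits, routed_experts, shared_expert=None, k=2):
    """Mix the top-k routed experts, plus an always-on shared expert if given."""
    y = sum(w * routed_experts[i](x) for i, w in route_top_k(router_logits, k))
    if shared_expert is not None:
        y += shared_expert(x)  # shared-expert isolation: bypasses the router
    return y

# Toy scalar experts: expert i multiplies its input by i.
experts = [lambda x, s=s: s * x for s in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]
picked = route_top_k(logits, k=2)
print(picked)
```

Only the selected experts run for a given token, which is why total parameter count and per-token compute diverge in sparse MoE models.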

9. Attention Innovations — MLA, GQA, Sliding Window, Mamba

**GQA: Grouped-Query Attention** ([arXiv:2305.13245](https://arxiv.org/abs/2305.13245)) — several query heads share one KV head, which shrinks the KV cache. Used in Llama 2 70B, Llama 3, Mistral, and nearly every modern model.

**MLA: Multi-head Latent Attention** ([arXiv:2405.04434](https://arxiv.org/abs/2405.04434), DeepSeek-V2 paper) — low-rank compression of the KV cache. The DeepSeek-V2 paper reports a 93.3% smaller KV cache than DeepSeek 67B.
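A quick back-of-the-envelope calculation shows why the number of KV heads matters. The sketch below assumes a Llama-3-8B-like shape (32 layers, 32 query heads, 8 KV heads, head dimension 128) and a 16-bit cache; the function name is illustrative. MLA goes further by caching a low-rank latent instead of full keys and values, which this sketch does not model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence (factor 2 = K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed shape: 32 layers, head_dim 128, 8192-token context, bf16 cache.
mha = kv_cache_bytes(32, 32, 128, seq_len=8192)  # one KV head per query head
gqa = kv_cache_bytes(32, 8, 128, seq_len=8192)   # 4 query heads share each KV head
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB")
```

With these assumed numbers, grouping four query heads per KV head cuts the per-sequence cache by a factor of four.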

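**Sliding Window Attention** — used by Longformer ([arXiv:2004.05150](https://arxiv.org/abs/2004.05150)) and Mistral 7B. Each token attends to a fixed local window; Longformer adds a few global tokens, while Mistral 7B relies on stacked layers to carry information beyond its 4,096-token window.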

**Mamba / Mamba-2** ([arXiv:2312.00752](https://arxiv.org/abs/2312.00752), [arXiv:2405.21060](https://arxiv.org/abs/2405.21060)) — SSM (state-space model) based. Cost grows as O(N) in sequence length, versus O(N²) for attention, so throughput wins on long context. Hybrid stacks (transformer + Mamba blocks) have shipped since 2024: **Jamba** (AI21), **Zamba2** (Zyphra).

**RWKV-7** — attempts to match transformers with an RNN; a candidate for mobile and embedded.

10. The Reasoning Lineage — CoT, ToT, Self-Consistency, GRPO

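**Chain-of-Thought Prompting** ([arXiv:2201.11903](https://arxiv.org/abs/2201.11903), Wei et al. 2022) — few-shot exemplars with worked reasoning steps lift PaLM 540B on GSM8K from 17.9% to 56.9%. The zero-shot "Let's think step by step" trigger comes from the follow-up by Kojima et al. ([arXiv:2205.11916](https://arxiv.org/abs/2205.11916)).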

**Self-Consistency** ([arXiv:2203.11171](https://arxiv.org/abs/2203.11171)) — sample many reasoning paths and majority-vote the final answer. Roughly 10 to 20 points over single-sample CoT on arithmetic reasoning benchmarks.
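Self-consistency reduces to a majority vote over final answers. A minimal sketch, assuming a hypothetical `sample_answer` callable that runs one sampled reasoning chain and returns only its final answer:

```python
from collections import Counter

def self_consistency(sample_answer, n=10):
    """Call sample_answer() n times; return (majority answer, vote share)."""
    votes = Counter(sample_answer() for _ in range(n))
    answer, count = votes.most_common(1)[0]
    return answer, count / n

# Stand-in for five sampled reasoning chains, reduced to their final answers.
fake_samples = iter(["18", "18", "20", "18", "17"])
print(self_consistency(lambda: next(fake_samples), n=5))
```

The vote share doubles as a cheap confidence signal: a narrow majority suggests the problem deserves more samples or a stronger verifier.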

**Tree-of-Thoughts** ([arXiv:2305.10601](https://arxiv.org/abs/2305.10601)) — search the reasoning tree. Effective on Game of 24 and creative writing.

**Reflexion** ([arXiv:2303.11366](https://arxiv.org/abs/2303.11366)) — log failures as text memory for the next attempt.

**OpenAI o1** (blog, 2024-09) and **DeepSeek-R1 GRPO** — surface long CoT through RL during training. The reason every 2026 frontier model has a "thinking" mode.

**Inference-Time Scaling Laws** ([arXiv:2408.03314](https://arxiv.org/abs/2408.03314)) — at fixed compute, spending more on inference can beat spending more on parameters.

```python
# Inference-time scaling — Best-of-N with a verifier
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def best_of_n(prompt, n=16, verifier=None):
    """Sample n completions and return the one the verifier scores highest."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    prompt_len = inputs["input_ids"].shape[1]
    candidates = []
    for _ in range(n):
        out = model.generate(
            **inputs,
            do_sample=True,
            temperature=0.8,
            max_new_tokens=512,
        )
        # Decode only the newly generated tokens, not the echoed prompt.
        text = tok.decode(out[0][prompt_len:], skip_special_tokens=True)
        # Without a verifier, length is only a placeholder score, not a quality signal.
        score = verifier(text) if verifier else len(text)
        candidates.append((score, text))
    return max(candidates, key=lambda x: x[0])[1]
```

11. The RLHF Lineage — InstructGPT, Constitutional AI, DPO

**InstructGPT** ([arXiv:2203.02155](https://arxiv.org/abs/2203.02155), Ouyang et al. 2022) — the canonical RLHF paper. Three stages: supervised fine-tuning, reward-model training, then PPO against the reward model with a KL penalty.

**Constitutional AI** ([arXiv:2212.08073](https://arxiv.org/abs/2212.08073), Anthropic 2022) — replaces human harmlessness labels with AI feedback guided by a **human-written constitution**: the model critiques and revises its own outputs, then supplies the preference labels. The origin of RLAIF.

**DPO: Direct Preference Optimization** ([arXiv:2305.18290](https://arxiv.org/abs/2305.18290), Rafailov et al. 2023) — learns from preference pairs directly, no reward model. Eliminates PPO's machinery while matching performance. De facto standard since 2024.
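The DPO objective for a single preference pair fits in a few lines. The sketch below assumes you already have summed sequence log-probabilities for the chosen and rejected answers under both the policy and the frozen reference model; the function and argument names are illustrative.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probs."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # answer over the rejected one, relative to the reference model.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as softplus(-margin); guard against overflow.
    return math.log1p(math.exp(-margin)) if margin > -30 else -margin

print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
```

When the policy matches the reference the loss is log 2, and it falls as the policy widens the gap in favor of the chosen answer. No reward model appears anywhere, which is the point of the method.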

**ORPO** ([arXiv:2403.07691](https://arxiv.org/abs/2403.07691)) — merges SFT and preference learning into one loss. Single-stage RLHF.

**KTO: Kahneman-Tversky Optimization** ([arXiv:2402.01306](https://arxiv.org/abs/2402.01306)) — trains from single labels (good/bad) instead of pairs. Lower labeling cost.

**SimPO** ([arXiv:2405.14734](https://arxiv.org/abs/2405.14734)) — drops the reference-model dependency of DPO. Memory savings.

Quick comparison:

| Method | Reward model | Reference model | Data |
| --- | --- | --- | --- |
| PPO (RLHF) | yes | yes | pair |
| DPO | no | yes | pair |
| ORPO | no | no | pair + SFT |
| KTO | no | yes | single |
| SimPO | no | no | pair |

12. Agents — ReAct, Voyager, SWE-Agent, OS-Atlas

**ReAct** ([arXiv:2210.03629](https://arxiv.org/abs/2210.03629)) — interleave reasoning and acting. Foundation of almost every LLM agent framework.

**Voyager** ([arXiv:2305.16291](https://arxiv.org/abs/2305.16291)) — lifelong-learning agent in Minecraft. Auto-builds a skill library.

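**SWE-Agent** ([arXiv:2405.15793](https://arxiv.org/abs/2405.15793)) — designs an **agent-computer interface (ACI)** rather than reusing a human IDE. With GPT-4 Turbo it resolved 12.5% of the full SWE-Bench test set (18.0% on SWE-Bench Lite), against 3.8% for the best earlier non-interactive baseline.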

**OS-Atlas** ([arXiv:2410.23218](https://arxiv.org/abs/2410.23218)) — grounding model for GUI agents. Screen capture to coordinates/actions.

**Computer use** — **OSWorld** ([arXiv:2404.07972](https://arxiv.org/abs/2404.07972), 2024-04) gave the field a proper benchmark, and Anthropic's Claude Computer Use launch (2024-10) reported its results on it.

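```python
# Minimal ReAct pseudo-code
# Assumes `llm` maps a prompt string to text and `tools.run` executes an action string.
def react_agent(task, tools, llm, max_steps=10):
    trajectory = [f"Task: {task}"]
    for _ in range(max_steps):
        # Reason first, then choose an action conditioned on that thought.
        thought = llm("\n".join(trajectory + ["Thought:"])).strip()
        trajectory.append(f"Thought: {thought}")
        action = llm("\n".join(trajectory + ["Action:"])).strip()
        if action.startswith("Finish"):
            return action
        observation = tools.run(action)
        trajectory.append(f"Action: {action}\nObservation: {observation}")
    return "Max steps reached"
```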

13. The RAG Lineage — From the Original to GraphRAG

**RAG (Retrieval-Augmented Generation)** ([arXiv:2005.11401](https://arxiv.org/abs/2005.11401), Lewis et al. 2020) — combined retrieval and generation. Standard for open-domain QA.

**FiD: Fusion-in-Decoder** ([arXiv:2007.01282](https://arxiv.org/abs/2007.01282)) — fuses passages inside the decoder. Stronger than RAG but with higher decoder context cost.

**RETRO** ([arXiv:2112.04426](https://arxiv.org/abs/2112.04426), DeepMind) — 2T-token data store outside the model; chunk-wise retrieval.

**ColBERT / ColBERTv2** ([arXiv:2004.12832](https://arxiv.org/abs/2004.12832)) — late interaction. Token-level query-document matching, the accuracy standard for dense retrieval.
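Late interaction comes down to a "MaxSim" sum: every query token picks its best-matching document token, and those maxima are added up. A minimal sketch with toy 2-d vectors (real ColBERT uses normalized token embeddings from a BERT encoder; the function name is illustrative):

```python
def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: each query token takes its best-matching document token."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy token embeddings: doc_a covers both query tokens, doc_b only partly.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
doc_b = [[0.7, 0.7]]
print(maxsim_score(query, doc_a), maxsim_score(query, doc_b))
```

Because document token vectors can be precomputed and indexed, only the cheap max-and-sum step runs at query time.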

**Self-RAG** ([arXiv:2310.11511](https://arxiv.org/abs/2310.11511)) — the model decides whether to retrieve and emits self-reflection tokens.

**GraphRAG** ([arXiv:2404.16130](https://arxiv.org/abs/2404.16130), Microsoft 2024) — turns documents into a knowledge graph and searches via community summaries. Strong on global queries (summary, trend).

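**Contextual Retrieval** (Anthropic blog, 2024-09) — prepend a short, chunk-specific context to each chunk before embedding and indexing. Anthropic reports a 35% lower top-20 retrieval failure rate with contextual embeddings alone, and 49% lower when combined with contextual BM25.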

14. FlashAttention 1/2/3 — Rediscovering the Memory Hierarchy

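**FlashAttention** ([arXiv:2205.14135](https://arxiv.org/abs/2205.14135), Dao et al. 2022) — tile attention so each block stays inside on-chip SRAM. The result is exact attention with far less HBM I/O, and up to a 7.6x speedup on the attention computation itself.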

**FlashAttention-2** ([arXiv:2307.08691](https://arxiv.org/abs/2307.08691)) — re-architected work partitioning. 2x faster. Most training stacks migrated.

**FlashAttention-3** ([arXiv:2407.08608](https://arxiv.org/abs/2407.08608)) — exploits Hopper (H100) asynchronous wgmma and TMA. Up to 75% hardware utilization in FP16 and close to 1.2 PFLOPS in FP8.

```python
# Calling FlashAttention from torch — 2026 standard
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Shapes are (batch, heads, seq_len, head_dim).
q = torch.randn(2, 8, 4096, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 8, 4096, 128, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 8, 4096, 128, device="cuda", dtype=torch.bfloat16)

# PyTorch 2.x SDPA picks a backend automatically; this context manager
# (PyTorch 2.3+, replacing the deprecated torch.backends.cuda.sdp_kernel)
# restricts it to the FlashAttention kernel.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)  # torch.Size([2, 8, 4096, 128])
```
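The tiling idea rests on an online softmax: a running maximum and running denominator let each key/value block be processed once without materializing the full score row. The pure-Python sketch below shows only that algebra for a single query row. It omits the 1/sqrt(d) scaling, masking, and everything GPU-specific, and the function names are illustrative.

```python
import math

def attention_row_tiled(q, keys, values, block=2):
    """One query row of softmax(q K^T) V, visiting K/V one block at a time."""
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys[start:start + block]]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max, then add this block.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values[start:start + block]):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

def attention_row_full(q, keys, values):
    """Reference: materialize every score, then apply softmax."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    ws = [math.exp(s - m) for s in scores]
    total = sum(ws)
    return [sum(w * v[j] for w, v in zip(ws, values)) / total
            for j in range(len(values[0]))]

q_row = [0.5, -1.0]
ks = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0], [0.5, 0.5]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(attention_row_tiled(q_row, ks, vs), attention_row_full(q_row, ks, vs))
```

Both functions return the same row up to floating-point error; the tiled version just never holds more than one block of scores at a time.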

15. vLLM and SGLang — Serving Infrastructure Standards

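**vLLM PagedAttention** ([arXiv:2309.06180](https://arxiv.org/abs/2309.06180), Kwon et al. 2023) — manages KV cache like OS paging. The vLLM authors report that earlier systems waste 60-80% of KV-cache memory to fragmentation and over-reservation; PagedAttention brings that under 4%. Throughput is 2-4x that of FasterTransformer and Orca at comparable latency.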

**SGLang RadixAttention** ([arXiv:2312.07104](https://arxiv.org/abs/2312.07104)) — shares KV cache across requests through a radix tree of prefixes. Up to 5x higher throughput on multi-turn or few-shot workloads with overlapping system prompts.

**Mixture-of-Depths** ([arXiv:2404.02258](https://arxiv.org/abs/2404.02258), DeepMind 2024) — dynamically skips transformer layers per token. Same quality, fewer FLOPS.

**Speculative Decoding** ([arXiv:2211.17192](https://arxiv.org/abs/2211.17192), Leviathan et al. 2022) — a small draft model proposes tokens, the large model verifies them in parallel. Typically a 2-3x speedup with no change to the output distribution.
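The accept-or-resample rule is what keeps speculative decoding faithful to the large model. The sketch below shows it for a single proposed token over a toy three-token vocabulary, with both distributions given as plain probability lists. A real implementation drafts several tokens and verifies them in one forward pass; the function name and distributions here are illustrative.

```python
import random

def speculative_step(p_target, q_draft, rng=random):
    """Propose one token from the draft distribution, then accept or resample.

    p_target and q_draft are probability lists over the same vocabulary.
    Returns (token, accepted).
    """
    vocab = range(len(q_draft))
    x = rng.choices(vocab, weights=q_draft)[0]
    # Accept the draft token with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    return rng.choices(vocab, weights=residual)[0], False

rng = random.Random(0)
print(speculative_step([0.7, 0.2, 0.1], [0.3, 0.3, 0.4], rng))
```

Tokens produced this way follow the target distribution, so the better the draft model matches the target, the more proposals are accepted and the larger the speedup.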

```bash
# vLLM standard serving config — 2026 production pattern
docker run --gpus all -p 8000:8000 \
  -v ~/models:/models \
  vllm/vllm-openai:latest \
  --model /models/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --enable-chunked-prefill
```
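The paging idea behind PagedAttention can be illustrated with a toy block allocator. This is a sketch of the bookkeeping only, not vLLM's implementation: the class and method names are hypothetical, and real servers also handle block sharing, copy-on-write, and preemption.

```python
class PagedKVCache:
    """Toy block allocator: each sequence maps logical blocks to physical ones."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token; allocate a block only when needed."""
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # last block is full, or no block yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(40):
    cache.append_token("req-1")
print(len(cache.block_tables["req-1"]), "blocks used,", len(cache.free_blocks), "free")
```

A 40-token sequence occupies three 16-token blocks, so at most one partly filled block is wasted per sequence instead of a full maximum-length reservation.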

16. Long Context — RoPE, YaRN, LongLoRA

**RoPE: Rotary Positional Embedding** ([arXiv:2104.09864](https://arxiv.org/abs/2104.09864)) — the Llama-family positional encoding standard.
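The property that makes RoPE attractive is that the query-key dot product depends only on the relative offset between positions. A minimal sketch, assuming the adjacent-pair layout from the RoFormer paper (some libraries pair dimension i with i + d/2 instead) and the usual base of 10000:

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of vec by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q_vec = [0.3, -1.2, 0.8, 0.5]
k_vec = [1.0, 0.4, -0.7, 0.2]
# The same offset (3) at different absolute positions gives the same score.
near = dot(rope(q_vec, 5), rope(k_vec, 2))
far = dot(rope(q_vec, 105), rope(k_vec, 102))
print(near, far)
```

Context-extension methods such as YaRN work by rescaling these rotation frequencies so that positions beyond the training length map back into a range the model has seen.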

**YaRN** ([arXiv:2309.00071](https://arxiv.org/abs/2309.00071)) — NTK-aware scaling of RoPE. Extends a 4K-trained model to 128K.

**LongLoRA** ([arXiv:2309.12307](https://arxiv.org/abs/2309.12307)) — sparse local attention plus LoRA for efficient context extension.

**RingAttention** ([arXiv:2310.01889](https://arxiv.org/abs/2310.01889)) — ring-topology KV communication across devices. Enables training at 1M+ context.

**Activation Beacon** ([arXiv:2401.03462](https://arxiv.org/abs/2401.03462)) — compresses context activations into beacon tokens, so usable context grows without a matching growth in KV cache.

Gemini 1.5 Pro's 1M-token context, and the up-to-10M-token experiments in its technical report, sit on combinations of these techniques.

17. Code LLMs — StarCoder, DeepSeek Coder, Codestral

**StarCoder 2** ([arXiv:2402.19173](https://arxiv.org/abs/2402.19173), BigCode 2024) — 619 programming languages, 4T+ tokens. Full weights and training data open.

**DeepSeek Coder V2** ([arXiv:2406.11931](https://arxiv.org/abs/2406.11931)) — 236B MoE, 21B active. Matches GPT-4 Turbo on HumanEval and MBPP. The general-purpose DeepSeek-V3 later scaled the same MoE approach to 671B.

**Codestral** (Mistral, 2024-05) — 22B, 80 languages, 32K context. Frequent pick for IDE integrations.

**Code Llama** ([arXiv:2308.12950](https://arxiv.org/abs/2308.12950)) — Code-specialized Llama 2 variant. Code Llama 70B briefly led open-weight coding.

**Qwen2.5-Coder** (32B) — Qwen's coding variant. Held #1 open SWE-Bench for a while.

18. Small Models — The SLM Renaissance

One of the bigger 2024-2026 shifts: **"small can punch above its weight."**

- **Phi-3.5 Mini** (3.8B) — strong general model that runs on phones.

- **Gemma 2B / 3 1B** — edge-friendly 1B-class.

- **Qwen2.5 3B / 7B** — multilingual SLM standards.

- **Mistral 7B / Mistral Nemo 12B** — classic-size standards.

- **SmolLM2** ([arXiv:2502.02737](https://arxiv.org/abs/2502.02737)) — 135M, 360M, and 1.7B; the 1.7B model is trained on about 11T tokens. SmolLM-Corpus data catalog shipped alongside.

- **TinyLlama** ([arXiv:2401.02385](https://arxiv.org/abs/2401.02385)) — 1.1B trained on 3T tokens.

In 2026 most mobile and embedded LLMs derive from this set.

19. Evaluation — From MMLU and HumanEval to SWE-Bench and OSWorld

Traditional benchmarks:

- **MMLU** ([arXiv:2009.03300](https://arxiv.org/abs/2009.03300)) — 57-domain multiple choice.

- **GSM8K** ([arXiv:2110.14168](https://arxiv.org/abs/2110.14168)) — grade-school math.

- **MATH** ([arXiv:2103.03874](https://arxiv.org/abs/2103.03874)) — competition math.

- **HumanEval** ([arXiv:2107.03374](https://arxiv.org/abs/2107.03374)) — code completion.

- **BIG-Bench Hard** ([arXiv:2210.09261](https://arxiv.org/abs/2210.09261)).

2024-2026 next generation:

- **GPQA** ([arXiv:2311.12022](https://arxiv.org/abs/2311.12022)) — PhD-level STEM.

- **MMLU-Pro** ([arXiv:2406.01574](https://arxiv.org/abs/2406.01574)) — a harder MMLU with ten answer options instead of four and more reasoning-heavy questions.

- **ARC-AGI** (Chollet) — general-intelligence test. o3 first to clear average human.

- **SWE-Bench** ([arXiv:2310.06770](https://arxiv.org/abs/2310.06770)) and **SWE-Bench Verified** — real GitHub issue resolution.

- **OSWorld** ([arXiv:2404.07972](https://arxiv.org/abs/2404.07972)) — computer-use agents.

- **MMMU** ([arXiv:2311.16502](https://arxiv.org/abs/2311.16502)) — multimodal multiple choice.

- **LMSYS Chatbot Arena** ([arXiv:2403.04132](https://arxiv.org/abs/2403.04132)) — human pairwise votes aggregated into Elo-style ratings.

In 2026 GSM8K and HumanEval are saturated on frontier models; the meaningful signal moved to SWE-Bench, OSWorld, GPQA, and ARC-AGI.
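HumanEval-style code benchmarks are usually reported as pass@k. The unbiased estimator from the HumanEval paper draws n samples per problem, counts the c that pass the unit tests, and computes the chance that a random subset of k contains at least one pass. A minimal sketch (the function name is illustrative):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: P(at least one of k samples passes), given c of n passed."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 samples for one problem, 20 of which pass the unit tests.
print(pass_at_k(200, 20, 1), pass_at_k(200, 20, 10))
```

Averaging this value over all problems gives the benchmark score, and comparing pass@1 with pass@10 shows how much headroom sampling alone still buys.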

20. Main Model Comparison Table

| Model | Release | Params | MMLU | HumanEval | GSM8K | SWE-Bench |
| --- | --- | --- | --- | --- | --- | --- |
| Llama 3.1 70B | 2024-07 | 70B | 86.0 | 80.5 | 95.1 | 31.2 |
| Llama 3.3 70B | 2024-12 | 70B | 86.9 | 88.4 | 96.5 | 41.4 |
| DeepSeek-V3 | 2024-12 | 671B MoE | 88.5 | 89.0 | 89.3 | 42.0 |
| DeepSeek-R1 | 2025-01 | 671B MoE | 91.2 | 96.3 | 97.3 | 49.2 |
| Qwen2.5-72B | 2024-09 | 72B | 86.1 | 86.6 | 95.8 | 36.0 |
| Mistral Large 2 | 2024-07 | 123B | 84.0 | 92.0 | 93.0 | 32.0 |
| Phi-4 | 2024-12 | 14B | 84.8 | 82.6 | 80.4 | - |
| Gemma 3 27B | 2025-Q1 | 27B | 81.0 | 79.8 | 89.2 | 28.5 |
| GPT-4o | 2024-05 | ? | 88.7 | 90.2 | 95.8 | 33.2 |
| Claude 4.7 | 2026 | ? | 90.1 | 96.3 | 96.4 | 65+ |
| Gemini 2.5 Pro | 2025 | ? | 89.8 | 92.0 | 95.4 | 51.0 |

Numbers from each model card or the LMSYS / Open LLM Leaderboard averages. Don't read the table as a ranking — read it for "which axes saturate and which still have headroom each generation."

21. Safety and Alignment — Constitutional AI, Sycophancy, Refusal

**Constitutional AI** ([arXiv:2212.08073](https://arxiv.org/abs/2212.08073)) opened the door to reducing human labels in RLHF via model self-critique.

**Discovering Language Model Behaviors with Model-Written Evaluations** ([arXiv:2212.09251](https://arxiv.org/abs/2212.09251)) — measure subtle alignment failures like sycophancy using the model itself.

**Universal and Transferable Adversarial Attacks on Aligned Language Models** ([arXiv:2307.15043](https://arxiv.org/abs/2307.15043), the GCG attack) — adversarial suffixes can break alignment in a systematic way.

**Jailbreak Survey** ([arXiv:2402.13457](https://arxiv.org/abs/2402.13457)) — taxonomy of jailbreaks through 2024.

**Sleeper Agents** ([arXiv:2401.05566](https://arxiv.org/abs/2401.05566), Anthropic) — training-time backdoors survive standard safety training. An important paper on the limits of alignment.

**Tamper-Resistant Safeguards** ([arXiv:2408.00761](https://arxiv.org/abs/2408.00761)) — attempts to make open-weight safety robust to additional fine-tuning.

22. Korean Models — HyperCLOVA X, EXAONE 3.5, Kanana

**HyperCLOVA X Technical Report** ([arXiv:2404.01954](https://arxiv.org/abs/2404.01954), Naver 2024) — Korean-English bilingual plus Korean culture, law, and medical evaluation sets (KoBigBench, KMMLU). The de facto baseline report for Korean LLMs.

**EXAONE 3.5** (LG AI Research, 2024-12) — 2.4B / 7.8B / 32B. English-Korean bilingual, 32K context. Released under the EXAONE AI Model License rather than Apache 2.0, but research use is permitted.

**Kanana** (Kakao, 2025) — 2B / 8B / 32B. Korean and English. Internal LLM backbone for KakaoTalk.

**KORAi / KORani / KoGPT / Polyglot-Ko** — earlier Korean models. From 2025 the three above are the practical majors.

**KMMLU** ([arXiv:2402.11548](https://arxiv.org/abs/2402.11548)) — Korean MMLU. The default Korean-LLM evaluation.

23. Japanese Models — Sakana, Stockmark, Swallow, PLaMo

**Sakana AI Evolutionary Optimization of Model Merging Recipes** ([arXiv:2403.13187](https://arxiv.org/abs/2403.13187)) — evolution-algorithm-driven automatic multilingual model merging. EvoLLM-JP marked a new direction for Japanese LLMs.

**Stockmark-100b** (Stockmark, 2024) — 100B Japanese-English bilingual model trained on a Japanese business corpus.

**Swallow** (Tokyo Tech, [arXiv:2404.17790](https://arxiv.org/abs/2404.17790)) — continual pretraining of Llama 2/3 on Japanese corpora.

**PLaMo 2 / 100B** (Preferred Networks) — Japanese, English, code. PFN's own training corpus.

**NEC cotomi** — Japanese business-domain LLM. 130B and 7B variants.

**Rakuten AI 7B**, **Karasu**, **Stable LM Japanese** and other 7B-class Japanese models are plentiful.

**JGLUE / Japanese MT-Bench** — Japanese evaluation standards.

24. Data — Dolma, RedPajama, FineWeb

The three open-data majors.

- **Dolma** ([arXiv:2402.00159](https://arxiv.org/abs/2402.00159), AI2) — 3T tokens. Used for OLMo training.

- **RedPajama-Data-v2** (Together AI, 2023-10) — 30T tokens. Multilingual plus English.

- **FineWeb** ([arXiv:2406.17557](https://arxiv.org/abs/2406.17557), HuggingFace) — 15T tokens plus FineWeb-Edu 1.3T variant.

**The Pile** ([arXiv:2101.00027](https://arxiv.org/abs/2101.00027), EleutherAI) — the 800GB starting point of open LLMs in 2021.

Common Crawl and the cleanup pipelines on top of it (CCNet, DataComp-LM, **TxT360**, **Nemotron-CC**) are the 2026 standards for open-data curation.

25. Multimodal — LLaVA, CogVLM, Qwen-VL, Pixtral

**LLaVA** ([arXiv:2304.08485](https://arxiv.org/abs/2304.08485), 2023) — Vicuna plus a CLIP visual encoder plus a projection. The starting point of open multimodal.

**LLaVA-1.5 / LLaVA-NeXT** — better resolution handling and multi-turn.

**Qwen-VL / Qwen2-VL** ([arXiv:2308.12966](https://arxiv.org/abs/2308.12966), [arXiv:2409.12191](https://arxiv.org/abs/2409.12191)) — arbitrary resolution and multilingual OCR. Qwen2.5-VL adds video.

**Pixtral 12B** (Mistral, 2024-09) — its vision encoder handles images at their native resolution and aspect ratio.

**Idefics 3** (HuggingFace) — open data plus open weights multimodal.

**Molmo** (AI2, [arXiv:2409.17146](https://arxiv.org/abs/2409.17146)) — pointing as a training task. Strong fit for agent stacks.

26. Reading Order — A Curated 30 for 2026 Engineers

If you can only read 30, do them in this order:

1. Llama 3 Technical Report — the full picture of modern LLM training.

2. DeepSeek-V3 Technical Report — peak cost-efficient training.

3. DeepSeek-R1 — RL-based reasoning.

4. Mixtral of Experts — the MoE standard.

5. DeepSeekMoE — fine-grained MoE.

6. GQA and MLA — two axes of attention efficiency.

7. FlashAttention-2 — the training-speed standard.

8. vLLM PagedAttention — the serving standard.

9. SGLang RadixAttention — KV cache sharing.

10. CoT Prompting — the reasoning starting point.

11. DPO — the post-training standard.

12. Constitutional AI — origin of RLAIF.

13. ReAct — the agent starting point.

14. SWE-Agent — code-agent standard.

15. OSWorld — computer-use evaluation.

16. RAG (the original) — retrieval combination.

17. ColBERTv2 — dense retrieval accuracy.

18. GraphRAG — global RAG.

19. Self-RAG — self-retrieval.

20. YaRN — RoPE scaling.

21. RingAttention — long-context training.

22. Speculative Decoding — decoding acceleration.

23. Phi-3 / Phi-4 — SLM renaissance.

24. SmolLM2 — open SLM data.

25. MMLU and GPQA — the evaluation baselines.

26. SWE-Bench Verified — code evaluation.

27. LMSYS Chatbot Arena — human preference.

28. Sleeper Agents — limits of alignment.

29. HyperCLOVA X — Korean LLM baseline.

30. Sakana EvoLLM — model merging.

One paper a week for 30 weeks, or 30 days flat, gets the whole 2026 LLM landscape into your head.

References

- arxiv.org — [https://arxiv.org/](https://arxiv.org/)

- Llama 3 Technical Report — [https://arxiv.org/abs/2407.21783](https://arxiv.org/abs/2407.21783)

- DeepSeek-V3 Technical Report — [https://arxiv.org/abs/2412.19437](https://arxiv.org/abs/2412.19437)

- DeepSeek-R1 — [https://arxiv.org/abs/2501.12948](https://arxiv.org/abs/2501.12948)

- Qwen2.5 Technical Report — [https://arxiv.org/abs/2412.15115](https://arxiv.org/abs/2412.15115)

- Mistral 7B — [https://arxiv.org/abs/2310.06825](https://arxiv.org/abs/2310.06825)

- Mixtral of Experts — [https://arxiv.org/abs/2401.04088](https://arxiv.org/abs/2401.04088)

- Phi-3 Technical Report — [https://arxiv.org/abs/2404.14219](https://arxiv.org/abs/2404.14219)

- Phi-4 — [https://arxiv.org/abs/2412.08905](https://arxiv.org/abs/2412.08905)

- Gemini 1.5 — [https://arxiv.org/abs/2403.05530](https://arxiv.org/abs/2403.05530)

- Switch Transformer — [https://arxiv.org/abs/2101.03961](https://arxiv.org/abs/2101.03961)

- DeepSeekMoE — [https://arxiv.org/abs/2401.06066](https://arxiv.org/abs/2401.06066)

- GQA — [https://arxiv.org/abs/2305.13245](https://arxiv.org/abs/2305.13245)

- MLA / DeepSeek-V2 — [https://arxiv.org/abs/2405.04434](https://arxiv.org/abs/2405.04434)

- Mamba — [https://arxiv.org/abs/2312.00752](https://arxiv.org/abs/2312.00752)

- Mamba-2 — [https://arxiv.org/abs/2405.21060](https://arxiv.org/abs/2405.21060)

- Chain-of-Thought — [https://arxiv.org/abs/2201.11903](https://arxiv.org/abs/2201.11903)

- Self-Consistency — [https://arxiv.org/abs/2203.11171](https://arxiv.org/abs/2203.11171)

- Tree-of-Thoughts — [https://arxiv.org/abs/2305.10601](https://arxiv.org/abs/2305.10601)

- Inference-Time Scaling — [https://arxiv.org/abs/2408.03314](https://arxiv.org/abs/2408.03314)

- InstructGPT — [https://arxiv.org/abs/2203.02155](https://arxiv.org/abs/2203.02155)

- Constitutional AI — [https://arxiv.org/abs/2212.08073](https://arxiv.org/abs/2212.08073)

- DPO — [https://arxiv.org/abs/2305.18290](https://arxiv.org/abs/2305.18290)

- ORPO — [https://arxiv.org/abs/2403.07691](https://arxiv.org/abs/2403.07691)

- KTO — [https://arxiv.org/abs/2402.01306](https://arxiv.org/abs/2402.01306)

- SimPO — [https://arxiv.org/abs/2405.14734](https://arxiv.org/abs/2405.14734)

- ReAct — [https://arxiv.org/abs/2210.03629](https://arxiv.org/abs/2210.03629)

- Voyager — [https://arxiv.org/abs/2305.16291](https://arxiv.org/abs/2305.16291)

- SWE-Agent — [https://arxiv.org/abs/2405.15793](https://arxiv.org/abs/2405.15793)

- OS-Atlas — [https://arxiv.org/abs/2410.23218](https://arxiv.org/abs/2410.23218)

- OSWorld — [https://arxiv.org/abs/2404.07972](https://arxiv.org/abs/2404.07972)

- RAG — [https://arxiv.org/abs/2005.11401](https://arxiv.org/abs/2005.11401)

- FiD — [https://arxiv.org/abs/2007.01282](https://arxiv.org/abs/2007.01282)

- RETRO — [https://arxiv.org/abs/2112.04426](https://arxiv.org/abs/2112.04426)

- ColBERT — [https://arxiv.org/abs/2004.12832](https://arxiv.org/abs/2004.12832)

- Self-RAG — [https://arxiv.org/abs/2310.11511](https://arxiv.org/abs/2310.11511)

- GraphRAG — [https://arxiv.org/abs/2404.16130](https://arxiv.org/abs/2404.16130)

- FlashAttention — [https://arxiv.org/abs/2205.14135](https://arxiv.org/abs/2205.14135)

- FlashAttention-2 — [https://arxiv.org/abs/2307.08691](https://arxiv.org/abs/2307.08691)

- FlashAttention-3 — [https://arxiv.org/abs/2407.08608](https://arxiv.org/abs/2407.08608)

- vLLM PagedAttention — [https://arxiv.org/abs/2309.06180](https://arxiv.org/abs/2309.06180)

- SGLang — [https://arxiv.org/abs/2312.07104](https://arxiv.org/abs/2312.07104)

- Speculative Decoding — [https://arxiv.org/abs/2211.17192](https://arxiv.org/abs/2211.17192)

- Mixture-of-Depths — [https://arxiv.org/abs/2404.02258](https://arxiv.org/abs/2404.02258)

- RoPE — [https://arxiv.org/abs/2104.09864](https://arxiv.org/abs/2104.09864)

- YaRN — [https://arxiv.org/abs/2309.00071](https://arxiv.org/abs/2309.00071)

- LongLoRA — [https://arxiv.org/abs/2309.12307](https://arxiv.org/abs/2309.12307)

- RingAttention — [https://arxiv.org/abs/2310.01889](https://arxiv.org/abs/2310.01889)

- Activation Beacon — [https://arxiv.org/abs/2401.03462](https://arxiv.org/abs/2401.03462)

- StarCoder 2 — [https://arxiv.org/abs/2402.19173](https://arxiv.org/abs/2402.19173)

- DeepSeek Coder V2 — [https://arxiv.org/abs/2406.11931](https://arxiv.org/abs/2406.11931)

- Code Llama — [https://arxiv.org/abs/2308.12950](https://arxiv.org/abs/2308.12950)

- MMLU — [https://arxiv.org/abs/2009.03300](https://arxiv.org/abs/2009.03300)

- GSM8K — [https://arxiv.org/abs/2110.14168](https://arxiv.org/abs/2110.14168)

- MATH — [https://arxiv.org/abs/2103.03874](https://arxiv.org/abs/2103.03874)

- HumanEval — [https://arxiv.org/abs/2107.03374](https://arxiv.org/abs/2107.03374)

- GPQA — [https://arxiv.org/abs/2311.12022](https://arxiv.org/abs/2311.12022)

- SWE-Bench — [https://arxiv.org/abs/2310.06770](https://arxiv.org/abs/2310.06770)

- MMMU — [https://arxiv.org/abs/2311.16502](https://arxiv.org/abs/2311.16502)

- LMSYS Chatbot Arena — [https://arxiv.org/abs/2403.04132](https://arxiv.org/abs/2403.04132)

- HyperCLOVA X — [https://arxiv.org/abs/2404.01954](https://arxiv.org/abs/2404.01954)

- KMMLU — [https://arxiv.org/abs/2402.11548](https://arxiv.org/abs/2402.11548)

- Sakana EvoLLM — [https://arxiv.org/abs/2403.13187](https://arxiv.org/abs/2403.13187)

- Swallow — [https://arxiv.org/abs/2404.17790](https://arxiv.org/abs/2404.17790)

- Sleeper Agents — [https://arxiv.org/abs/2401.05566](https://arxiv.org/abs/2401.05566)

- HuggingFace — [https://huggingface.co/](https://huggingface.co/)

- Meta AI Research — [https://ai.meta.com/research/](https://ai.meta.com/research/)

- DeepSeek — [https://www.deepseek.com/](https://www.deepseek.com/)

- Qwen — [https://qwenlm.github.io/](https://qwenlm.github.io/)

- Mistral AI — [https://mistral.ai/news/](https://mistral.ai/news/)

- OpenAI Research — [https://openai.com/research/](https://openai.com/research/)

- Anthropic Research — [https://www.anthropic.com/research](https://www.anthropic.com/research)

- Google DeepMind Research — [https://deepmind.google/research/](https://deepmind.google/research/)

- vLLM — [https://github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)

- SGLang — [https://github.com/sgl-project/sglang](https://github.com/sgl-project/sglang)