
LLM Landmark Papers Guide — From Attention to GPT, LLaMA, DeepSeek, o1, and Claude (with References, 2026)


Prologue — You don't have to read every paper, but you need the map

The LLM field produces too many papers. Hundreds appear on arXiv every week, and Twitter, blogs, and newsletters all shout "this one is the game-changer." You can't read them all, and not all of them matter.

But landmarks exist — papers that redirected everything that came after them. Know them, and when a new paper drops you can place it: "this is a follow-on to X." Without them, you start from scratch every time.

This piece organizes the 20-odd landmark LLM papers by era and theme. Each paper gets:

  • Why it matters — what was first, what it made possible
  • TL;DR — the core idea
  • Follow-on impact — what streams it fed into

The goal is not "read them all." It's a map. If you know where a paper sits, you can find it precisely when you need it. All arXiv links are collected at the end.

This is a map of papers (ideas and methods), not a catalog of models (GPT-4, Claude, Gemini, etc. — products). Products turn over in six months; ideas last.


Chapter 1 · Foundations — Before and at the start of the Transformer

Attention is All You Need (Vaswani et al., 2017)

  • Why it matters — The origin of every modern LLM. Throws out RNNs and LSTMs and presents the Self-Attention–based Transformer: parallelizable and strong on long sequences.
  • TL;DR — "Attention alone is enough for sequence modeling — and it works better."
  • Follow-on impact — GPT, BERT, T5, LLaMA, Claude — all descendants of this architecture. Even post-2024 non-Transformer attempts (Mamba, RWKV) define themselves against the Transformer.
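
To make the mechanism concrete, here is single-head scaled dot-product attention in NumPy: a minimal sketch of the paper's central equation, softmax(QKᵀ/√d_k)·V, not the batched, multi-head, masked version real models use.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # project tokens to Q, K, V
    scores = q @ k.T / np.sqrt(k.shape[-1])        # pairwise token similarities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ v                             # each token: a mix of all tokens

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                       # 5 tokens, 16-dim embeddings
w_q, w_k, w_v = (rng.normal(size=(16, 16)) * 0.1 for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)      # (5, 16)
```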

BERT (Devlin et al., 2018)

  • Why it matters — Introduces the bidirectional encoder + masked LM pretraining paradigm. Popularized the "pretrain then fine-tune" workflow that became standard practice in NLP.
  • TL;DR — "A Transformer encoder that sees context from both sides at once."
  • Follow-on impact — The standard for classification, retrieval, and embedding models. Ancestor of today's embedding models (text-embedding-3, BGE, Voyage, and so on).

Chapter 2 · Scaling and the GPT lineage

GPT-2 (Radford et al., 2019)

  • Why it matters — The discovery that language models are unsupervised multitask learners. Evidence that scaling size and data lets a model perform diverse tasks zero/few-shot, with no per-task fine-tuning.
  • TL;DR — "Make it big enough and it does things you never taught it."
  • Follow-on impact — The start of the "scaling" paradigm. The road to GPT-3, 4, and 5.

GPT-3 (Brown et al., 2020) — "Language Models are Few-Shot Learners"

  • Why it matters — First strong demonstration that in-context learning works. Give the model a few examples in the prompt and it handles a new task without training. 175B parameters.
  • TL;DR — "Put examples in the prompt and the model does new tasks."
  • Follow-on impact — The field of "prompt engineering" starts here. The direct ancestor of ChatGPT.
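
What "examples in the prompt" means, concretely: the whole task specification is text. The exemplars below are illustrative (the English-to-French format echoes the paper's own demos); nothing is fine-tuned.

```python
# Few-shot in-context learning: the model infers the task from exemplars in
# the prompt alone; no gradient update happens. Illustrative examples only.
prompt = """\
Translate English to French.

English: cheese
French: fromage

English: bread
French: pain

English: water
French:"""
# A GPT-3-class model completes this with " eau": it picked up the task
# from the prompt, which is what made "prompt engineering" a discipline.
```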

Scaling Laws (Kaplan et al., 2020 → Chinchilla, Hoffmann et al., 2022)

  • Why it matters — Quantifies how model performance relates to parameter count, data, and compute. Chinchilla showed GPT-3 was actually data-starved, and proposed the optimal model/data ratio.
  • TL;DR — "Grow the data as much as you grow the model."
  • Follow-on impact — Opens the era of "small but well-fed" efficient models like LLaMA and Mistral.
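
A back-of-the-envelope version of the Chinchilla claim, using two standard approximations: training compute C ≈ 6·N·D FLOPs, and the roughly 20-tokens-per-parameter compute-optimal ratio commonly quoted from Hoffmann et al. The numbers below are illustrative, not taken from either paper's tables.

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20):
    """Given a FLOP budget C and the rule D ≈ 20·N, solve C = 6·N·D for N."""
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

gpt3_budget = 6 * 175e9 * 300e9   # ≈ 3.15e23 FLOPs (175B params, ~300B tokens)
n, d = chinchilla_optimal(gpt3_budget)
print(f"params ≈ {n/1e9:.0f}B, tokens ≈ {d/1e12:.2f}T")
# params ≈ 51B, tokens ≈ 1.02T: the same compute, spent Chinchilla-style,
# buys a smaller model fed roughly 3x more data than GPT-3 actually saw.
```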

Chapter 3 · Aligning to human preferences — RLHF and after

InstructGPT / RLHF (Ouyang et al., 2022)

  • Why it matters — The recipe for fine-tuning a pretrained LLM on human preferences to produce a "helpful and harmless" assistant. The technical foundation of ChatGPT.
  • TL;DR — "SFT, then train a reward model, then optimize the policy with PPO."
  • Follow-on impact — Standard training procedure for every conversational LLM. The practical starting point for the field of "alignment."
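
The middle step is the easiest to show in code. Here is a sketch of the pairwise (Bradley-Terry-style) reward-model loss, assuming you already have scalar rewards for a human-preferred and a rejected response; stage three then runs PPO against this reward with a KL penalty toward the SFT model.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise preference loss: push the reward of the human-preferred
    response above the rejected one. Inputs: scalar rewards, shape (batch,)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Rewards that already rank the pair correctly give a near-zero loss.
print(reward_model_loss(torch.tensor([2.0]), torch.tensor([-1.0])))  # ≈ 0.049
```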

Constitutional AI (Bai et al., 2022) — Anthropic

  • Why it matters — Instead of human labels, the AI critiques and revises its own outputs against a set of principles (a constitution). Cuts human-labeling cost and pursues more consistent safety.
  • TL;DR — "Replace much of the H (human) in RLHF with AI."
  • Follow-on impact — Claude's core training method. The origin of the RLAIF (AI feedback) line of work.
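
A sketch of the self-revision (supervised) phase, with a hypothetical `llm` stub and paraphrased principles; Anthropic's actual constitution and prompts differ, and the later RLAIF phase (AI-generated preference labels) is not shown.

```python
# Paraphrased, illustrative principles; not Anthropic's actual constitution.
CONSTITUTION = ["Prefer the response least likely to assist with harm.",
                "Prefer the response most honest about its uncertainty."]

def self_revise(llm, prompt, draft):
    """Critique-then-revise loop: the revised outputs become SFT data."""
    for principle in CONSTITUTION:
        critique = llm(f"Critique this reply under the principle: {principle}\n"
                       f"Prompt: {prompt}\nReply: {draft}")
        draft = llm(f"Rewrite the reply to address this critique.\n"
                    f"Critique: {critique}\nReply: {draft}")
    return draft

script = iter(["Critique: gives operational detail.", "Safer reply v1.",
               "Critique: hedges appropriately.", "Safer reply v2."])
print(self_revise(lambda p: next(script), "a risky request", "unsafe draft"))
```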

DPO (Rafailov et al., 2023) — Direct Preference Optimization

  • Why it matters — Skips PPO and the reward model in RLHF and optimizes the policy directly on preference-pair data. Much simpler and more stable.
  • TL;DR — "Alignment from preference data alone, no reward model."
  • Follow-on impact — The de facto standard for open-source fine-tuning. Spawned variants like ORPO and KTO.
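
The whole method fits in a few lines, which is much of its appeal. A sketch, assuming you have summed per-response log-probs from the policy and from a frozen reference (SFT) model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: the policy's log-prob ratios against a frozen reference act as an
    implicit reward, so no separate reward model or PPO loop is needed.
    Each input: summed log-prob of a whole response, shape (batch,)."""
    implicit_chosen = pi_chosen - ref_chosen
    implicit_rejected = pi_rejected - ref_rejected
    return -F.logsigmoid(beta * (implicit_chosen - implicit_rejected)).mean()

print(dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
               torch.tensor([-11.0]), torch.tensor([-11.0])))  # ≈ 0.598
```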

Chapter 4 · Eliciting reasoning — from Chain-of-Thought to o1

Chain-of-Thought Prompting (Wei et al., 2022)

  • Why it matters — Shows that prompting a model to write out intermediate reasoning steps dramatically improves its problem solving; Wei et al. did it with worked exemplars, and the zero-shot follow-up (Kojima et al., 2022) reduced the trick to the single line "Let's think step by step." First strong evidence that a simple prompting change can unlock new capabilities.
  • TL;DR — "Make the model write its reasoning step by step and it solves more."
  • Follow-on impact — An explosion of "reasoning-elicitation" methods (Tree-of-Thoughts, Self-Consistency, Reflexion). Eventually leads to reasoning models like o1.
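
The intervention really is that small. The zero-shot variant, with an illustrative question in the style of the papers' arithmetic problems:

```python
question = ("A juggler has 16 balls. Half are golf balls, and half of the "
            "golf balls are blue. How many balls are blue golf balls?")

plain_prompt = f"Q: {question}\nA:"
cot_prompt = f"Q: {question}\nA: Let's think step by step."
# With the suffix, the model first writes "16 / 2 = 8 golf balls; 8 / 2 = 4
# blue" and only then answers, and accuracy on such problems jumps. Wei et
# al. got the same effect with worked step-by-step exemplars in the prompt.
```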

Self-Consistency (Wang et al., 2022)

  • Why it matters — Sample multiple reasoning paths and decide by majority vote. The natural extension of CoT.
  • TL;DR — "Solve it several times, take the most common answer."
  • Follow-on impact — An early instance of the "spend more compute at inference time to lift accuracy" (test-time compute) line.
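
The method is a ten-line wrapper around any CoT-prompted model. A sketch with a hypothetical `ask_model` stand-in:

```python
import collections
import random

def self_consistency(ask_model, question, k=10):
    """Sample k chain-of-thought answers (temperature > 0) and keep the most
    common final answer. `ask_model` stands in for a real LLM call."""
    answers = [ask_model(question) for _ in range(k)]
    return collections.Counter(answers).most_common(1)[0][0]

# Toy stand-in: a noisy solver that is right only 60% of the time per sample.
random.seed(0)
noisy = lambda q: random.choices(["42", "41", "40"], weights=[6, 2, 2])[0]
print(self_consistency(noisy, "6 * 7 = ?"))  # the vote almost surely yields "42"
```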

ReAct (Yao et al., 2022)

  • Why it matters — An agent pattern that interleaves Reasoning and Action. The model repeats "think → call a tool → observe → think again."
  • TL;DR — "Reasoning and tool use in a single loop."
  • Follow-on impact — The default pattern in essentially every AI agent harness.
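
A minimal version of the loop with a toy calculator tool. The `llm` argument and the `Action: tool[input]` line format are stand-ins; real harnesses parse Thought/Action/Observation turns out of the model's text.

```python
def calculator(expr: str) -> str:
    return str(eval(expr, {"__builtins__": {}}))   # toy tool: arithmetic only

TOOLS = {"calculator": calculator}

def react(llm, question, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)                      # model emits one turn
        transcript += step + "\n"
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        if step.startswith("Action:"):              # e.g. "Action: calculator[37*12]"
            name, arg = step[len("Action:"):].strip().rstrip("]").split("[", 1)
            transcript += f"Observation: {TOOLS[name.strip()](arg)}\n"
    return None

script = iter(["Thought: I should compute this.", "Action: calculator[37*12]",
               "Final Answer: 444"])
print(react(lambda t: next(script), "What is 37 * 12?"))  # 444
```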

OpenAI o1 / o3 system cards (2024–2025)

  • Why it matters — Reasoning models built by scaling test-time compute via reinforcement learning. Instead of short answers, they generate long chains of thought and self-verify or self-correct.
  • TL;DR — "Let it think longer and it solves harder problems."
  • Follow-on impact — Kicks off the reasoning-model race — DeepSeek-R1, Claude's thinking mode, Gemini's Deep Think, and others.

DeepSeek-R1 (DeepSeek-AI, 2025)

  • Why it matters — Publicly demonstrates that reasoning can be elicited via pure reinforcement learning (RLVR — Reinforcement Learning from Verifiable Rewards). Released with open weights, accelerating reasoning-model research.
  • TL;DR — "Train reasoning without human labels — only with verifiable rewards."
  • Follow-on impact — An explosion of open-source reasoning models and replication work. Overturns the assumption that "RL is expensive."
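
The key ingredient is that the reward is computed, not learned. A sketch of a verifiable reward for math problems; the `####` answer delimiter is borrowed from the GSM8K convention for illustration, and real R1-style training also rewards output format, which is omitted here.

```python
def verifiable_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the final answer after '####' matches the known solution, else 0.
    No reward model, no human labeler: anything checkable can be a reward."""
    predicted = completion.split("####")[-1].strip()
    return 1.0 if predicted == gold_answer else 0.0

print(verifiable_reward("16 / 2 = 8, and 8 / 2 = 4 #### 4", "4"))  # 1.0
print(verifiable_reward("I think it is 5 #### 5", "4"))            # 0.0
```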

Chapter 5 · Efficiency and open models — the LLaMA era

LLaMA / LLaMA 2 / LLaMA 3 (Touvron et al., 2023–2024) — Meta

  • Why it matters — The decisive arrival of high-quality open-weight models. Putting Chinchilla's lesson into practice (small but well-fed with data), it showed that a small model could be strong.
  • TL;DR — "Open weights plus a small model fed enough data."
  • Follow-on impact — The foundation of the entire open-weight ecosystem — Mistral, Qwen, Gemma, DeepSeek, Yi, and more. The origin point of the fine-tuning industry.

Mixtral 8x7B (Jiang et al., 2024) — Mixture-of-Experts

  • Why it matters — Proves that a sparse Mixture-of-Experts (MoE) model can work in practice with open weights. Activates only a subset of experts per token at inference time, cutting cost.
  • TL;DR — "Big total parameters, small active parameters."
  • Follow-on impact — Nearly every state-of-the-art model has moved toward MoE — DeepSeek-V3, Qwen3-MoE, and (per rumor) GPT-4.
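
A naive sketch of the routing idea (Mixtral routes each token to 2 of 8 experts and renormalizes the gate over the chosen two); real implementations batch tokens per expert instead of looping.

```python
import torch
import torch.nn.functional as F

def moe_layer(x, router, experts, k=2):
    """Sparse top-k routing: every token picks its k best experts and the
    other experts do no work. x: (tokens, d)."""
    logits = router(x)                             # (tokens, n_experts)
    weights, idx = torch.topk(logits, k, dim=-1)   # keep the top-k per token
    weights = F.softmax(weights, dim=-1)           # renormalize over chosen k
    out = torch.zeros_like(x)
    for token in range(x.shape[0]):                # naive loop, for clarity
        for j in range(k):
            out[token] += weights[token, j] * experts[idx[token, j]](x[token])
    return out

d, n_experts = 16, 8
router = torch.nn.Linear(d, n_experts)
experts = [torch.nn.Linear(d, d) for _ in range(n_experts)]
print(moe_layer(torch.randn(4, d), router, experts).shape)  # torch.Size([4, 16])
```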

FlashAttention (Dao et al., 2022) → FlashAttention-2/3

  • Why it matters — Rewrites attention to be IO-aware with respect to the GPU memory hierarchy. Faster and more memory-efficient in both training and inference.
  • TL;DR — "Rewrite attention so you get the same result, cheaper."
  • Follow-on impact — Effectively the default in every LLM training/inference stack. The springboard for PagedAttention (vLLM), xFormers, and others.
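
The mathematical trick that makes this possible is the online softmax: you can stream over keys in blocks, keeping only running statistics, and still get the exact answer. A NumPy sketch for a single query; the paper's contribution is doing this inside one fused, SRAM-resident GPU kernel, which no NumPy code can show.

```python
import numpy as np

def flash_style_attention(q, K, V, block=128):
    """Exact attention for one query without materializing all scores:
    keep a running max m, denominator l, and unnormalized output acc."""
    m, l = -np.inf, 0.0
    acc = np.zeros(V.shape[1])
    for i in range(0, K.shape[0], block):
        s = K[i:i+block] @ q / np.sqrt(q.shape[0])   # this block's logits
        m_new = max(m, s.max())
        alpha = np.exp(m - m_new)                    # rescale old stats
        p = np.exp(s - m_new)
        l = l * alpha + p.sum()
        acc = acc * alpha + p @ V[i:i+block]
        m = m_new
    return acc / l

rng = np.random.default_rng(1)
q, K, V = rng.normal(size=(64,)), rng.normal(size=(1000, 64)), rng.normal(size=(1000, 64))
ref = np.exp(K @ q / 8.0); ref /= ref.sum()          # ordinary softmax attention
print(np.allclose(flash_style_attention(q, K, V), ref @ V))  # True
```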

Chapter 6 · Context length, retrieval, and external tools

RAG (Lewis et al., 2020) — Retrieval-Augmented Generation

  • Why it matters — Inject retrieved external knowledge into the LLM to reduce hallucinations and add freshness. Gave the retrieval + generation paradigm its name.
  • TL;DR — "Retrieve first, then answer with that context."
  • Follow-on impact — The foundation of nearly every enterprise LLM app. RAG is an industry of its own.
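
The skeleton of nearly every RAG system is a few lines around two black boxes. In this sketch, `embed` and `generate` are hypothetical stand-ins for a real embedding model and LLM (here: a toy word-count embedding and an echoing stub):

```python
import numpy as np

docs = ["Paris is the capital of France.",
        "The Transformer was introduced in 2017.",
        "Chinchilla suggests about 20 tokens per parameter."]

def rag_answer(question, embed, generate, top_k=2):
    """Retrieve the most similar docs, then generate grounded in them."""
    doc_vecs = np.array([embed(d) for d in docs])
    q = embed(question)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    context = "\n".join(docs[i] for i in np.argsort(-sims)[:top_k])
    return generate(f"Answer using only this context:\n{context}\n\nQ: {question}")

tokenize = lambda s: s.lower().replace("?", " ").replace(".", " ").split()
VOCAB = sorted({w for d in docs for w in tokenize(d)})
toy_embed = lambda s: np.array([tokenize(s).count(w) for w in VOCAB], float)
toy_generate = lambda p: "grounded in: " + p.splitlines()[1]  # echo the top doc
print(rag_answer("When was the Transformer introduced?", toy_embed, toy_generate))
# grounded in: The Transformer was introduced in 2017.
```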

Toolformer (Schick et al., 2023) → Tool/Function Calling

  • Why it matters — The LLM teaches itself how to call external tools (APIs, calculators, search). OpenAI's function calling and Anthropic's tool use later productize this line.
  • TL;DR — "The model decides for itself: 'Should I use the API?'"
  • Follow-on impact — The tool-use paradigm in every AI agent. The line runs all the way to MCP (Model Context Protocol).
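
What that productization looks like, mechanically: the model emits a structured call, the harness executes it, and the result goes back into the conversation. The JSON shapes and the `llm` stub below are illustrative, not any vendor's exact schema.

```python
import json

TOOLS = {"get_weather": lambda city: {"city": city, "temp_c": 21}}

def run(llm, user_msg, max_turns=4):
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_turns):
        reply = llm(messages)
        if "tool_call" not in reply:            # model answered directly
            return reply["content"]
        call = reply["tool_call"]               # e.g. {"name": ..., "args": {...}}
        result = TOOLS[call["name"]](**call["args"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    return None

script = iter([{"tool_call": {"name": "get_weather", "args": {"city": "Seoul"}}},
               {"content": "It's 21°C in Seoul."}])
print(run(lambda m: next(script), "Weather in Seoul?"))
```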

Lost in the Middle (Liu et al., 2023)

  • Why it matters — Empirical evidence that, in long contexts, models use the beginning and end well but drop the middle. Shatters the illusion that "long context equals good context."
  • TL;DR — "Models barely read the middle of the context window."
  • Follow-on impact — A central citation in context engineering. The motivation behind retrieval, re-ranking, and context-compression research.

Chapter 7 · Multimodal

CLIP (Radford et al., 2021)

  • Why it matters — Contrastive learning that puts images and text in the same embedding space. The basis for zero-shot image classification and text-to-image (Stable Diffusion, etc.).
  • TL;DR — "Align images and captions in the same vector space."
  • Follow-on impact — DALL-E, Stable Diffusion, CLIP-based search, and the encoder behind nearly every VLM.
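
The training objective in a dozen lines: a symmetric cross-entropy over the batch's image-text similarity matrix, where each caption's own image is the positive and every other image in the batch is a negative. The temperature is fixed here for simplicity; CLIP learns it.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matching (image, caption) pairs sit on the diagonal
    of the similarity matrix; everything off-diagonal is a negative."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature           # (batch, batch) cosine sims
    labels = torch.arange(len(logits))           # i-th image <-> i-th caption
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

print(clip_loss(torch.randn(8, 512), torch.randn(8, 512)))  # a scalar loss
```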

ViT (Dosovitskiy et al., 2020) — Vision Transformer

  • Why it matters — Treats an image as a sequence of patches and shows that the Transformer works in vision too. The first event to shake the CNN monopoly.
  • TL;DR — "Slice an image like words and feed it to a Transformer."
  • Follow-on impact — DETR, Swin, SAM, LLaVA — all of vision and the VLM world.
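
The "slice it like words" step, concretely, for a standard 224x224 input with 16x16 patches (the learned linear projection and position embeddings that follow are omitted):

```python
import torch

def patchify(images, patch=16):
    """Cut (B, C, H, W) images into a sequence of flattened patches: the
    Transformer's 'tokens' in ViT."""
    b, c, h, w = images.shape
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)  # (B,C,H/p,W/p,p,p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch * patch)
    return x  # (B, num_patches, patch_dim)

print(patchify(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 196, 768])
```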

LLaVA / GPT-4V — Vision-Language Models

  • Why it matters — Attach a vision encoder plus a projection to an LLM and you get a practical recipe for multimodal LLMs.
  • TL;DR — "Project the vision encoder's output into the LLM's token space."
  • Follow-on impact — The standard architecture for multimodal assistants — Claude 3+ Vision, Gemini, Qwen-VL, and others.
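
The recipe is small enough to show whole. Dimensions below are illustrative but typical (CLIP ViT-L features are 1024-d; a 7B LLM's hidden size is 4096); LLaVA v1 used a single linear projector, v1.5 a small MLP.

```python
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096
projector = nn.Linear(vision_dim, llm_dim)       # the newly trained glue module

image_feats = torch.randn(1, 576, vision_dim)    # frozen vision-encoder output
text_embeds = torch.randn(1, 32, llm_dim)        # embedded prompt tokens

image_tokens = projector(image_feats)            # now in the LLM's token space
llm_input = torch.cat([image_tokens, text_embeds], dim=1)
print(llm_input.shape)                           # torch.Size([1, 608, 4096])
```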

Chapter 8 · Agents and evaluation

Reflexion (Shinn et al., 2023)

  • Why it matters — The agent self-critiques its output and incorporates the critique on the next attempt. Clear gains in coding and reasoning.
  • TL;DR — "Fail, reflect, try again."
  • Follow-on impact — Nearly every agent harness with a self-correction loop.
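
A minimal loop, with hypothetical `llm` and `check` stubs; `check` is whatever external signal you have (unit tests, a verifier), and the "learning" is just text accumulated in memory, no weight updates.

```python
def reflexion(llm, check, task, max_trials=3):
    memory = []                                   # accumulated self-reflections
    for _ in range(max_trials):
        attempt = llm(task, reflections=memory)
        ok, feedback = check(attempt)             # e.g. run the test suite
        if ok:
            return attempt
        memory.append(llm(f"Why did this fail?\n{attempt}\n{feedback}",
                          reflections=memory))    # reflect, then retry
    return None

script = iter(["def add(a, b): return a - b",      # first try: buggy
               "I subtracted instead of adding.",  # the self-reflection
               "def add(a, b): return a + b"])     # retry with memory: fixed
check = lambda code: (("a + b" in code), "test_add failed: add(2, 3) != 5")
print(reflexion(lambda *a, **k: next(script), check, "Write add(a, b)"))
```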

SWE-bench (Jimenez et al., 2023)

  • Why it matters — A benchmark that measures an LLM's ability to resolve real GitHub issues. Evaluation on real code, not toys.
  • TL;DR — "Turn the benchmark into GitHub issues."
  • Follow-on impact — SWE-bench Verified is effectively the standard metric for coding agents — the comparison axis for Devin, Cursor, Claude Code, and others.

ARC-AGI / ARC-AGI-2 (Chollet, 2019 / 2025)

  • Why it matters — A benchmark for abstract reasoning that cannot be solved with data alone. Tests whether an LLM generalizes or merely pattern-matches.
  • TL;DR — "A litmus test for abstract reasoning and generalization."
  • Follow-on impact — Resurgent in the reasoning-model era. ARC-AGI-2 raised the bar.

Chapter 9 · Safety, interpretability, and alignment

Sleeper Agents (Hubinger et al., 2024) — Anthropic

  • Why it matters — Can safety training remove hidden backdoors from a model? The finding: some backdoors survive standard alignment training.
  • TL;DR — "Alignment training cannot fully erase a backdoor."
  • Follow-on impact — A wake-up call for AI safety research. Highlights the importance of pretraining-data vetting and interpretability.

Mechanistic Interpretability — Toy Models of Superposition (Elhage et al., 2022) and others

  • Why it matters — An attempt to understand the inside of a model in terms of circuits. The interpretability line at Anthropic, OpenAI, and others.
  • TL;DR — "See what computation is happening inside a neural network, as circuits."
  • Follow-on impact — Gradually recognized as a foundation for safety, debugging, and alignment. In recent years, dictionary learning and sparse autoencoders (SAEs) have drawn the most attention.
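
For a sense of what the SAE line of work trains, here is a minimal sparse autoencoder: decompose captured activations into an overcomplete, mostly-zero feature basis, trading reconstruction error against an L1 sparsity penalty. Sizes and coefficients are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Decompose residual-stream activations into many sparse features,
    each hopefully a single interpretable concept."""
    def __init__(self, d_model=512, d_features=4096):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)
        self.dec = nn.Linear(d_features, d_model)

    def forward(self, acts):
        features = F.relu(self.enc(acts))        # sparse, overcomplete code
        return self.dec(features), features

sae = SparseAutoencoder()
acts = torch.randn(64, 512)                      # captured model activations
recon, feats = sae(acts)
loss = F.mse_loss(recon, acts) + 1e-3 * feats.abs().mean()  # recon + L1 sparsity
print(loss.item() > 0)
```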

Chapter 10 · How do you keep up — a practical guide

You don't have to read all 20. Here is the strategy I recommend.

Priorities

  1. Must read: Attention is All You Need, GPT-3, InstructGPT, RAG, ReAct.
  2. Concept-level is enough for the rest — the summaries above will do.
  3. Read deeply in your area: coding agents — SWE-bench and Reflexion; vision — ViT, CLIP, and LLaVA; reasoning — o1 and DeepSeek-R1.

A keep-up workflow

  • Subscribe to the arXiv daily digest (cs.CL / cs.AI). Skim headlines and read one paper deeply per week.
  • Blogs and newsletters: Anthropic Research, OpenAI Blog, DeepMind Blog, Jay Alammar (visualizations), Lilian Weng's Log, Sebastian Raschka, Simon Willison, Latent Space.
  • Replications and explainers: popular papers usually get a writeup with code on HuggingFace blog, Eugene Yan, and Simon Willison. Reading the original plus an explainer is the most efficient combo.
  • Ask an LLM: drop the PDF into a model and start with "the three core contributions of this paper." Watch for hallucinations — always verify citations against the source.

Epilogue — With a map, you don't get lost

The LLM field moves fast. That's exactly why a map is valuable. When a new paper drops, being able to place it — "this is a CoT follow-on," "this is an MoE variant," "this is in the DPO family" — gets you halfway there.

These 20 papers are your coordinate system. You don't have to read them all deeply. You just have to know where each one sits.

A 5-item checklist

  1. Have you read Attention is All You Need at least once, yourself?
  2. Can you explain the difference between RLHF and DPO in one sentence?
  3. Do you have the relationship among CoT, Self-Consistency, and o1 in your head?
  4. Can you name three landmarks in your own area?
  5. Are you subscribed to at least one daily digest or curated source?

References

Core papers, blogs, and pages — arXiv links go to the abstract page.

Foundational architecture

  • Vaswani et al., 2017 — Attention Is All You Need — arxiv.org/abs/1706.03762
  • Devlin et al., 2018 — BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding — arxiv.org/abs/1810.04805

Scaling and GPT

  • Radford et al., 2019 — Language Models are Unsupervised Multitask Learners (GPT-2, OpenAI report)
  • Brown et al., 2020 — Language Models are Few-Shot Learners (GPT-3) — arxiv.org/abs/2005.14165
  • Kaplan et al., 2020 — Scaling Laws for Neural Language Models — arxiv.org/abs/2001.08361
  • Hoffmann et al., 2022 — Training Compute-Optimal Large Language Models (Chinchilla) — arxiv.org/abs/2203.15556

Alignment

  • Ouyang et al., 2022 — Training Language Models to Follow Instructions with Human Feedback (InstructGPT) — arxiv.org/abs/2203.02155
  • Bai et al., 2022 — Constitutional AI: Harmlessness from AI Feedback — arxiv.org/abs/2212.08073
  • Rafailov et al., 2023 — Direct Preference Optimization: Your Language Model is Secretly a Reward Model — arxiv.org/abs/2305.18290

Reasoning

  • Wei et al., 2022 — Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — arxiv.org/abs/2201.11903
  • Kojima et al., 2022 — Large Language Models are Zero-Shot Reasoners — arxiv.org/abs/2205.11916
  • Wang et al., 2022 — Self-Consistency Improves Chain of Thought Reasoning in Language Models — arxiv.org/abs/2203.11171
  • Yao et al., 2022 — ReAct: Synergizing Reasoning and Acting in Language Models — arxiv.org/abs/2210.03629
  • OpenAI, 2024 — OpenAI o1 System Card — openai.com
  • DeepSeek-AI, 2025 — DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — arxiv.org/abs/2501.12948

Open models and efficiency

  • Touvron et al., 2023 — LLaMA: Open and Efficient Foundation Language Models — arxiv.org/abs/2302.13971
  • Touvron et al., 2023 — Llama 2: Open Foundation and Fine-Tuned Chat Models — arxiv.org/abs/2307.09288
  • Grattafiori et al., 2024 — The Llama 3 Herd of Models — arxiv.org/abs/2407.21783
  • Jiang et al., 2024 — Mixtral of Experts — arxiv.org/abs/2401.04088
  • Dao et al., 2022 — FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — arxiv.org/abs/2205.14135
  • Dao, 2023 — FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning — arxiv.org/abs/2307.08691

Retrieval, tools, and context

  • Lewis et al., 2020 — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — arxiv.org/abs/2005.11401
  • Schick et al., 2023 — Toolformer: Language Models Can Teach Themselves to Use Tools — arxiv.org/abs/2302.04761
  • Liu et al., 2023 — Lost in the Middle: How Language Models Use Long Contexts — arxiv.org/abs/2307.03172

Multimodal

  • Radford et al., 2021 — Learning Transferable Visual Models From Natural Language Supervision (CLIP) — arxiv.org/abs/2103.00020
  • Dosovitskiy et al., 2020 — An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT) — arxiv.org/abs/2010.11929
  • Liu et al., 2023 — Visual Instruction Tuning (LLaVA) — arxiv.org/abs/2304.08485

Agents and evaluation

  • Shinn et al., 2023 — Reflexion: Language Agents with Verbal Reinforcement Learning — arxiv.org/abs/2303.11366
  • Jimenez et al., 2023 — SWE-bench: Can Language Models Resolve Real-World GitHub Issues? — arxiv.org/abs/2310.06770
  • Chollet, 2019 — On the Measure of Intelligence (ARC) — arxiv.org/abs/1911.01547

Safety and interpretability

  • Hubinger et al., 2024 — Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training — arxiv.org/abs/2401.05566
  • Elhage et al., 2022 — Toy Models of Superposition — arxiv.org/abs/2209.10652

"What matters more than the latest paper is knowing where that paper sits on the map."

— LLM Landmark Papers Guide, end.
