LLM Landmark Papers Guide — From Attention to GPT, LLaMA, DeepSeek, o1, and Claude (with References, 2026)
Prologue — You don't have to read every paper, but you need the map
The LLM field produces too many papers. Hundreds appear on arXiv every week, and Twitter, blogs, and newsletters all shout "this one is the game-changer." You can't read them all, and not all of them matter.
But landmarks exist — papers that changed every current that came after. Know them and, when a new paper drops, you can see "this is a follow-on to X." Without them, you start from scratch every time.
This piece organizes the 20-odd landmark LLM papers by era and theme. Each paper gets:
- Why it matters — what was first, what it made possible
- TL;DR — the core idea
- Follow-on impact — what streams it fed into
The goal is not "read them all." It's a map. If you know where a paper sits, you can find it precisely when you need it. All arXiv links are collected at the end.
This is a map of papers (ideas and methods), not a catalog of models (GPT-4, Claude, Gemini, etc. — products). Products turn over in six months; ideas last.
Chapter 1 · Foundations — Before and at the start of the Transformer
Attention is All You Need (Vaswani et al., 2017)
- Why it matters — The origin of every modern LLM. Throws out RNNs and LSTMs and presents the Self-Attention–based Transformer: parallelizable and strong on long sequences.
- TL;DR — "Attention alone is enough for sequence modeling — and it works better."
- Follow-on impact — GPT, BERT, T5, LLaMA, Claude — all descendants of this architecture. Even post-2024 non-Transformer attempts (Mamba, RWKV) define themselves against the Transformer.
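To make the core mechanism concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The toy dimensions and random weights are illustrative only; a real Transformer adds multiple heads, masking, positional information, and stacked layers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a token sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project tokens to queries, keys, values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # pairwise token-to-token relevance
    weights = softmax(scores, axis=-1)         # each token attends over all tokens
    return weights @ V                         # weighted mix of value vectors

# Toy example: 4 tokens, model and head dimension 8 (illustrative sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (4, 8)
```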
BERT (Devlin et al., 2018)
- Why it matters — Introduces the bidirectional encoder + masked LM pretraining paradigm. Popularized the "pretrain then fine-tune" workflow that became standard practice in NLP.
- TL;DR — "A Transformer encoder that sees context from both sides at once."
- Follow-on impact — The standard for classification, retrieval, and embedding models. Ancestor of today's embedding models (text-embedding-3, BGE, Voyage, and so on).
Chapter 2 · Scaling and the GPT lineage
GPT-2 (Radford et al., 2019)
- Why it matters — The discovery that language models are unsupervised multitask learners. Evidence that scaling size and data lets a model perform diverse tasks zero/few-shot, with no per-task fine-tuning.
- TL;DR — "Make it big enough and it does things you never taught it."
- Follow-on impact — The start of the "scaling" paradigm. The road to GPT-3, 4, and 5.
GPT-3 (Brown et al., 2020) — "Language Models are Few-Shot Learners"
- Why it matters — First strong demonstration that in-context learning works. Give the model a few examples in the prompt and it handles a new task without training. 175B parameters.
- TL;DR — "Put examples in the prompt and the model does new tasks."
- Follow-on impact — The field of "prompt engineering" starts here. The direct ancestor of ChatGPT.
Scaling Laws (Kaplan et al., 2020 → Chinchilla, Hoffmann et al., 2022)
- Why it matters — Quantifies how model performance relates to parameter count, data, and compute. Chinchilla showed GPT-3 was actually data-starved, and proposed the optimal model/data ratio.
- TL;DR — "Grow the data as much as you grow the model."
- Follow-on impact — Opens the era of "small but well-fed" efficient models like LLaMA and Mistral.
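As a back-of-the-envelope illustration of the Chinchilla result: a commonly quoted approximation of its fit is roughly 20 training tokens per parameter. The sketch below uses that approximation, not the paper's exact formula.

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Rough compute-optimal token budget; ~20 tokens per parameter is a common
    approximation of the Chinchilla fit, not the paper's exact formula."""
    return n_params * tokens_per_param

for n in (7e9, 70e9, 175e9):
    print(f"{n/1e9:>4.0f}B params -> ~{chinchilla_optimal_tokens(n)/1e12:.1f}T tokens")
# GPT-3's 175B parameters were trained on roughly 0.3T tokens, far below this
# estimate -- the sense in which the text calls it "data-starved."
```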
Chapter 3 · Aligning to human preferences — RLHF and after
InstructGPT / RLHF (Ouyang et al., 2022)
- Why it matters — The recipe for fine-tuning a pretrained LLM on human preferences to produce a "helpful and harmless" assistant. The technical foundation of ChatGPT.
- TL;DR — "SFT, then train a reward model, then optimize the policy with PPO."
- Follow-on impact — Standard training procedure for every conversational LLM. The practical starting point for the field of "alignment."
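The middle step of that recipe, the reward model, is trained on human preference pairs with a simple pairwise objective. A minimal sketch of that loss (a Bradley-Terry style comparison with toy numbers), assuming the reward model outputs one scalar per response:

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise loss for the RLHF reward model: the scalar reward of the
    human-preferred response should exceed that of the rejected one."""
    return -math.log(1 / (1 + math.exp(-(r_chosen - r_rejected))))  # -log sigmoid(diff)

print(reward_model_loss(1.3, 0.2))   # small loss: the model already ranks the pair correctly
print(reward_model_loss(0.2, 1.3))   # large loss: the ranking is the wrong way round
```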
Constitutional AI (Bai et al., 2022) — Anthropic
- Why it matters — Instead of human labels, the AI critiques and revises its own outputs against a set of principles (a constitution). Cuts human-labeling cost and pursues more consistent safety.
- TL;DR — "Replace much of the H (human) in RLHF with AI."
- Follow-on impact — Claude's core training method. The origin of the RLAIF (AI feedback) line of work.
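A rough sketch of the supervised critique-and-revision phase the paper describes; `llm(prompt)` is a hypothetical stand-in for a model call, and the principle strings are illustrative, not Anthropic's actual constitution.

```python
def constitutional_revision(llm, principles, draft):
    """One critique-and-revise pass: the model critiques its own draft against each
    principle, then rewrites it; revised drafts become fine-tuning data."""
    for principle in principles:
        critique = llm(f"Principle: {principle}\nResponse: {draft}\n"
                       "Point out any way the response violates the principle.")
        draft = llm(f"Response: {draft}\nCritique: {critique}\n"
                    "Rewrite the response to address the critique.")
    return draft

# Illustrative principles (not the actual constitution):
principles = ["Avoid helping with clearly harmful requests.",
              "Prefer honest answers over confident-sounding guesses."]
```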
DPO (Rafailov et al., 2023) — Direct Preference Optimization
- Why it matters — Skips PPO and the reward model in RLHF and optimizes the policy directly on preference-pair data. Much simpler and more stable.
- TL;DR — "Alignment from preference data alone, no reward model."
- Follow-on impact — The de facto standard for open-source fine-tuning. Spawned variants like ORPO and KTO.
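The whole trick fits in one loss. Here is a minimal sketch of the DPO objective for a single preference pair, given the summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model (toy numbers, illustrative):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair. Inputs are summed log-probs of the chosen
    and rejected responses under the policy (pi_*) and the reference model (ref_*)."""
    chosen_margin = pi_chosen - ref_chosen        # how far the policy moved toward the chosen answer
    rejected_margin = pi_rejected - ref_rejected  # ... and toward the rejected one
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1 / (1 + math.exp(-logits)))  # -log sigmoid(logits)

print(dpo_loss(-12.0, -15.0, -13.0, -14.5))  # policy already prefers the chosen response
```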
Chapter 4 · Eliciting reasoning — from Chain-of-Thought to o1
Chain-of-Thought Prompting (Wei et al., 2022)
- Why it matters — Shows that prompting with a few worked examples that spell out intermediate reasoning steps dramatically improves a model's reasoning (the one-line zero-shot variant, "Let's think step by step," came in a separate follow-up paper). First strong evidence that a simple prompting technique can unlock new capabilities.
- TL;DR — "Make the model write its reasoning step by step and it solves more."
- Follow-on impact — An explosion of "reasoning-elicitation" methods (Tree-of-Thoughts, Self-Consistency, Reflexion). Eventually leads to reasoning models like o1.
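For concreteness, a minimal few-shot chain-of-thought prompt in this style; the exemplar and question are made up for illustration.

```python
# The exemplar answer spells out its reasoning, which nudges the model to do the
# same before giving its final answer (3 * 12 = 36; 36 - 7 = 29; the answer is 29).
cot_prompt = """\
Q: A cafe sold 23 coffees in the morning and 18 in the afternoon. How many in total?
A: It sold 23 in the morning and 18 in the afternoon. 23 + 18 = 41. The answer is 41.

Q: Tom has 3 boxes of 12 pencils and gives away 7 pencils. How many are left?
A:"""
print(cot_prompt)
```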
Self-Consistency (Wang et al., 2022)
- Why it matters — Sample multiple reasoning paths and decide by majority vote. The natural extension of CoT.
- TL;DR — "Solve it several times, take the most common answer."
- Follow-on impact — An early instance of the "spend more compute at inference time to lift accuracy" (test-time compute) line.
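A minimal sketch of the voting step; `sample_answer` is a hypothetical stand-in for one temperature-sampled chain-of-thought completion reduced to its final answer.

```python
import random
from collections import Counter

def self_consistency(sample_answer, question, n_samples=20):
    """Sample several reasoning paths, keep only their final answers,
    and return the most frequent one with its vote share."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n_samples

# Toy stand-in: a noisy "model" that reaches the right answer most of the time.
random.seed(0)
noisy_model = lambda q: random.choice([17, 17, 17, 19, 21])
print(self_consistency(noisy_model, "What is 8 + 9?"))  # most likely (17, ...)
```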
ReAct (Yao et al., 2022)
- Why it matters — An agent pattern that interleaves Reasoning and Action. The model repeats "think → call a tool → observe → think again."
- TL;DR — "Reasoning and tool use in a single loop."
- Follow-on impact — The default pattern in essentially every AI agent harness.
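A skeleton of that loop; `llm(transcript)` (returning a dict with a thought, an action name, and an action input) and the `tools` dict are hypothetical stand-ins, not any particular framework's API.

```python
def react_loop(llm, tools, task, max_steps=8):
    """ReAct skeleton: alternate a free-form thought, a tool call, and an
    observation until the model emits a 'finish' action with the final answer."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(transcript)                      # {"thought": ..., "action": ..., "input": ...}
        transcript += f"Thought: {step['thought']}\n"
        if step["action"] == "finish":
            return step["input"]                    # final answer
        observation = tools[step["action"]](step["input"])
        transcript += f"Action: {step['action']}[{step['input']}]\nObservation: {observation}\n"
    return None                                     # gave up within the step budget
```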
OpenAI o1 / o3 system cards (2024–2025)
- Why it matters — Reasoning models trained with reinforcement learning to scale up test-time compute. Instead of short answers, they generate long chains of thought and self-verify or self-correct.
- TL;DR — "Let it think longer and it solves harder problems."
- Follow-on impact — Kicks off the reasoning-model race — DeepSeek-R1, Claude's thinking mode, Gemini's Deep Think, and others.
DeepSeek-R1 (DeepSeek-AI, 2025)
- Why it matters — Publicly demonstrates that reasoning can be elicited through pure reinforcement learning with verifiable rewards (RLVR), most starkly in the R1-Zero variant. Released with open weights, accelerating reasoning-model research.
- TL;DR — "Train reasoning without human labels — only with verifiable rewards."
- Follow-on impact — An explosion of open-source reasoning models and replication work. Overturns the assumption that "RL is expensive."
Chapter 5 · Efficiency and open models — the LLaMA era
LLaMA / LLaMA 2 / LLaMA 3 (Touvron et al., 2023–2024) — Meta
- Why it matters — The decisive arrival of high-quality open-weight models. Putting Chinchilla's lesson into practice (small but well-fed with data), it showed that a small model could be strong.
- TL;DR — "Open weights plus a small model fed enough data."
- Follow-on impact — The foundation of the entire open-weight ecosystem — Mistral, Qwen, Gemma, DeepSeek, Yi, and more. The origin point of the fine-tuning industry.
Mixtral 8x7B (Jiang et al., 2024) — Mixture-of-Experts
- Why it matters — Proves that sparse MoE (Sparse Mixture-of-Experts) can work in practice with open weights. Activates only a subset of experts at inference time, cutting cost.
- TL;DR — "Big total parameters, small active parameters."
- Follow-on impact — Nearly every state-of-the-art model has moved toward MoE — DeepSeek-V3, Qwen3-MoE, and (per rumor) GPT-4.
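A minimal sketch of the routing idea for a single token vector; the gate, the number of experts, and the dimensions are illustrative, not Mixtral's exact configuration.

```python
import numpy as np

def moe_layer(x, gate_W, experts, top_k=2):
    """Sparse MoE for one token: score all experts, run only the top-k,
    and mix their outputs with softmax weights over the selected experts."""
    logits = x @ gate_W                                        # one score per expert
    top = np.argsort(logits)[-top_k:]                          # indices of the k best experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over selected experts
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [(lambda W: (lambda v: v @ W))(rng.normal(size=(d, d))) for _ in range(n_experts)]
out = moe_layer(rng.normal(size=d), rng.normal(size=(d, n_experts)), experts)
print(out.shape)   # (16,) -- only 2 of the 8 experts did any work
```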
FlashAttention (Dao et al., 2022) → FlashAttention-2/3
- Why it matters — Rewrites attention to be IO-aware with the GPU memory hierarchy. Faster and more memory-efficient in both training and inference at the same time.
- TL;DR — "Rewrite attention so you get the same result, cheaper."
- Follow-on impact — Effectively the default in every LLM training/inference stack. The springboard for PagedAttention (vLLM), xFormers, and others.
Chapter 6 · Context length, retrieval, and external tools
RAG (Lewis et al., 2020) — Retrieval-Augmented Generation
- Why it matters — Retrieve external knowledge and feed it into the LLM to reduce hallucinations and keep answers current. Gave the retrieve-then-generate paradigm its name.
- TL;DR — "Retrieve first, then answer with that context."
- Follow-on impact — The foundation of nearly every enterprise LLM app. RAG is an industry of its own.
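A minimal sketch of the pipeline; `embed` and `llm` are hypothetical stand-ins for an embedding model and a generator, and the prompt template is illustrative.

```python
import numpy as np

def rag_answer(question, docs, embed, llm, top_k=3):
    """Minimal RAG loop: embed the question, rank documents by cosine similarity,
    and stuff the best matches into the generation prompt."""
    q = embed(question)
    doc_vecs = [embed(d) for d in docs]
    scores = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))) for v in doc_vecs]
    retrieved = [docs[i] for i in np.argsort(scores)[::-1][:top_k]]
    context = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(retrieved))
    prompt = f"Answer using only the context below.\n\n{context}\n\nQuestion: {question}"
    return llm(prompt)
```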
Toolformer (Schick et al., 2023) → Tool/Function Calling
- Why it matters — The LLM teaches itself how to call external tools (APIs, calculators, search). OpenAI's function calling and Anthropic's tool use later productize this line.
- TL;DR — "The model decides for itself: 'Should I use the API?'"
- Follow-on impact — The tool-use paradigm in every AI agent. The line runs all the way to MCP (Model Context Protocol).
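A toy sketch of the execution side of that idea: the model emits inline calls and a harness replaces each with the tool's result. The bracket syntax and the `tools` dict are illustrative, not the paper's exact format or any vendor's function-calling API.

```python
import re

def execute_tool_calls(text, tools):
    """Replace inline calls like [Calculator(3 * 7)] in model output with the
    tool's result; unknown tools are left untouched."""
    def run(match):
        name, arg = match.group(1), match.group(2)
        return str(tools[name](arg)) if name in tools else match.group(0)
    return re.sub(r"\[(\w+)\((.*?)\)\]", run, text)

tools = {"Calculator": lambda expr: eval(expr, {"__builtins__": {}})}  # toy only; don't eval untrusted input
print(execute_tool_calls("3 boxes of 7 cost [Calculator(3 * 7)] dollars.", tools))
# -> 3 boxes of 7 cost 21 dollars.
```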
Lost in the Middle (Liu et al., 2023)
- Why it matters — Empirical evidence that, in long contexts, models use the beginning and end well but drop the middle. Shatters the illusion that "long context equals good context."
- TL;DR — "Models barely read the middle of the context window."
- Follow-on impact — A central citation in context engineering. The motivation behind retrieval, re-ranking, and context-compression research.
Chapter 7 · Multimodal
CLIP (Radford et al., 2021)
- Why it matters — Contrastive learning that puts images and text in the same embedding space. The basis for zero-shot image classification and text-to-image (Stable Diffusion, etc.).
- TL;DR — "Align images and captions in the same vector space."
- Follow-on impact — DALL-E, Stable Diffusion, CLIP-based search, and the encoder behind nearly every VLM.
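A sketch of how the shared space is used for zero-shot classification; `image_encoder` and `text_encoder` are hypothetical stand-ins for CLIP's two towers, and the caption template is the commonly used "a photo of a ..." pattern.

```python
import numpy as np

def clip_zero_shot(image, labels, image_encoder, text_encoder):
    """Zero-shot classification: embed the image and one caption per label into the
    shared space, then pick the caption with the highest cosine similarity."""
    unit = lambda v: v / np.linalg.norm(v)
    img = unit(image_encoder(image))
    captions = [unit(text_encoder(f"a photo of a {label}")) for label in labels]
    sims = [float(img @ c) for c in captions]
    return labels[int(np.argmax(sims))], sims
```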
ViT (Dosovitskiy et al., 2020) — Vision Transformer
- Why it matters — Treats an image as a sequence of patches and shows that the Transformer works in vision too. The first event to shake the CNN monopoly.
- TL;DR — "Slice an image like words and feed it to a Transformer."
- Follow-on impact — DETR, Swin, SAM, LLaVA — all of vision and the VLM world.
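The "slice an image like words" step is just a reshape. A minimal sketch with ViT-Base/16-like sizes (illustrative; the real model then applies a learned linear projection, prepends a class token, and adds position embeddings):

```python
import numpy as np

def patchify(image, patch=16):
    """Turn an H x W x C image into a sequence of flattened patches --
    the 'words' a Vision Transformer consumes."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)

img = np.zeros((224, 224, 3))
print(patchify(img).shape)   # (196, 768): a 14x14 grid of 16x16x3 patches
```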
LLaVA / GPT-4V — Vision-Language Models
- Why it matters — Attach a vision encoder plus a projection to an LLM and you get a practical recipe for multimodal LLMs.
- TL;DR — "Project the vision encoder's output into the LLM's token space."
- Follow-on impact — The standard architecture for multimodal assistants — Claude 3+ Vision, Gemini, Qwen-VL, and others.
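In the simplest version, the "projection" is a single learned linear map from the vision encoder's patch features into the LLM's embedding dimension. A sketch with illustrative sizes (LLaVA later moved to a small MLP, and the real projection is trained, not random):

```python
import numpy as np

rng = np.random.default_rng(0)
vision_features = rng.normal(size=(196, 1024))   # e.g. per-patch features from a CLIP ViT
W_proj = rng.normal(size=(1024, 4096))           # learned projection into the LLM's embedding space

image_tokens = vision_features @ W_proj          # (196, 4096): one pseudo-token per patch
print(image_tokens.shape)                        # these are concatenated with text token embeddings
```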
Chapter 8 · Agents and evaluation
Reflexion (Shinn et al., 2023)
- Why it matters — The agent self-critiques its output and incorporates the critique on the next attempt. Clear gains in coding and reasoning.
- TL;DR — "Fail, reflect, try again."
- Follow-on impact — Nearly every agent harness with a self-correction loop.
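A skeleton of the loop; `llm(prompt)` and `evaluate(attempt)` are hypothetical stand-ins, for example a code model and a unit-test runner.

```python
def reflexion(llm, evaluate, task, max_trials=3):
    """Reflexion skeleton: attempt, check, and on failure ask the model for a short
    written reflection that is prepended to the next attempt."""
    reflections = []                                  # verbal 'memory' carried across trials
    for _ in range(max_trials):
        prompt = "\n".join(reflections) + f"\nTask: {task}"
        attempt = llm(prompt)
        ok, feedback = evaluate(attempt)              # e.g. run the tests
        if ok:
            return attempt
        reflections.append(llm(f"The attempt failed with: {feedback}. "
                               "Write one short lesson to apply next time."))
    return None                                       # out of trials
```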
SWE-bench (Jimenez et al., 2023)
- Why it matters — A benchmark that measures an LLM's ability to resolve real GitHub issues. Evaluation on real code, not toys.
- TL;DR — "Turn the benchmark into GitHub issues."
- Follow-on impact — SWE-bench Verified is effectively the standard metric for coding agents — the comparison axis for Devin, Cursor, Claude Code, and others.
ARC-AGI / ARC-AGI-2 (Chollet, 2019 / 2025)
- Why it matters — A benchmark for abstract reasoning that cannot be solved with data alone. Tests whether an LLM generalizes or merely pattern-matches.
- TL;DR — "A litmus test for abstract reasoning and generalization."
- Follow-on impact — Resurgent in the reasoning-model era. ARC-AGI-2 raised the bar.
Chapter 9 · Safety, interpretability, and alignment
Sleeper Agents (Hubinger et al., 2024) — Anthropic
- Why it matters — Can safety training remove hidden backdoors from a model? The finding: some backdoors survive standard alignment training.
- TL;DR — "Alignment training cannot fully erase a backdoor."
- Follow-on impact — A wake-up call for AI safety research. Highlights the importance of pretraining-data vetting and interpretability.
Mechanistic Interpretability — Toy Models of Superposition (Elhage et al., 2022) and others
- Why it matters — An attempt to understand the inside of a model in terms of circuits. The interpretability line at Anthropic, OpenAI, and others.
- TL;DR — "See what computation is happening inside a neural network, as circuits."
- Follow-on impact — Increasingly recognized as a foundation for safety, debugging, and alignment. More recently, dictionary learning and SAEs (sparse autoencoders) have drawn particular attention.
Chapter 10 · How do you keep up — a practical guide
You don't have to read all 20-odd papers. Here is the strategy I recommend.
Priorities
- Must read: Attention is All You Need, GPT-3, InstructGPT, RAG, ReAct.
- Concept-level is enough: the rest — the summaries above will do.
- Read deeply in your area: coding agents — SWE-bench and Reflexion; vision — ViT, CLIP, and LLaVA; reasoning — o1 and DeepSeek-R1.
A keep-up workflow
- Subscribe to the arXiv daily digest (cs.CL / cs.AI). Skim headlines and read one paper deeply per week.
- Blogs and newsletters: Anthropic Research, OpenAI Blog, DeepMind Blog, Jay Alammar (visualizations), Lilian Weng's Log, Sebastian Raschka, Simon Willison, Latent Space.
- Replications and explainers: popular papers usually get a writeup with code on HuggingFace blog, Eugene Yan, and Simon Willison. Reading the original plus an explainer is the most efficient combo.
- Ask an LLM: drop the PDF into a model and start with "the three core contributions of this paper." Watch for hallucinations — always verify citations against the source.
Epilogue — With a map, you don't get lost
The LLM field moves fast. That's exactly why a map is valuable. When a new paper drops, being able to place it — "this is a CoT follow-on," "this is an MoE variant," "this is in the DPO family" — gets you halfway there.
These 20-odd papers are your coordinate system. You don't have to read them all deeply. You just have to know where each one sits.
A 5-item checklist
- Have you read Attention is All You Need at least once, yourself?
- Can you explain the difference between RLHF and DPO in one sentence?
- Do you have the relationship among CoT, Self-Consistency, and o1 in your head?
- Can you name three landmarks in your own area?
- Are you subscribed to at least one daily digest or curated source?
References
Core papers, blogs, and pages — arXiv links go to the abstract page.
Foundational architecture
- Vaswani et al., "Attention Is All You Need" (2017): https://arxiv.org/abs/1706.03762
- Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers" (2018): https://arxiv.org/abs/1810.04805
Scaling and GPT
- Radford et al., "Language Models are Unsupervised Multitask Learners" (GPT-2, 2019): https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- Brown et al., "Language Models are Few-Shot Learners" (GPT-3, 2020): https://arxiv.org/abs/2005.14165
- Kaplan et al., "Scaling Laws for Neural Language Models" (2020): https://arxiv.org/abs/2001.08361
- Hoffmann et al., "Training Compute-Optimal Large Language Models" (Chinchilla, 2022): https://arxiv.org/abs/2203.15556
Alignment
- Ouyang et al., "Training language models to follow instructions with human feedback" (InstructGPT, 2022): https://arxiv.org/abs/2203.02155
- Bai et al., "Constitutional AI: Harmlessness from AI Feedback" (2022): https://arxiv.org/abs/2212.08073
- Rafailov et al., "Direct Preference Optimization" (DPO, 2023): https://arxiv.org/abs/2305.18290
Reasoning
- Wei et al., "Chain-of-Thought Prompting Elicits Reasoning" (2022): https://arxiv.org/abs/2201.11903
- Wang et al., "Self-Consistency Improves Chain of Thought Reasoning" (2022): https://arxiv.org/abs/2203.11171
- Yao et al., "ReAct: Synergizing Reasoning and Acting" (2022): https://arxiv.org/abs/2210.03629
- OpenAI "Learning to Reason with LLMs" (o1 blog, 2024): https://openai.com/index/learning-to-reason-with-llms/
- DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability via Reinforcement Learning" (2025): https://arxiv.org/abs/2501.12948
Open models and efficiency
- Touvron et al., "LLaMA: Open and Efficient Foundation Language Models" (2023): https://arxiv.org/abs/2302.13971
- Touvron et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models" (2023): https://arxiv.org/abs/2307.09288
- Meta AI, "The Llama 3 Herd of Models" (2024): https://arxiv.org/abs/2407.21783
- Jiang et al., "Mixtral of Experts" (2024): https://arxiv.org/abs/2401.04088
- Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention" (2022): https://arxiv.org/abs/2205.14135
Retrieval, tools, and context
- Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (RAG, 2020): https://arxiv.org/abs/2005.11401
- Schick et al., "Toolformer: Language Models Can Teach Themselves to Use Tools" (2023): https://arxiv.org/abs/2302.04761
- Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (2023): https://arxiv.org/abs/2307.03172
Multimodal
- Radford et al., "Learning Transferable Visual Models From Natural Language Supervision" (CLIP, 2021): https://arxiv.org/abs/2103.00020
- Dosovitskiy et al., "An Image is Worth 16x16 Words" (ViT, 2020): https://arxiv.org/abs/2010.11929
- Liu et al., "Visual Instruction Tuning" (LLaVA, 2023): https://arxiv.org/abs/2304.08485
Agents and evaluation
- Shinn et al., "Reflexion: Language Agents with Verbal Reinforcement Learning" (2023): https://arxiv.org/abs/2303.11366
- Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" (2023): https://arxiv.org/abs/2310.06770
- Chollet, "On the Measure of Intelligence" (ARC, 2019): https://arxiv.org/abs/1911.01547
- Chollet et al., "ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems" (2025): https://arxiv.org/abs/2505.11831
Safety and interpretability
- Hubinger et al., "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" (2024): https://arxiv.org/abs/2401.05566
- Elhage et al., "Toy Models of Superposition" (Anthropic, 2022): https://transformer-circuits.pub/2022/toy_model/index.html
Curation and explainers (recommended regular reads)
- Anthropic Research: https://www.anthropic.com/research
- OpenAI Research: https://openai.com/research
- Lilian Weng's Log: https://lilianweng.github.io/
- Jay Alammar (visual explanations): https://jalammar.github.io/
- Sebastian Raschka, Ahead of AI: https://magazine.sebastianraschka.com/
- Simon Willison, Weblog: https://simonwillison.net/
- Latent Space (Swyx and Alessio): https://www.latent.space/
- The Gradient: https://thegradient.pub/
"What matters more than the latest paper is knowing where that paper sits on the map."
— LLM Landmark Papers Guide, end.