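Complete Analysis of the GPT Series Papers: The Journey from GPT-1 to GPT-4, How Language Models Changed the World

1. GPT Series Overview and Timeline
2. GPT-1 (2018): The Beginning of Generative Pre-Training
3. GPT-2 (2019): The Possibility of Zero-shot Learning
4. GPT-3 (2020): The Power of In-context Learning and Scaling
5. InstructGPT / ChatGPT (2022): Aligning with Human Intent
6. GPT-4 (2023): Multimodal and Predictable Scaling
7. In-depth Analysis of Scaling Laws
8. Overall Architecture Comparison
- 8.1 Generation-by-Generation Architecture Comparison Table
- 8.2 Evolution of Paradigms
9. GPT's Impact: Transformation of the AI Ecosystem
10. Limitations and Criticisms
11. Summary: The Legacy of GPT
12. References
Related Series and Recommended Posts
- GitHub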

1. GPT Series Overview and Timeline

GPT (Generative Pre-trained Transformer) is a series of Large Language Models (LLMs) published by OpenAI since 2018. True to its name, it established the paradigm of running unsupervised pre-training on large-scale text with a Transformer Decoder architecture and then applying the result to a variety of downstream tasks.

The GPT series did not simply grow in model size: each generation redefined how language models are used. In chronological order:

Generation	Release	Paper Title	Key Keywords	Parameters
GPT-1	2018.06	Improving Language Understanding by Generative Pre-Training	Unsupervised Pre-training + Supervised Fine-tuning	117M
GPT-2	2019.02	Language Models are Unsupervised Multitask Learners	Zero-shot Transfer, WebText	1.5B
GPT-3	2020.05	Language Models are Few-Shot Learners	In-context Learning, Scaling Laws	175B
InstructGPT	2022.03	Training Language Models to Follow Instructions with Human Feedback	RLHF, Human Alignment	1.3B~175B
GPT-4	2023.03	GPT-4 Technical Report	Multimodal, Predictable Scaling	Undisclosed

It is noteworthy that each paper's title carries its core message. GPT-1 declared that generative pre-training improves language understanding, GPT-2 claimed that language models are unsupervised multitask learners, and GPT-3 went a step further by calling them few-shot learners. InstructGPT set a practical direction, training models to follow instructions with human feedback, and GPT-4 was published simply as a "technical report," hinting at the shift toward a commercial product.

This article analyzes each paper's key contributions, architecture details, training methodology, and impact on subsequent research, together with the relevant equations.

2. GPT-1 (2018): The Beginning of Generative Pre-Training

2.1 Paper Overview

Paper: "Improving Language Understanding by Generative Pre-Training"
Authors: Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever (OpenAI)
Released: June 2018

The core idea of GPT-1 is surprisingly simple. Pre-train a language model on large-scale unlabeled text, then fine-tune it on a specific task with a small amount of labeled data. This two-stage approach (which the paper calls semi-supervised learning) transformed the NLP landscape at the time.

In 2018, NLP was dominated by task-specific architectures. The standard practice was to design a separate model for each task, such as sentiment analysis, question answering, or textual entailment, and to train it only on that task's labeled data. GPT-1 offered an alternative to this paradigm: general-purpose pre-training.

2.2 Architecture Details

GPT-1 uses only the Decoder blocks of the Transformer. While the original Transformer (Vaswani et al., 2017) had an Encoder-Decoder structure, GPT-1 chose a Decoder-only structure, which suits auto-regressive language modeling.

Model Configuration:

Number of Layers: 12 Transformer Decoder blocks
Hidden Dimension: 768
Number of Attention Heads: 12 (64 dimensions each)
Feed-Forward Dimension: 3,072 ( $= 768 \times 4$ )
Context Window: 512 tokens
Total Parameters: approximately 117M (117 million)
Activation Function: GELU (Gaussian Error Linear Unit)
Positional Encoding: Learned Positional Embedding

Instead of the fixed Sinusoidal Positional Encoding used in the original Transformer, GPT-1 adopted learned positional embeddings, so the model learns positional information directly from data rather than from a hand-designed function.
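As a sanity check on the configuration above, the parameter count can be roughly reconstructed from the listed dimensions. The sketch below is only an approximation: it assumes biases on every linear layer, two LayerNorms per block, tied input/output embeddings, and a vocabulary of about 40,000 entries; `gpt_param_count` is a hypothetical helper, not code from the paper.

```python
def gpt_param_count(n_layer, d_model, vocab_size, n_ctx):
    """Approximate parameter count of a decoder-only Transformer with tied embeddings."""
    d_ff = 4 * d_model
    # Attention: Q, K, V and output projections, each d_model x d_model plus a bias.
    attn = 4 * (d_model * d_model + d_model)
    # Feed-forward: d_model -> d_ff -> d_model, with biases.
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms per block, each with a gain and a bias vector.
    norms = 2 * 2 * d_model
    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * d_model + n_ctx * d_model
    return n_layer * (attn + ffn + norms) + embeddings

gpt1 = gpt_param_count(n_layer=12, d_model=768, vocab_size=40_000, n_ctx=512)
print(f"{gpt1 / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands at roughly 116M, close to the reported 117M; about 85M of that sits in the 12 Transformer blocks and the rest in the embeddings.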

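2.3 Stage 1: Unsupervised Pre-training

In the pre-training stage, the standard language modeling objective is optimized over a large-scale unlabeled text corpus $\mathcal{U} = \{u_1, u_2, ..., u_n\}$ .

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, ..., u_{i-1}; \Theta)$$

Here $k$ is the context window size and $\Theta$ denotes the model parameters. In other words, this is standard auto-regressive language modeling: maximize the probability of the next token given the previous $k$ tokens.

Concretely, the representation of each token is computed as follows.

$$h_0 = UW_e + W_p$$

$$h_l = \text{transformer\_block}(h_{l-1}), \quad l \in [1, n]$$

$$P(u) = \text{softmax}(h_n W_e^T)$$

Here $U = (u_{-k}, ..., u_{-1})$ is the vector of context tokens, $W_e$ is the token embedding matrix, $W_p$ is the position embedding matrix, and $n$ in these equations is the number of layers (not the corpus size). The output probabilities reuse the token embedding matrix $W_e$ (weight tying).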
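The three equations above can be traced in a few lines of plain Python. This is a toy sketch, not the paper's implementation: the Transformer blocks are replaced by an optional list of stand-in functions (identity when empty), and the embedding values are made up purely for illustration.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def next_token_probs(context, W_e, W_p, blocks=()):
    """P(u) = softmax(h_n W_e^T), evaluated at the last position of `context`."""
    # h_0: token embedding plus position embedding for each context position.
    h = [[e + p for e, p in zip(W_e[tok], W_p[pos])] for pos, tok in enumerate(context)]
    for block in blocks:  # stand-ins for transformer_block; identity if empty
        h = block(h)
    last = h[-1]
    # Weight tying: the output projection reuses the embedding matrix W_e.
    logits = [sum(a * b for a, b in zip(last, row)) for row in W_e]
    return softmax(logits)

def log_likelihood(tokens, k, W_e, W_p):
    """L_1 = sum_i log P(u_i | u_{i-k}, ..., u_{i-1})."""
    total = 0.0
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - k):i]
        total += math.log(next_token_probs(context, W_e, W_p)[tokens[i]])
    return total

W_e = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 tokens, 2-dim embeddings (toy values)
W_p = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]  # learned position embeddings (toy values)
probs = next_token_probs([0, 2], W_e, W_p)
ll = log_likelihood([0, 2, 1], k=2, W_e=W_e, W_p=W_p)
```

Pre-training maximizes `log_likelihood` over the corpus; in a real model the gradient flows through the blocks and both embedding matrices.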

Training Data: the BooksCorpus dataset, which consists of about 7,000 unpublished books and roughly 5GB of text. Because it contains long stretches of contiguous text, it is well suited to learning long-range dependencies.

Tokenization: BPE (Byte Pair Encoding) with 40,000 merges.

Optimization: the Adam optimizer was used. The learning rate increased linearly from 0 to $2.5 \times 10^{-4}$ over the first 2,000 steps (linear warmup) and was then decayed with cosine annealing. Training ran for 100 epochs with a batch size of 64.
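The schedule described above can be sketched as a single function. The peak rate and warmup length come from the text; `total_steps` is an assumed input (the article does not give it), and decaying all the way to 0 is also an assumption of this sketch.

```python
import math

def gpt1_learning_rate(step, total_steps, peak_lr=2.5e-4, warmup_steps=2000):
    """Linear warmup from 0 to peak_lr, then cosine annealing toward 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

# Example with a hypothetical 100,000-step run.
schedule = [gpt1_learning_rate(s, total_steps=100_000) for s in (0, 1000, 2000, 51_000, 100_000)]
```

The warmup avoids large, noisy updates while the Adam statistics are still poorly estimated; the cosine phase then lowers the rate smoothly instead of in discrete drops.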

2.4 Stage 2: Supervised Fine-tuning

To apply the pre-trained model to a specific task, it is fine-tuned on a labeled dataset $\mathcal{C}$ . Given an input token sequence $x_1, ..., x_m$ with label $y$ , the following objective is optimized.

$$L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y \mid x_1, ..., x_m)$$

Here $P(y \mid x_1, ..., x_m) = \text{softmax}(h_l^m W_y)$ , where $h_l^m$ is the output of the final Transformer block at the last token and $W_y$ is the weight matrix of the task-specific linear head.

Key technique (Auxiliary Language Modeling Objective): GPT-1 kept the original language modeling objective as an auxiliary loss during fine-tuning. The paper reports that this improves generalization and speeds up convergence.

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$

Here $\lambda$ is the weight of the auxiliary loss; the paper uses $\lambda = 0.5$ .
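To make the combination concrete, here is a minimal sketch that treats both terms as sums of log-probabilities, exactly as in the formulas for $L_2$ and $L_1$. The input values are invented for illustration, and `fine_tuning_objective` is a hypothetical helper name.

```python
import math

def fine_tuning_objective(label_logprobs, token_logprobs, lam=0.5):
    """L_3 = L_2 + lambda * L_1, where both terms are sums of log-probabilities."""
    l2 = sum(label_logprobs)  # log P(y | x_1..x_m) for each labeled example
    l1 = sum(token_logprobs)  # log P(x_i | previous tokens) over the same inputs
    return l2 + lam * l1

# Two labeled examples and four input tokens with made-up probabilities.
l3 = fine_tuning_objective([math.log(0.9), math.log(0.8)], [math.log(0.5)] * 4)
```

Since $L_3$ is maximized, a training loop would typically minimize its negative.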

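2.5 Task-specific Input Transformation

Another important contribution of GPT-1 is a set of input transformations that let a single Transformer architecture handle many tasks. The architecture stays the same; only the input format changes.

Text Classification: the input is [Start] text [Extract], and a linear layer is applied to the output at the last token
Textual Entailment: premise and hypothesis are joined as [Start] premise [Delimiter] hypothesis [Extract]
Semantic Similarity: both orderings of the two sentences are encoded, and the two outputs are added element-wise
Multiple Choice: the context is concatenated with each candidate answer to form one sequence per choice, and the scores are normalized with a softmax

This approach is practical because it applies to a wide range of tasks with minimal changes to the model. The only additional parameters are the embeddings of the delimiter tokens and the weights $W_y$ of the final linear layer.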
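The four input formats can be sketched as small functions that return token sequences. The special-token strings below are placeholders chosen for this example (the article only names them [Start], [Delimiter], and [Extract]), and inputs are assumed to be already tokenized.

```python
START, DELIM, EXTRACT = "<s>", "$", "<e>"  # placeholder special tokens (assumed names)

def classification(text):
    return [[START, *text, EXTRACT]]

def entailment(premise, hypothesis):
    return [[START, *premise, DELIM, *hypothesis, EXTRACT]]

def similarity(a, b):
    # Both orderings are encoded; their final hidden states are added element-wise.
    return [[START, *a, DELIM, *b, EXTRACT], [START, *b, DELIM, *a, EXTRACT]]

def multiple_choice(context, answers):
    # One sequence per candidate answer; the per-sequence scores go through a softmax.
    return [[START, *context, DELIM, *ans, EXTRACT] for ans in answers]

seqs = multiple_choice(["the", "sky", "is"], [["blue"], ["green"]])
```

Each returned sequence is fed through the same pre-trained model, and only the hidden state at the extract token is passed to the task head.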

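2.6 Experimental Results and Significance

GPT-1 achieved state-of-the-art results on 9 of 12 NLP benchmarks. It clearly outperformed prior models across a range of tasks, including commonsense reasoning (86.5% accuracy on the Stories Cloze Test), semantic similarity (70.3 F1 on QQP), and question answering (59.0% accuracy on RACE).

The real significance of GPT-1, however, is not any individual benchmark score but the paradigm it established: large-scale unsupervised pre-training followed by supervised fine-tuning on a small amount of data. That paradigm carried on through BERT, RoBERTa, T5, and others, and became the standard in NLP.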

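3. GPT-2 (2019): The Possibility of Zero-shot Learning

3.1 Paper Overview

Paper: "Language Models are Unsupervised Multitask Learners"
Authors: Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever (OpenAI)
Released: February 2019

The title of the GPT-2 paper makes a bold claim: language models are unsupervised multitask learners. A model trained with nothing but the language modeling objective can perform multiple tasks without any task-specific fine-tuning.

Where GPT-1 needed two stages (pre-training, then fine-tuning), GPT-2 showed that tasks can be performed zero-shot, with no fine-tuning at all. This was a fundamental shift in paradigm.

3.2 Core Idea: Task as Language Modeling

The key insight of GPT-2 is that every NLP task can be recast as conditional language modeling.

Conventional supervised learning estimates the conditional probability $P(\text{output} \mid \text{input})$ . GPT-2 extends this to $P(\text{output} \mid \text{input}, \text{task})$ and supplies the task information in natural language.

For example:

Translation: a sequence of the form (translate to french, english text, french text), expressed as natural text
Summarization: append TL;DR: to the text to elicit a summary
Question Answering: provide the passage and the question in natural language and let the model generate the answer

The underlying claim is that when a sufficiently large language model is trained on sufficiently diverse text, the ability to perform tasks emerges on its own.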

3.3 Architecture Details

GPT-2 builds on the GPT-1 architecture with several important modifications.

Main changes:

Layer Normalization placement: moved to the input of each sub-block (Pre-norm)
Additional Layer Normalization: added after the final self-attention block
Residual weight initialization: weights on the residual path are scaled by $1/\sqrt{N}$ ( $N$ is the number of residual layers)
Context Window: 512 → 1,024 tokens
Vocabulary Size: 40,000 → 50,257 (Byte-level BPE)
Batch Size: 64 → 512

GPT-2 was trained at four sizes.

Model	Parameters	Layers	Hidden Dim	Heads	Head Dim
Small	117M	12	768	12	64
Medium	345M	24	1,024	16	64
Large	762M	36	1,280	20	64
XL	1,542M	48	1,600	25	64

In every model the head dimension is fixed at 64, and the feed-forward dimension is always four times the hidden dimension ( $d_{ff} = 4 \times d_{model}$ ).
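The regularities in the table are easy to verify programmatically. The sketch below copies the layer, hidden-dimension, and head counts from the table; the residual scale assumes two residual layers per block (attention and MLP), which is an interpretation of $N$ rather than something the article states.

```python
import math

GPT2_CONFIGS = {  # name: (layers, hidden dim, heads), copied from the table above
    "Small": (12, 768, 12),
    "Medium": (24, 1024, 16),
    "Large": (36, 1280, 20),
    "XL": (48, 1600, 25),
}

for name, (n_layer, d_model, n_head) in GPT2_CONFIGS.items():
    d_head = d_model // n_head
    d_ff = 4 * d_model
    # Assumption: each block contributes two residual layers, so N = 2 * n_layer.
    residual_scale = 1.0 / math.sqrt(2 * n_layer)
    print(f"{name}: d_head={d_head}, d_ff={d_ff}, residual scale={residual_scale:.3f}")
```

Scaling by $1/\sqrt{N}$ keeps the variance of the residual stream from growing with depth at initialization, which matters more as the layer count rises from 12 to 48.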

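3.4 The WebText Dataset

Another key contribution of GPT-2 is a new training dataset, WebText.

How the data was built:

Collect outbound links from Reddit posts that received at least 3 karma (in effect, a human quality check)
About 45 million links were collected
Extract text from the HTML with the Dragnet and Newspaper libraries
Deduplicate and apply heuristic cleaning

Dataset characteristics:

About 8 million documents
About 40GB of text
Wikipedia was deliberately excluded (to avoid data leakage into evaluation datasets)

The design philosophy of WebText was to exploit human curation while avoiding the cost of explicit labeling. Using Reddit's karma system as a quality filter went on to inspire many later dataset construction efforts.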

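3.5 Byte-level BPE

GPT-2 also introduced an important change in tokenization. Where conventional BPE operates on Unicode characters, GPT-2 applies BPE at the byte level.

Advantages of this approach:

Full coverage: any byte sequence can be encoded, so the OOV (out-of-vocabulary) problem disappears by construction
Multilingual support: different languages and special characters are handled without separate preprocessing
Base vocabulary size: 256 (the byte values) plus special tokens

Naive byte-level BPE, however, produces many inefficient merges, so GPT-2 added a rule that prevents merging across character categories. The final vocabulary has 50,257 entries.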
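A toy trainer shows why byte-level BPE has no out-of-vocabulary tokens: every string is first reduced to UTF-8 bytes, and merges only add ids above 255. This sketch deliberately omits GPT-2's rule against merging across character categories, and the function names are hypothetical.

```python
from collections import Counter

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_byte_bpe(text, num_merges):
    """Learn BPE merges over UTF-8 bytes; the base vocabulary is the 256 byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new token id
    for new_id in range(256, 256 + num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges[best] = new_id
        ids = merge(ids, best, new_id)
    return merges, ids

def decode(ids, merges):
    inverse = {v: k for k, v in merges.items()}
    def expand(t):
        if t < 256:
            return [t]
        left, right = inverse[t]
        return expand(left) + expand(right)
    return bytes(b for t in ids for b in expand(t)).decode("utf-8")

sample = "low lower lowest 낮다"
merges, ids = train_byte_bpe(sample, num_merges=5)
```

The Korean word in `sample` is encoded without any special handling because it is just a sequence of bytes; with real training data the frequent byte sequences would be merged into single tokens.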

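3.6 Zero-shot Performance and Scaling

GPT-2's zero-shot performance improved steadily with model size, an observation that foreshadowed the later work on scaling laws.

Main zero-shot results:

Language modeling: state of the art on 7 of 8 language modeling benchmarks (including domains not seen in WebText training)
Children's Book Test: 93.3% accuracy on common nouns and 89.1% on named entities (roughly 7 points above the previous SOTA in each case)
LAMBADA: perplexity 8.6 (a large improvement over the previous SOTA of 99.8)
Reading Comprehension (CoQA): 55.0 F1 (matching or exceeding 3 of 4 baselines that were trained on 127,000 examples)
Translation (WMT14): 11.5 BLEU on Fr→En and about 5 BLEU on En→Fr in the zero-shot setting; this beats some early unsupervised baselines but is far below the best unsupervised system of the time (33.5 BLEU)
Summarization (CNN/Daily Mail): elicited with the TL;DR prompt, with qualitatively meaningful results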

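3.7 The "Too Dangerous to Release" Controversy

GPT-2 drew as much attention for its release policy as for its technical results. OpenAI initially decided not to release the 1.5B-parameter model and published only the smallest, 117M model, citing a high risk of malicious use (fake news, spam, and so on).

The decision triggered a heated debate in the AI community.

Arguments in favor:

Releasing a powerful text generation model indiscriminately could enable mass production of disinformation
A precedent for responsible disclosure that accounts for social impact was needed

Arguments against:

The danger of a 1.5B-parameter model was overstated
Withholding the model hurts reproducibility for the academic community
Suspicion that the move was hype for marketing purposes

OpenAI eventually released the full model in November 2019, and the feared large-scale abuse did not materialize. Even so, the debate became an important starting point for later discussions of AI safety and responsible AI.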

4. GPT-3 (2020): The Power of In-context Learning and Scaling

4.1 Paper Overview

Paper: "Language Models are Few-Shot Learners"
Authors: Tom B. Brown, Benjamin Mann, Nick Ryder, and many others (OpenAI)
Released: May 2020 (NeurIPS 2020)

GPT-3 is a language model of unprecedented scale: 175 billion (175B) parameters. Its real innovation, however, is not size but the establishment of in-context learning as a new paradigm. The paper demonstrated that a wide range of tasks can be performed without updating the model's weights at all, simply by including a few examples in the prompt.

4.2 The In-context Learning Paradigm

The GPT-3 paper systematically compares three evaluation settings.

Zero-shot: only a natural-language task description is given

Translate English to French:
cheese =>

One-shot: task description plus 1 example

Translate English to French:
sea otter => loutre de mer
cheese =>

Few-shot: task description plus typically 10 to 100 examples (as many as fit in the context window)

Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe => girafe en peluche
cheese =>

None of the three settings involves any gradient update. The model performs the task with forward passes alone, which is the decisive difference from fine-tuning.

One interpretation of why in-context learning works is that the model picks up a broad range of task patterns during pre-training, and the examples in the prompt serve to locate and activate the relevant ability that already exists inside the model.
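The three settings differ only in how many demonstrations are placed in the prompt, which a small helper makes explicit. `build_prompt` is a hypothetical name; the `=>` separator follows the examples shown above.

```python
def build_prompt(task_description, examples, query, sep=" => "):
    """Zero-shot when `examples` is empty, one-shot with one pair, few-shot with several."""
    lines = [task_description]
    lines += [f"{src}{sep}{tgt}" for src, tgt in examples]
    # The final line is left open so the model completes the answer.
    lines.append(f"{query}{sep}".rstrip())
    return "\n".join(lines)

demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]
zero_shot = build_prompt("Translate English to French:", [], "cheese")
few_shot = build_prompt("Translate English to French:", demos, "cheese")
```

The resulting string is the entire "training signal" for the task: it is passed to the frozen model as ordinary input text, and the continuation is read off as the answer.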

4.3 Architecture Details

GPT-3 uses essentially the same architecture as GPT-2, except that, inspired by the Sparse Transformer (Child et al., 2019), it alternates dense and locally banded sparse attention patterns across layers.

GPT-3 was trained at eight sizes in order to study the effect of scale systematically.

Model	Parameters	Layers	$d_{model}$	Heads	$d_{head}$	Batch Size	Learning Rate
GPT-3 Small	125M	12	768	12	64	0.5M	$6.0 \times 10^{-4}$
GPT-3 Medium	350M	24	1,024	16	64	0.5M	$3.0 \times 10^{-4}$
GPT-3 Large	760M	24	1,536	16	96	0.5M	$2.5 \times 10^{-4}$
GPT-3 XL	1.3B	24	2,048	24	128	1M	$2.0 \times 10^{-4}$
GPT-3 2.7B	2.7B	32	2,560	32	80	1M	$1.6 \times 10^{-4}$
GPT-3 6.7B	6.7B	32	4,096	32	128	2M	$1.2 \times 10^{-4}$
GPT-3 13B	13.0B	40	5,140	40	128	2M	$1.0 \times 10^{-4}$
GPT-3 175B	175.0B	96	12,288	96	128	3.2M	$0.6 \times 10^{-4}$

The values are given as printed in the paper; note that in the XL and 13B rows $d_{model}$ does not equal heads × $d_{head}$ . Batch sizes are in tokens.

All models use a context window of 2,048 tokens and were trained on a total of 300B (300 billion) tokens. As the models grow, the learning rate is consistently lowered and the batch size raised.

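4.4 Training Data Composition

GPT-3's training data is a mixture of several sources, and a distinctive feature is that each source is sampled with a weight that reflects its quality rather than its size.

Dataset	Tokens (B)	Sampling Weight	Epochs
Common Crawl (filtered)	410	60%	0.44
WebText2	19	22%	2.9
Books1	12	8%	1.9
Books2	55	8%	0.43
Wikipedia	3	3%	3.4

Common Crawl accounts for most of the available tokens, yet its sampling weight is capped at 60%. WebText2, a higher-quality source, has only 19B tokens but receives a weight of 22%. This reflects the judgment that data quality matters more than quantity.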
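One way to quantify the re-weighting is to divide each source's sampling weight by its share of the available tokens. The numbers below are copied from the table; because the published weights are rounded (they sum to 101%), the output is only indicative, and `upsampling_factor` is a hypothetical helper.

```python
# (tokens in billions, sampling weight) copied from the table above
MIX = {
    "Common Crawl (filtered)": (410, 0.60),
    "WebText2": (19, 0.22),
    "Books1": (12, 0.08),
    "Books2": (55, 0.08),
    "Wikipedia": (3, 0.03),
}

total_tokens = sum(tokens for tokens, _ in MIX.values())

def upsampling_factor(name):
    """Sampling weight divided by the source's share of all available tokens."""
    tokens, weight = MIX[name]
    return weight / (tokens / total_tokens)

for name in MIX:
    print(f"{name}: x{upsampling_factor(name):.2f}")
```

A factor below 1 means the source is down-sampled relative to its size (Common Crawl), and a factor above 1 means it is seen disproportionately often (WebText2, Wikipedia).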

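Common Crawl filtering process:

Filter documents by similarity to high-quality reference corpora (WebText, Books, Wikipedia)
Apply fuzzy deduplication across documents
Add the reference corpora themselves to the training mix to form the final dataset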

4.5 Benchmark Performance

The few-shot performance of GPT-3 175B was impressive across a wide range of benchmarks.

Language modeling:

PTB (Penn Treebank): 20.50 perplexity (zero-shot SOTA)

Question Answering:

TriviaQA: 71.2% accuracy (few-shot, competitive with fine-tuned SOTA)
NaturalQuestions: 29.9% accuracy (few-shot)
WebQuestions: 41.5% accuracy (few-shot)

Translation:

WMT14 En→Fr: 32.6 BLEU (few-shot; 25.2 zero-shot)
WMT14 Fr→En: 39.2 BLEU (few-shot)
WMT16 En→De: 29.7 BLEU (few-shot)

SuperGLUE:

71.8 points few-shot (above fine-tuned BERT-Large at 69.0)
Still well short of the fine-tuned SOTA (89.0)

Arithmetic:

2-digit addition: 100% accuracy
3-digit addition: 80.4% accuracy
4- and 5-digit addition: accuracy drops sharply

These results show a clear scaling effect: performance improves both as the model grows and as more examples are provided in the prompt.

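4.6 Acknowledged Limitations of GPT-3

The paper is candid about GPT-3's limitations.

Text generation quality: long documents suffer from repetition, loss of coherence, and illogical statements
Limits of few-shot learning: on natural language inference (NLI) and some reading comprehension tasks, GPT-3 falls short of fine-tuned models
No bidirectional context: an inherent limitation of auto-regressive models; on some tasks bidirectional models such as BERT have the advantage
Sample efficiency: although few-shot prompting needs only a handful of examples at test time, pre-training consumes far more text than a human encounters in a lifetime
Lack of interpretability: the model's decisions are hard to understand, and the exact mechanism behind in-context learning remains unclear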

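5. InstructGPT / ChatGPT (2022): Aligning with Human Intent

5.1 Paper Overview

Paper: "Training Language Models to Follow Instructions with Human Feedback"
Authors: Long Ouyang, Jeff Wu, Xu Jiang, and many others (OpenAI)
Released: March 2022 (NeurIPS 2022)

Language models up to GPT-3 had a fundamental problem. The training objective, predicting the next token, did not match the actual goal of use: following the user's instructions helpfully and safely. However capable the model, it frequently gave irrelevant answers, generated harmful content, or stated false information with confidence.

InstructGPT is the landmark study that addressed this alignment problem with RLHF (Reinforcement Learning from Human Feedback), and the technique became the foundation of ChatGPT.

5.2 Defining the Alignment Problem

The paper groups the problems of existing language models into three categories.

Lack of Helpfulness: the model ignores the user's instruction and generates unrelated text
Lack of Truthfulness: the model generates information that is not true (hallucination)
Lack of Harmlessness: the model generates harmful or biased content

Together these form the HHH (Helpful, Honest, Harmless) criteria, and InstructGPT aims to align the model with them by using human feedback.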

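5.3 The Three-Stage RLHF Pipeline

InstructGPT's RLHF pipeline has three stages.

Step 1: Supervised Fine-Tuning (SFT)

The first stage is conventional supervised learning. Human labelers write ideal responses to prompts, and GPT-3 is fine-tuned on this data.

Data: about 13,000 (prompt, ideal response) pairs
Prompt sources: prompts written by the labelers plus prompts submitted by OpenAI API users
Training: 16 epochs with cosine learning rate decay

The SFT model acquires a basic ability to follow instructions, but it is not yet good enough. The next stage learns human preferences.

Step 2: Reward Model (RM) Training

The second stage trains a reward model that turns human preferences into a number.

Data collection:

Generate $K$ different responses to a prompt with the SFT model ( $K$ is 4 to 9)
Human labelers rank the $K$ responses by preference
This yields $\binom{K}{2}$ comparison pairs

Reward model loss:

$$\text{loss}(\theta) = -\frac{1}{\binom{K}{2}} E_{(x, y_w, y_l) \sim D} \left[ \log \sigma(r_\theta(x, y_w) - r_\theta(x, y_l)) \right]$$

Here $r_\theta(x, y)$ is the scalar output of the reward model for prompt $x$ and response $y$ , $y_w$ is the preferred response, $y_l$ is the dispreferred response, and $\sigma$ is the sigmoid function.

This loss is based on the Bradley-Terry model and trains the reward of the preferred response to be higher than that of the dispreferred one. All $\binom{K}{2}$ comparisons from one prompt are treated as a single batch element, so the reward model needs only one forward pass per response, which improves efficiency.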
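For a single prompt, the loss above reduces to an average over all ordered pairs of the ranked responses. The following sketch assumes the reward scores have already been computed and are listed from most to least preferred; `reward_model_loss` is a hypothetical name.

```python
import math
from itertools import combinations

def reward_model_loss(rewards_ranked):
    """Pairwise ranking loss for one prompt.

    `rewards_ranked` holds r_theta(x, y) for the K responses, ordered from
    most to least preferred by the labeler.
    """
    # combinations() preserves order, so each pair is (r_w, r_l) with r_w ranked higher.
    pairs = list(combinations(rewards_ranked, 2))
    log_sigmoid = lambda z: -math.log1p(math.exp(-z))
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)

agrees = reward_model_loss([2.0, 1.0, 0.0, -1.0])     # scores follow the human ranking
disagrees = reward_model_loss([-1.0, 0.0, 1.0, 2.0])  # scores invert the human ranking
```

With $K = 4$ there are 6 pairs; a reward model that scores responses in the labeler's order gets a low loss, and one that inverts the order gets a high loss.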

Data scale: comparison data collected from about 33,000 prompts
Model size: 6B parameters (the SFT model with the final unembedding layer removed and a scalar output head added)

Step 3: Reinforcement Learning with PPO

The third stage uses the trained reward model as the reward signal and optimizes the SFT model with PPO (Proximal Policy Optimization).

PPO objective:

$$\text{objective}(\phi) = E_{(x, y) \sim D_{\pi_\phi^{RL}}} \left[ r_\theta(x, y) - \beta \cdot D_{KL}(\pi_\phi^{RL}(y \mid x) \| \pi^{SFT}(y \mid x)) \right]$$

where:

$\pi_\phi^{RL}$ : the RL policy being trained (the language model)
$\pi^{SFT}$ : the reference policy obtained in the SFT stage
$r_\theta(x, y)$ : the output of the reward model
$\beta$ : the KL penalty coefficient
$D_{KL}$ : the KL divergence

Role of the KL divergence penalty:

The KL term keeps the model from drifting too far from the SFT model during RL training. Without this constraint the model can exploit weaknesses in the reward model, earning high reward while producing text that is actually meaningless, a phenomenon known as reward hacking.

The exact form of the KL divergence is:

$$D_{KL}(\pi_\phi^{RL}(\cdot \mid x) \| \pi^{SFT}(\cdot \mid x)) = \sum_y \pi_\phi^{RL}(y \mid x) \log \frac{\pi_\phi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)}$$

In practice the penalty is applied by subtracting the log-ratio of the sampled response directly from the reward. The modified reward is:

$$R(x, y) = r_\theta(x, y) - \beta \cdot \log \frac{\pi_\phi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)}$$
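The modified reward can be computed from per-token log-probabilities of the sampled response under the two policies. This is a sketch under stated assumptions: the sequence log-probability is taken as the sum of token log-probabilities, and the default `beta` is only an illustrative value.

```python
def kl_shaped_reward(rm_score, logp_rl, logp_sft, beta=0.02):
    """R(x, y) = r_theta(x, y) - beta * log(pi_RL(y|x) / pi_SFT(y|x)).

    `logp_rl` and `logp_sft` are per-token log-probabilities of the sampled
    response under the RL policy and the frozen SFT policy.
    """
    log_ratio = sum(logp_rl) - sum(logp_sft)
    return rm_score - beta * log_ratio

unchanged = kl_shaped_reward(1.0, [-1.0, -1.0], [-1.0, -1.0])  # policy equals SFT
drifted = kl_shaped_reward(1.0, [-0.1, -0.1], [-3.0, -3.0])    # policy far from SFT
```

When the RL policy assigns the response a much higher probability than the SFT policy does, the log-ratio is large and the effective reward shrinks, which is what discourages reward hacking.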

PPO-ptx: Pre-training Mix

InstructGPT also proposes the PPO-ptx variant, which mixes the language modeling objective on the original pre-training data into RL training as an auxiliary term.

$$\text{objective}(\phi) = E_{(x, y) \sim D_{\pi_\phi^{RL}}} \left[ r_\theta(x, y) - \beta \cdot D_{KL}(\pi_\phi^{RL} \| \pi^{SFT}) \right] + \gamma \cdot E_{x \sim D_{\text{pretrain}}} \left[ \log \pi_\phi^{RL}(x) \right]$$

Here $\gamma$ is the weight of the pre-training term. This term counteracts the degradation of the model's general language ability during RL training (the "alignment tax").

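5.4 A Surprising Result: A Small Model Beats a Large One

The most striking result of InstructGPT is that the 1.3B-parameter InstructGPT was preferred by human evaluators over the 175B-parameter GPT-3. A model with over 100 times fewer parameters produced responses judged more helpful, more truthful, and less harmful.

Main experimental results:

In human evaluation, InstructGPT outputs were strongly preferred over GPT-3 outputs
On public NLP benchmarks, performance was similar to GPT-3 or slightly worse (the alignment tax)
On TruthfulQA, the PPO models improved meaningfully over GPT-3
Toxic generations dropped by about 25% relative to GPT-3 when the model was prompted to be respectful

These results show that training methodology can matter more than model size. Making the model bigger is not the only answer; aligning it with human intent is essential.

5.5 From InstructGPT to ChatGPT

The InstructGPT technique became the core foundation of ChatGPT, released in November 2022. ChatGPT applies dialogue-style RLHF to GPT-3.5 (an improved version of GPT-3).

The launch of ChatGPT was a turning point in the history of AI. It reached 1 million users in 5 days and 100 million in 2 months, opening an era in which AI reaches the general public directly. Without the technical contribution of the InstructGPT paper, this would not have been possible.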

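6. GPT-4 (2023): Multimodal and Predictable Scaling

6.1 Paper Overview

Paper: "GPT-4 Technical Report"
Authors: OpenAI
Released: March 2023 (arXiv: 2303.08774)

The GPT-4 Technical Report differs fundamentally from the earlier GPT papers. Most of the key information, including architecture, model size, training data, and training cost, is undisclosed. OpenAI cited "the competitive landscape and the safety implications" as the reason. The gap between this stance and the name "Open" AI drew considerable criticism.

Even so, the report contains several important technical contributions.

6.2 Multimodal Input

The most visible new capability of GPT-4 is that it accepts images and text together as input. Output is still text only.

Examples of the multimodal capability:

Recognizing and interpreting text contained in images
Analyzing data in charts and graphs
Describing a humorous image and explaining why it is funny
Interpreting scientific diagrams and solving related problems

This capability was later developed into GPT-4V (Vision) and deployed in the product.

6.3 Predictable Scaling

The most important technical contribution of the GPT-4 report is its methodology for predictable scaling.

The core idea is that the performance of a large model can be predicted accurately from the performance of small models. OpenAI measured the performance of small models trained with the same methodology as GPT-4, predicted GPT-4's final performance from them, and compared the prediction with the actual training result.

Loss prediction: GPT-4's final loss was predicted with a power law fitted to models trained with 1,000 to 10,000 times less compute. The actual result was very close to the prediction.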
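The extrapolation idea can be illustrated with a least-squares fit in log-log space. This is a simplified sketch: it fits a pure power law $L = a \cdot C^{-b}$ and omits the irreducible-loss constant that a more careful fit would include, and the data points are synthetic values generated for the example, not measurements from any real run.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - b * log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - slope * mx)
    return a, -slope

def predict(a, b, compute):
    return a * compute ** (-b)

# Synthetic small-run data generated from L = 10 * C^-0.05 (not real measurements).
small_runs = [1e-4, 1e-3, 1e-2, 1e-1]  # compute as a fraction of the full run
losses = [10 * c ** -0.05 for c in small_runs]
a, b = fit_power_law(small_runs, losses)
full_run_loss = predict(a, b, 1.0)
```

Because the fit is linear in log-log space, four cheap runs are enough to pin down the line; the risk lies in whether the trend continues to hold several orders of magnitude beyond the fitted range.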

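HumanEval coding performance prediction: the pass rate on a coding benchmark could also be predicted from small-model results. This suggests that not only the loss but also the performance on specific tasks is predictable.

The practical value of predictable scaling is large. Before committing to a training run that costs tens to hundreds of millions of dollars, the final performance can be estimated from small experiments and the return on the investment assessed in advance.

The report also acknowledges that some phenomena remain hard to predict, such as inverse scaling and abruptly emerging abilities. Emergent abilities, which appear suddenly at a particular scale, are the main exception to predictable scaling.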

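6.4 Performance on Professional Exams

GPT-4 performed impressively on a variety of professional and academic exams designed for humans. The model received no training specific to these exams.

Exam	GPT-4 Score/Percentile	GPT-3.5 Score/Percentile	Notes
Uniform Bar Exam (MBE+MEE+MPT)	~298/400 (top 10%)	~213/400 (bottom 10%)	U.S. bar exam
LSAT	163 (top 12%)	149 (~40th)	Law school admission test
SAT Evidence-Based R&W	710/800 (93rd)	670/800 (87th)	U.S. college admission test
SAT Math	700/800 (89th)	590/800 (70th)	U.S. college admission test
GRE Quantitative	163/170 (80th)	147/170 (25th)	Graduate school admission test
GRE Verbal	169/170 (99th)	154/170 (63rd)	Graduate school admission test
AP Biology	5 (85~100th)	4 (62~85th)	AP Biology
AP Chemistry	4 (71~88th)	2 (22~46th)	AP Chemistry
AP Calculus BC	4 (43~59th)	1 (0~7th)	AP Calculus
AP English Literature	2 (8~22nd)	2 (8~22nd)	AP English Literature

A few patterns stand out:

Dramatic gains over GPT-3.5 in law, science, and mathematics (Bar Exam: bottom 10% → top 10%)
Essay-heavy literature exams remain weak even though GRE Verbal is near the ceiling (AP English Literature: 8th~22nd percentile, unchanged from GPT-3.5)
Mathematical reasoning improved but is still not top-tier (AP Calculus BC: 43rd~59th percentile)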

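6.5 Safety and Alignment Improvements

GPT-4 is also substantially safer than GPT-3.5.

RLHF-based safety training:

An additional safety reward signal is introduced during training
A GPT-4 zero-shot classifier judges safety boundaries and response style
The safety reward is applied to both allowed and disallowed categories, which prevents over-refusal of valid requests

Quantitative improvements:

The rate of responding to requests for disallowed content is 82% lower than for GPT-3.5
Responses to sensitive requests (medical advice, self-harm, and so on) follow policy 29% more often
On internal adversarial factuality evaluations, GPT-4 scores 40% higher than GPT-3.5
On TruthfulQA, the base model is only slightly better than GPT-3.5, but accuracy improves markedly after RLHF post-training

Expert red-teaming:

More than 50 domain experts (AI safety, cybersecurity, biological risk, international security, and others) took part in adversarial testing
Evaluation covered high-risk scenarios (autonomous replication, chemical and biological weapons information, and so on)

6.6 Limitations of GPT-4

The report explicitly acknowledges the following limitations.

Hallucination: the model can still generate false information "confidently." RLHF reduced the problem considerably but did not eliminate it.
Context window: limited to 8K/32K tokens at release, which constrains processing of very long documents.
Training data cutoff: the model does not know about events after its data cutoff (most of the data ends in September 2021).
Imperfect reasoning: the model can make mistakes in complex multi-step reasoning, particularly in mathematical proofs and subtle code bugs.
Bias and calibration: social biases are not fully removed, and the model's confidence does not always match its actual accuracy.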

7. In-depth Analysis of Scaling Laws

7.1 Kaplan Scaling Laws (2020)

"Scaling Laws for Neural Language Models," published by Jared Kaplan and colleagues at OpenAI around the same time as GPT-3, provided the theoretical foundation for large language model research.

Key finding (power-law relationships):

The cross-entropy loss $L$ of a language model follows a power law in each of the number of model parameters $N$ , the dataset size $D$ , and the training compute $C$ .

$$L(N) \propto N^{-\alpha_N}, \quad \alpha_N \approx 0.076$$

$$L(D) \propto D^{-\alpha_D}, \quad \alpha_D \approx 0.095$$

$$L(C) \propto C^{-\alpha_C}, \quad \alpha_C \approx 0.050$$

These relationships hold over more than 7 orders of magnitude and form very stable trend lines.

Compute-optimal allocation (Kaplan version):

To minimize loss under a fixed compute budget $C$ , the paper concluded that it is best to grow the model while using relatively little additional data. Specifically, when compute increases 10 times, the model should grow about 5.5 times while the data grows only about 1.8 times.

$$N_{\text{opt}} \propto C^{0.73}, \quad D_{\text{opt}} \propto C^{0.27}$$

This was read as "growing the model is more efficient than adding data" and served as the justification for the 175B-parameter scale of GPT-3.
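The allocation rule follows directly from the two exponents, and a quick calculation shows where the "about 5.5 times and 1.8 times" figures come from. The exponents are taken from the formula above; `kaplan_allocation` is a hypothetical helper.

```python
def kaplan_allocation(compute_multiplier, n_exp=0.73, d_exp=0.27):
    """How much to grow model size and data when compute grows by `compute_multiplier`."""
    return compute_multiplier ** n_exp, compute_multiplier ** d_exp

model_x, data_x = kaplan_allocation(10)
print(f"10x compute -> {model_x:.1f}x parameters, {data_x:.1f}x tokens")

# Effect of doubling the parameter count on loss under L(N) ∝ N^-0.076.
loss_ratio = 2 ** -0.076
print(f"2x parameters -> loss multiplied by {loss_ratio:.3f}")
```

With these rounded exponents the split comes out near 5.4 times and 1.9 times, in line with the quoted figures, and the two factors multiply back to 10. The small exponent on $N$ also shows why scaling is expensive: doubling the model reduces loss by only about 5%.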

7.2 Chinchilla Scaling Laws (2022)

An important revision of Kaplan's scaling laws came in 2022 with DeepMind's "Training Compute-Optimal Large Language Models" (the Chinchilla paper).

Key finding: existing models are under-trained.

Contrary to Kaplan's conclusion, the Chinchilla paper argued that model size and training data should be scaled in roughly equal proportion. Specifically, about 20 training tokens per parameter is compute-optimal.

$$N_{\text{opt}} \propto C^{0.50}, \quad D_{\text{opt}} \propto C^{0.50}$$

By this standard, GPT-3 (175B parameters, 300B tokens) was short of training data. Training it compute-optimally would have required about 3.5T (3.5 trillion) tokens.

Chinchilla vs. GPT-3:

Item	GPT-3	Chinchilla
Parameters	175B	70B
Training tokens	300B	1.4T
Tokens per parameter	1.7	20
MMLU (5-shot)	43.9%	67.5%
Compute	~3,640 PF-days	~6,700 PF-days

Chinchilla is 2.5 times smaller than GPT-3 but was trained on 4.7 times more data and achieves clearly higher performance. This result fundamentally changed the direction of large-model training that followed.
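The token and compute figures in this comparison can be reproduced with back-of-the-envelope arithmetic. The sketch assumes the widely used approximation $C \approx 6ND$ for training FLOPs of a dense Transformer, which is not stated in the article, so the outputs are rough estimates rather than reported numbers.

```python
PF_DAY = 1e15 * 86_400  # FLOPs in one petaflop/s-day

def training_flops(n_params, n_tokens):
    # Assumption: the common C ≈ 6 * N * D estimate for dense Transformers.
    return 6 * n_params * n_tokens

def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    return tokens_per_param * n_params

gpt3_pf_days = training_flops(175e9, 300e9) / PF_DAY
chinchilla_pf_days = training_flops(70e9, 1.4e12) / PF_DAY
tokens_needed = chinchilla_optimal_tokens(175e9)

print(f"GPT-3: ~{gpt3_pf_days:,.0f} PF-days")
print(f"Chinchilla: ~{chinchilla_pf_days:,.0f} PF-days")
print(f"Compute-optimal tokens for 175B parameters: {tokens_needed / 1e12:.1f}T")
```

Under this approximation GPT-3 comes out at roughly 3,600 PF-days and Chinchilla at roughly 6,800, and a 175B-parameter model would need 3.5T tokens to meet the 20-tokens-per-parameter rule.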

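7.3 How Scaling Laws Shaped GPT-4

GPT-4's predictable scaling is a direct application of this line of research. If the loss of small models follows a power law, the trend line can be extrapolated to predict the loss of a large model.

What the GPT-4 report showed is that this prediction is remarkably accurate. This suggests that scaling laws are more than a loose empirical observation and reflect a deep structural property of how language models learn.

This predictability has important limits, however.

Loss ≠ Capability: a lower overall loss does not necessarily translate into improvement on a specific capability
Emergent Abilities: abilities that appear suddenly at a certain scale are hard to predict with a power law
Inverse Scaling: on some tasks, performance actually gets worse as the model grows
Task-specific Variability: scaling efficiency differs greatly from task to task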

8. Overall Architecture Comparison

8.1 Generation-by-Generation Architecture Comparison Table

Item	GPT-1	GPT-2 (XL)	GPT-3 (175B)	InstructGPT	GPT-4
Release	2018.06	2019.02	2020.05	2022.03	2023.03
Parameters	117M	1,542M	175,000M	1,300M~175,000M	Undisclosed
Layers	12	48	96	96 (for 175B)	Undisclosed
Hidden Dim	768	1,600	12,288	12,288 (for 175B)	Undisclosed
Attention Heads	12	25	96	96 (for 175B)	Undisclosed
Head Dimension	64	64	128	128 (for 175B)	Undisclosed
Context Window	512	1,024	2,048	2,048	8,192 / 32,768
Vocabulary Size	40,000	50,257	50,257	50,257	~100,000 (estimated)
Training Data	BooksCorpus (5GB)	WebText (40GB)	Mixture (570GB)	GPT-3 + human feedback	Undisclosed
Training Tokens	~1B (estimated)	~10B (estimated)	300B	300B + RLHF	Undisclosed
Tokenization	BPE (40K merges)	Byte-level BPE	Byte-level BPE	Byte-level BPE	Undisclosed
Positional Enc.	Learned	Learned	Learned	Learned	Undisclosed
Activation	GELU	GELU	GELU	GELU	Undisclosed
LayerNorm	Post-norm	Pre-norm	Pre-norm	Pre-norm	Undisclosed
Training Method	LM + Fine-tuning	LM only	LM only	LM + SFT + RLHF	LM + SFT + RLHF
Multimodal	No	No	No	No	Yes (Image Input)
Sparse Attention	No	No	Yes (partial)	Yes (partial)	Undisclosed

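8.2 Evolution of Paradigms

More important than the architecture itself is the evolution of the paradigm.

GPT-1: Pre-train → Fine-tune (fine-tuning required for each task)
         ↓
GPT-2: Pre-train → Zero-shot (used directly, without fine-tuning)
         ↓
GPT-3: Pre-train → In-context Learning (tasks performed from examples alone)
         ↓
InstructGPT: Pre-train → SFT → RLHF (aligned with human feedback)
         ↓
GPT-4: Pre-train → SFT → RLHF + Multimodal (multimodal input and stronger safety)

The consistent direction of this evolution is less effort on the user's side. GPT-1 required training data and fine-tuning for every task, whereas by GPT-4 a natural-language instruction alone is enough for most tasks.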

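9. GPT's Impact: Transformation of the AI Ecosystem

9.1 ChatGPT and the Popularization of AI

The most direct impact of the GPT series is the popularization of AI through ChatGPT.

ChatGPT growth figures:

Launched on November 30, 2022
1 million users within 5 days
100 million users within 2 months (the fastest on record at the time, far ahead of TikTok's 9 months)
More than 300 million weekly active users as of late 2024

ChatGPT moved "AI" from the exclusive domain of researchers and developers to an everyday tool for the general public. That shift would not have been possible without InstructGPT's RLHF technique.

9.2 The API Economy and AI-native Services

The release of the GPT-3 API (June 2020) marked the start of the AI API economy.

New business models:

Wrapper services: specialized UX built on top of the GPT API (Jasper, Copy.ai, and others)
Vertical AI: AI solutions optimized for a specific domain (Harvey for Law, Hippocratic AI for Healthcare)
AI-augmented SaaS: AI features integrated into existing SaaS (Notion AI, GitHub Copilot, and others)
Agent frameworks: autonomous agents and tooling that use GPT as the core reasoning engine (AutoGPT, LangChain, and others)

9.3 Academic Impact

The GPT series also fundamentally changed the direction of academic research.

New research areas:

Prompt Engineering: designing prompts that maximize the effect of in-context learning
Alignment Research: alignment techniques beyond RLHF (DPO, ORPO, Constitutional AI, and others)
Mechanistic Interpretability: understanding how large models work internally
Scaling Laws: quantifying the relationship between model performance and resources
Evaluation: recognizing the limits of existing benchmarks and developing new evaluation methods

Changes in research methodology:

Research focus moved from model architecture innovation to data, training methods, and alignment
Rising compute requirements widened the gap between academic and industrial research
Open models (LLaMA, Mistral, and others) partially restored academic access

9.4 Impact on Industry and Society

Education: AI tutors, automated grading, personalized learning content
Healthcare: assistance with medical documentation, diagnostic support, drug interaction analysis
Law: case law search, contract analysis, drafting of legal advice
Software development: code generation, debugging, documentation (GitHub Copilot)
Content creation: writing assistance, translation, summarization, idea generation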

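10. Limitations and Criticisms

10.1 Hallucination

The most serious limitation of the GPT series is hallucination: generating false information with apparent confidence.

Types of hallucination:

Factual errors: nonexistent citations, wrong statistics, fabricated historical facts
Logical leaps: unjustified jumps from premise to conclusion
Self-contradiction: conflicting claims within the same conversation

Root causes:

An auto-regressive model only generates a plausible next token; it does not verify facts
The training data contains errors, and the model cannot tell them apart
RLHF may reward sounding confident and thereby encourage confident errors

GPT-4 scores about 40% higher than GPT-3.5 on OpenAI's internal factuality evaluations, but a complete solution is still far off. This remains one of the most active areas of LLM research.

10.2 Bias

Large language models reflect, and sometimes amplify, the social biases embedded in their training data.

Types of bias:

Gender bias: stereotypes about occupations, personality traits, and so on
Racial/ethnic bias: negative associations with particular groups
Cultural bias: an English-speaking, and especially U.S.-centric, worldview
Socioeconomic bias: over-representation of the perspective of particular classes

The GPT-3 paper acknowledges this explicitly and includes an analysis of bias related to gender, race, and religion. InstructGPT and GPT-4 try to reduce bias through RLHF, but completely removing the bias present in the training data itself is a fundamentally hard problem.

10.3 Environmental Cost

The environmental cost of training large models is a growing concern.

Estimated training carbon emissions:

GPT-3: about 552 tonnes CO2e (equivalent to the annual emissions of roughly 120 average U.S. cars)
GPT-4: estimated at about 15,000 tonnes CO2e (an unofficial estimate, roughly 27 times GPT-3)

Water consumption:

An academic estimate puts the fresh water consumed for data center cooling while training GPT-3 in Microsoft's U.S. data centers at about 700,000 liters

Criticism and counterarguments:

A single training run is costly, but the trained model serves hundreds of millions of users, so the per-user cost is small
Model efficiency techniques (distillation, quantization, pruning) and hardware advances are lowering the cost
On the other hand, there is concern about the Jevons paradox (efficiency gains leading to higher total consumption)

10.4 Transparency and Reproducibility

One of the most persistent criticisms of the GPT series is its lack of transparency.

GPT-1: paper, code, and model released (relatively open)
GPT-2: paper released, model released in stages (the "too dangerous" controversy)
GPT-3: paper released, model accessible only through the API
GPT-4: architecture, data, training cost, and other key information undisclosed

This trend widened the gap between the organization's practice and the name "Open" AI, and seriously damaged reproducibility for the academic community. In reaction, open models such as Meta's LLaMA and Mistral AI's Mistral/Mixtral have gained importance.

10.5 Economic Inequality and the Compute Divide

The concentration of resources required to train large models deepens economic inequality in AI research.

GPT-3 training cost: about 4.6 million dollars (estimated)
GPT-4 training cost: more than about 100 million dollars (estimated)
Investment at this scale is possible only for a handful of large companies, and universities and small labs are structurally excluded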

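11. Summary: The Legacy of GPT

The core insights that run through the five papers of the GPT series can be summarized as follows.

1. Scale is (almost) all you need

Scaling from GPT-1 (117M) to GPT-2 (1.5B) to GPT-3 (175B) was not merely "the same thing, but bigger"; it led to qualitatively new abilities. Zero-shot transfer, in-context learning, and complex reasoning were emergent abilities that appeared only at sufficient scale.

2. Alignment changes everything

InstructGPT showed that training methodology can matter more than model size. The fact that the 1.3B InstructGPT beat the 175B GPT-3 demonstrated that there is a large gap between raw capability and usefulness, and that RLHF can close it.

3. The bitter lesson revisited

Rich Sutton's "The Bitter Lesson" (general methods plus more compute beat specialized methods) was confirmed repeatedly across the GPT series. A general-purpose Transformer with large-scale pre-training proved far more effective than task-specific architectures.

4. Data is the new bottleneck

After the lesson of Chinchilla, the quantity and quality of training data joined model size as a key bottleneck. High-quality text on the internet is finite, and synthetic data generation is emerging as a new research direction.

5. Safety is not optional

From GPT-2's "too dangerous to release" controversy to GPT-4's red-teaming, safety has become a requirement rather than a choice. The more powerful AI models become, the more important safe and responsible development is.

The GPT series is not over. What GPT-5 and later models will be able to do remains to be seen, but one thing is clear: the paradigm the series established, large-scale pre-training plus alignment through human feedback, has become the foundation of modern AI, and understanding it is essential to understanding where AI is headed.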

12. References

GPT-1: Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). "Improving Language Understanding by Generative Pre-Training." OpenAI Paper
GPT-2: Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). "Language Models are Unsupervised Multitask Learners." OpenAI Paper
GPT-3: Brown, T. B., Mann, B., Ryder, N., et al. (2020). "Language Models are Few-Shot Learners." NeurIPS 2020. arXiv:2005.14165
InstructGPT: Ouyang, L., Wu, J., Jiang, X., et al. (2022). "Training Language Models to Follow Instructions with Human Feedback." NeurIPS 2022. arXiv:2203.02155
GPT-4: OpenAI. (2023). "GPT-4 Technical Report." arXiv:2303.08774
Scaling Laws: Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361
Chinchilla: Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). "Training Compute-Optimal Large Language Models." arXiv:2203.15556
Transformer: Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need." NeurIPS 2017. arXiv:1706.03762
PPO: Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). "Proximal Policy Optimization Algorithms." arXiv:1707.06347
RLHF: Christiano, P. F., Leike, J., Brown, T., et al. (2017). "Deep Reinforcement Learning from Human Preferences." NeurIPS 2017. arXiv:1706.03741
Sparse Transformer: Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). "Generating Long Sequences with Sparse Transformers." arXiv:1904.10509
BPE: Sennrich, R., Haddow, B., & Birch, A. (2016). "Neural Machine Translation of Rare Words with Subword Units." ACL 2016. arXiv:1508.07909
Carbon Footprint: Patterson, D., Gonzalez, J., Le, Q., et al. (2021). "Carbon Emissions and Large Neural Network Training." arXiv:2104.10350

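Related Series and Recommended Posts

Build Your Own GPT: Training from Scratch with nanoGPT (code a GPT yourself)
A Complete Guide to Math for AI (the math needed to understand the Transformer)
Attention Is All You Need Analysis (the original Transformer paper)
BERT Analysis (the Encoder model that rivals GPT)
RWKV: Reinventing RNNs (an alternative architecture to the Transformer)
vLLM Inference Optimization (serving GPT models)
LLM Quantization GPTQ/AWQ/GGUF (making large models lighter)

GitHub

nanoGPT (Andrej Karpathy)
ai-model-analysis (this blog's collection of code-level analyses)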

Complete Analysis of the GPT Series Papers: The Journey from GPT-1 to GPT-4, How Language Models Changed the World

1. GPT Series Overview and Timeline
2. GPT-1 (2018): The Beginning of Generative Pre-Training
3. GPT-2 (2019): The Possibility of Zero-shot Learning
4. GPT-3 (2020): The Power of In-context Learning and Scaling
5. InstructGPT / ChatGPT (2022): Aligning with Human Intent
6. GPT-4 (2023): Multimodal and Predictable Scaling
7. In-depth Analysis of Scaling Laws
8. Overall Architecture Comparison
- 8.1 Generation-by-Generation Architecture Comparison Table
- 8.2 Evolution of Paradigms
9. GPT's Impact: Transformation of the AI Ecosystem
10. Limitations and Criticisms
11. Summary: The Legacy of GPT
12. References
Related Series and Recommended Posts
- GitHub

1. GPT Series Overview and Timeline

GPT (Generative Pre-trained Transformer) is a series of Large Language Models (LLMs) published by OpenAI since 2018. True to its name "Generative Pre-trained Transformer," it established the paradigm of performing unsupervised pre-training on large-scale text data based on the Transformer Decoder architecture, then applying it to various downstream tasks.

The GPT series did not simply grow in model size -- each generation redefined how language models are utilized. The journey in chronological order is as follows:

Generation	Release	Paper Title	Key Keywords	Parameters
GPT-1	2018.06	Improving Language Understanding by Generative Pre-Training	Unsupervised Pre-training + Supervised Fine-tuning	117M
GPT-2	2019.02	Language Models are Unsupervised Multitask Learners	Zero-shot Transfer, WebText	1.5B
GPT-3	2020.05	Language Models are Few-Shot Learners	In-context Learning, Scaling Laws	175B
InstructGPT	2022.03	Training Language Models to Follow Instructions with Human Feedback	RLHF, Human Alignment	1.3B~175B
GPT-4	2023.03	GPT-4 Technical Report	Multimodal, Predictable Scaling	Undisclosed

It is noteworthy that each generation's paper title carries its core message. GPT-1 declared "improving language understanding through generative pre-training," GPT-2 claimed "language models are unsupervised multitask learners," and GPT-3 went a step further with "language models are few-shot learners." InstructGPT presented a practical direction of "training to follow instructions with human feedback," and GPT-4 was simply published as a "technical report," hinting at its commercial transition.

In this article, we analyze each paper's key contributions, architecture details, training methodology, and impact on subsequent research, together with equations.

2. GPT-1 (2018): The Beginning of Generative Pre-Training

2.1 Paper Overview

Paper: "Improving Language Understanding by Generative Pre-Training" Authors: Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever (OpenAI) Released: June 2018

The core idea of GPT-1 is surprisingly simple. Pre-train a language model on large-scale unlabeled text, then fine-tune it on a specific task with a small amount of labeled data. This two-stage approach (Semi-supervised Learning) transformed the NLP landscape at the time.

In 2018, NLP was dominated by task-specific architectures. The standard practice was to design a separate model for each task, such as sentiment analysis, question answering, or textual entailment, and to train it only on that task's labeled data. Against this backdrop, GPT-1 proposed a different path: "general-purpose pre-training" followed by light task adaptation.

2.2 Architecture Details

GPT-1 adopted an architecture using only the Decoder blocks of the Transformer. While the original Transformer (Vaswani et al., 2017) had an Encoder-Decoder structure, GPT-1 chose a Decoder-only structure suitable for auto-regressive language modeling.

Model Configuration:

Number of Layers: 12 Transformer Decoder blocks
Hidden Dimension: 768
Number of Attention Heads: 12 (64 dimensions each)
Feed-Forward Dimension: 3,072 ( $= 768 \times 4$ )
Context Window: 512 tokens
Total Parameters: Approximately 117M (117 million)
Activation Function: GELU (Gaussian Error Linear Unit)
Positional Encoding: Learned Positional Embedding

Instead of the fixed Sinusoidal Positional Encoding used in the original Transformer, GPT-1 adopted learned positional embeddings. This allowed the model to learn positional information directly from data, enabling more flexible adaptation to various tasks.
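As a sanity check on the "approximately 117M" figure, the configuration above can be turned into a rough parameter count. This is a minimal sketch, not the paper's own accounting: it assumes a vocabulary of about 40,000 entries (taken from the 40,000 BPE merges; the exact vocabulary size is not given here), biases on every linear layer, two LayerNorms per block, and an output layer tied to the token embeddings. The function name `decoder_param_count` is hypothetical.

```python
def decoder_param_count(n_layers, d_model, vocab_size, n_ctx):
    """Rough parameter count for a GPT-style decoder-only Transformer."""
    d_ff = 4 * d_model
    # Self-attention: Q, K, V and output projections (weights + biases).
    attn = 4 * d_model * d_model + 4 * d_model
    # Feed-forward: two linear layers (weights + biases).
    ffn = 2 * d_model * d_ff + d_ff + d_model
    # Two LayerNorms per block, each with a gain and a bias vector.
    norms = 2 * 2 * d_model
    per_block = attn + ffn + norms
    # Token embeddings (shared with the output layer) plus learned position embeddings.
    embeddings = vocab_size * d_model + n_ctx * d_model
    return n_layers * per_block + embeddings


# vocab_size=40_000 is an approximation, see the note above.
gpt1 = decoder_param_count(n_layers=12, d_model=768, vocab_size=40_000, n_ctx=512)
print(f"{gpt1 / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands at roughly 116M, close to the reported size. About three quarters of the parameters sit in the 12 Transformer blocks and the rest in the embedding tables.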

2.3 Stage 1: Unsupervised Pre-training

In the pre-training stage, the standard language modeling objective is optimized over a large-scale unlabeled text corpus $\mathcal{U} = \{u_1, u_2, ..., u_n\}$ .

L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, ..., u_{i-1}; \Theta)

Here, $k$ is the context window size and $\Theta$ represents the model parameters. This is a typical Auto-regressive Language Modeling objective that maximizes the probability of the next token given the previous $k$ tokens.
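The objective can be made concrete with a few lines of code. The sketch below sums the log-probability of each token given at most $k$ preceding tokens. The `prob_fn` argument stands in for the model, and the uniform distribution used here is a made-up placeholder, not anything from the paper.

```python
import math


def lm_log_likelihood(tokens, k, prob_fn):
    """Sum of log P(u_i | u_{i-k}, ..., u_{i-1}) over the sequence."""
    total = 0.0
    for i, token in enumerate(tokens):
        context = tokens[max(0, i - k):i]  # at most k previous tokens
        total += math.log(prob_fn(context, token))
    return total


# Hypothetical stand-in for the model: a uniform distribution over 4 tokens.
def uniform_prob(context, token, vocab_size=4):
    return 1.0 / vocab_size


score = lm_log_likelihood([0, 3, 1, 2, 2], k=3, prob_fn=uniform_prob)
print(score)
```

Training maximizes this quantity (equivalently, minimizes its negative) with respect to the parameters hidden inside `prob_fn`.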

Specifically, each token's representation is computed as follows:

h_0 = UW_e + W_p

h_l = \text{transformer\_block}(h_{l-1}), \quad l \in [1, n]

P(u) = \text{softmax}(h_n W_e^T)

Here, $U = (u_{-k}, ..., u_{-1})$ is the context token vector, $W_e$ is the token embedding matrix, and $W_p$ is the position embedding matrix. Output probabilities are computed by reusing the token embedding matrix $W_e$ (Weight Tying).
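To illustrate how the same matrix $W_e$ is used on both the input and output side, here is a toy sketch in plain Python. The sizes and matrix values are invented for illustration, and the Transformer blocks are skipped entirely (treated as identity), so this only shows the embedding step and the tied output projection.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


# Toy sizes (hypothetical): vocabulary of 3 tokens, hidden size 2, context of 2 positions.
W_e = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # token embedding matrix, shape (vocab, d)
W_p = [[0.1, 0.0], [0.0, 0.1]]              # position embedding matrix, shape (ctx, d)


def embed(token_ids):
    """h_0: token embedding plus position embedding at each position."""
    return [[e + p for e, p in zip(W_e[t], W_p[pos])] for pos, t in enumerate(token_ids)]


def next_token_probs(h_last):
    """softmax(h W_e^T): the logits reuse the token embedding matrix (weight tying)."""
    logits = [sum(h * w for h, w in zip(h_last, row)) for row in W_e]
    return softmax(logits)


h0 = embed([0, 2])
# The Transformer blocks are omitted here, so h_n is simply h_0.
probs = next_token_probs(h0[-1])
print(probs)
```

Because the output projection is $W_e^T$, no separate output matrix has to be learned, which saves vocabulary-size times hidden-size parameters.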

Training Data: The BooksCorpus dataset was used, consisting of approximately 7,000 unpublished books containing about 5GB of text. The abundance of long-form text made it suitable for learning long-range dependencies.

Tokenization: BPE (Byte Pair Encoding) was used with 40,000 merges to construct the vocabulary.

Optimization: The Adam Optimizer was used with a learning rate that linearly increased from 0 to $2.5 \times 10^{-4}$ during the first 2,000 steps (Linear Warmup), then decreased with Cosine Annealing. Batch Size was 64, trained for 100 epochs.
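The learning-rate schedule described above can be sketched as a single function. The warmup length and peak rate come from the text; the total number of steps is not stated, so `total_steps` is left as a parameter and the value used in the example is hypothetical.

```python
import math


def gpt1_style_lr(step, total_steps, peak_lr=2.5e-4, warmup_steps=2000):
    """Linear warmup from 0 to peak_lr, then cosine annealing down to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


# total_steps=100_000 is an assumed value for illustration only.
for step in (0, 1000, 2000, 51_000, 100_000):
    print(step, gpt1_style_lr(step, total_steps=100_000))
```

The warmup avoids large, noisy updates while Adam's moment estimates are still unreliable, and the cosine decay lets the model settle as training ends.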

2.4 Stage 2: Supervised Fine-tuning

To apply the pre-trained model to a specific task, it is fine-tuned with labeled data $\mathcal{C}$ . Given an input token sequence $x_1, ..., x_m$ with corresponding label $y$ , the following objective is optimized:

L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y \mid x_1, ..., x_m)

Here, $P(y \mid x_1, ..., x_m) = \text{softmax}(h_l^m W_y)$ , where $h_l^m$ is the last token output of the final Transformer block, and $W_y$ is the weight of the task-specific Linear Head.

Key Technique -- Auxiliary Language Modeling Objective: GPT-1 also used the original language modeling objective as an auxiliary loss during fine-tuning. This had the effect of improving generalization performance and accelerating convergence.

L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})

Here, $\lambda$ is the weight of the auxiliary loss, and the paper used $\lambda = 0.5$ .

2.5 Task-specific Input Transformation

Another important contribution of GPT-1 was presenting input transformation techniques to handle various tasks with a single Transformer architecture. Without changing the architecture itself, it adapted to multiple tasks by only changing the input format.

Text Classification: Input as [Start] text [Extract] and apply a Linear Layer to the last token's output
Textual Entailment: Connect two sentences as [Start] premise [Delimiter] hypothesis [Extract]
Semantic Similarity: Reverse the order of two sentences to create two inputs, and element-wise add their outputs
Multiple Choice: Individually concatenate each choice with context to create multiple sequences, and normalize with Softmax

This approach was very practical in that it could be applied to various tasks with minimal changes to the model architecture. The only additional parameters were the delimiter token embeddings and the final Linear Layer weights $W_y$ .
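The four input formats can be sketched as simple string builders. The special-token strings below (`<s>`, `<$>`, `<e>`) are hypothetical placeholders; in the real model the start, delimiter, and extract tokens are learned embeddings rather than literal text.

```python
# Hypothetical special-token strings; the actual tokens are learned embeddings.
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"


def classification_input(text):
    return [f"{START} {text} {EXTRACT}"]


def entailment_input(premise, hypothesis):
    return [f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"]


def similarity_inputs(a, b):
    # Both orderings are encoded; their final representations are added element-wise.
    return [f"{START} {a} {DELIM} {b} {EXTRACT}", f"{START} {b} {DELIM} {a} {EXTRACT}"]


def multiple_choice_inputs(context, choices):
    # One sequence per candidate answer; the per-sequence scores are normalized with a softmax.
    return [f"{START} {context} {DELIM} {c} {EXTRACT}" for c in choices]


print(entailment_input("A man is sleeping.", "A person is awake."))
print(multiple_choice_inputs("The sky is", ["blue", "loud", "salty"]))
```

Every task reduces to one or more flat token sequences, so the same pre-trained Transformer processes all of them and only the small head on top differs.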

2.6 Experimental Results and Significance

GPT-1 achieved state-of-the-art results on 9 out of 12 NLP benchmarks. In particular, it significantly outperformed existing models in Commonsense Reasoning (86.5% accuracy on Stories Cloze Test), Semantic Similarity (70.3 F1 on QQP), and Question Answering (59.0% accuracy on RACE).

However, the true significance of GPT-1 lies not in individual benchmark performance but in establishing the paradigm of "large-scale unsupervised pre-training + small-scale supervised fine-tuning." This paradigm continued with BERT, RoBERTa, T5, and others, becoming the standard in NLP.

3. GPT-2 (2019): The Possibility of Zero-shot Learning

3.1 Paper Overview

Paper: "Language Models are Unsupervised Multitask Learners" Authors: Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever (OpenAI) Released: February 2019

The paper title of GPT-2 carries a bold claim: "Language models are unsupervised multitask learners." That is, despite being trained with a single objective of language modeling, the model can perform multiple tasks without separate fine-tuning.

While GPT-1 required two stages of "pre-training then fine-tuning," GPT-2 demonstrated that tasks can be performed zero-shot without fine-tuning. This was a fundamental paradigm shift.

3.2 Core Idea: Task as Language Modeling

The core insight of GPT-2 is that all NLP tasks can be reformulated as conditional language modeling.

Traditional supervised learning learns the conditional probability $P(\text{output} \mid \text{input})$ . GPT-2 extends this to the form $P(\text{output} \mid \text{input}, \text{task})$ , providing task information in natural language.

For example:

Translation: Expressing sequences of the form (translate to french, english text, french text) as natural text
Summarization: Appending TL;DR: after text to elicit a summary
Question Answering: Providing context and questions in natural language to generate answers

The key to this idea is that if a sufficiently large language model learns sufficiently diverse text, task performance capabilities naturally emerge.

3.3 Architecture Details

GPT-2 is based on the GPT-1 architecture with several important modifications.

Key Changes:

Layer Normalization Position Change: Moved to the input side of each sub-block (Pre-norm)
Additional Layer Normalization: Added after the final Self-attention block
Residual Weight Initialization: Residual path weights scaled by $1/\sqrt{N}$ ( $N$ is the number of Residual Layers)
Context Window Expansion: 512 to 1,024 tokens
Vocabulary Size Expansion: 40,000 to 50,257 (Byte-level BPE)
Batch Size Expansion: 64 to 512

GPT-2 trained four model sizes:

Model	Parameters	Layers	Hidden Dim	Heads	Head Dim
Small	117M	12	768	12	64
Medium	345M	24	1,024	16	64
Large	762M	36	1,280	20	64
XL	1,542M	48	1,600	25	64

The Head Dimension is fixed at 64 across all models, and the Feed-forward Layer dimension is always 4 times the Hidden Dimension ( $d_{ff} = 4 \times d_{model}$ ).
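These two regularities are easy to verify from the table. The sketch below recomputes the head dimension and feed-forward width for each size. The dictionary simply restates the table values, and the helper name `derived_dims` is hypothetical.

```python
# (layers, d_model, heads) as listed in the table above.
GPT2_CONFIGS = {
    "Small": (12, 768, 12),
    "Medium": (24, 1024, 16),
    "Large": (36, 1280, 20),
    "XL": (48, 1600, 25),
}


def derived_dims(d_model, n_heads):
    """Return (head_dim, d_ff) for a given hidden size and head count."""
    assert d_model % n_heads == 0, "hidden size must split evenly across heads"
    return d_model // n_heads, 4 * d_model


derived = {name: derived_dims(d, h) for name, (_, d, h) in GPT2_CONFIGS.items()}
for name, (head_dim, d_ff) in derived.items():
    print(name, head_dim, d_ff)
```

Scaling therefore happens by adding layers and heads while each head stays the same width.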

3.4 WebText Dataset

Another key contribution of GPT-2 is the WebText training dataset.

Data Construction Method:

Collected external links with 3+ Karma on Reddit (effectively human-vetted quality)
Collected approximately 45 million links
Extracted text from HTML using Dragnet and Newspaper libraries
Deduplication and heuristic-based cleaning

Dataset Characteristics:

Approximately 8 million documents
Approximately 40GB of text
Wikipedia was intentionally excluded (to prevent data leakage with evaluation datasets)

The design philosophy of WebText was "leverage human curation while avoiding explicit labeling costs." The idea of using Reddit's Karma system as a quality filter inspired many subsequent dataset constructions.

3.5 Byte-level BPE

GPT-2 also introduced an important innovation in tokenization. While existing BPE operates at the Unicode character level, GPT-2 applied BPE at the byte level.

Advantages of this approach:

Complete Coverage: Since any byte sequence can be encoded, the OOV (Out-of-Vocabulary) problem is fundamentally eliminated
Multilingual Support: Various languages and special characters can be processed without separate preprocessing
Base Vocabulary Size: 256 (number of bytes) + special tokens

However, since naive byte-level BPE generates many inefficient merges, GPT-2 added rules to prevent merging characters of different categories. The final vocabulary size is 50,257.
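The "complete coverage" property follows directly from working on UTF-8 bytes, as the short sketch below shows. It covers only the base byte-level step, not the BPE merges themselves. The breakdown of 50,257 into 256 byte tokens, 50,000 merges, and one end-of-text token is how the GPT-2 vocabulary is usually described; treat it as an assumption, since the text above states only the total.

```python
def to_byte_tokens(text):
    """Map any string to base tokens in the range 0..255 via its UTF-8 bytes."""
    return list(text.encode("utf-8"))


def from_byte_tokens(tokens):
    """Invert to_byte_tokens."""
    return bytes(tokens).decode("utf-8")


sample = "GPT-2 한국어 ✓"
tokens = to_byte_tokens(sample)
print(len(sample), "characters ->", len(tokens), "byte tokens")

# Assumed vocabulary arithmetic: byte tokens + learned merges + end-of-text token.
vocab_size = 256 + 50_000 + 1
print(vocab_size)
```

Non-Latin text expands into several bytes per character before merges are applied, which is why byte-level tokenizers tend to use more tokens for such languages even though nothing is ever out of vocabulary.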

3.6 Zero-shot Performance and Scaling

GPT-2's zero-shot performance consistently improved with model size. This was a precursor to the later Scaling Laws research.

Key Zero-shot Results:

Language Modeling: State-of-the-art on 7 out of 8 Language Modeling benchmarks (including domains not in WebText training)
Children's Book Test: 93.3% accuracy on common nouns and 89.1% on named entities (roughly +7% over previous SOTA on each)
LAMBADA: Perplexity 8.6 (drastically improved from previous SOTA of 99.8)
Reading Comprehension (CoQA): 55.0 F1 (surpassing 3 out of 4 existing models trained with 127,000 examples)
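Translation (WMT14 Fr-En): 11.5 BLEU zero-shot (better than several earlier unsupervised translation baselines, but far below the best unsupervised system at 33.5 BLEU; En-Fr reached only about 5 BLEU)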
Summarization (CNN/Daily Mail): Elicited with TL;DR prompt, qualitatively meaningful results

3.7 "Too Dangerous to Release" Controversy

GPT-2 received as much attention for its release policy as for its technical achievements. OpenAI initially decided not to release the 1.5B parameter model, releasing only the smallest 117M model. The reason was "the risk of malicious use (fake news, spam, etc.) is significant."

This decision sparked intense debate in the AI community.

Supporting Arguments:

Unrestricted release of powerful text generation models could be exploited for mass production of disinformation
A precedent for Responsible Disclosure considering societal impact was needed

Critical Arguments:

The danger of the 1.5B parameter model was exaggerated
It hinders reproducibility in the academic community
Suspicions of marketing-driven exaggeration

Eventually, OpenAI released the full model in November 2019, and the feared large-scale misuse did not materialize. However, this debate became an important catalyst for subsequent AI Safety and Responsible AI discussions.

4. GPT-3 (2020): The Power of In-context Learning and Scaling

4.1 Paper Overview

Paper: "Language Models are Few-Shot Learners" Authors: Tom B. Brown, Benjamin Mann, Nick Ryder and many others (OpenAI) Released: May 2020 (NeurIPS 2020)

GPT-3 is a language model of unprecedented scale with 175 billion (175B) parameters. However, the true innovation of GPT-3 is not its size but establishing the new paradigm of In-context Learning. It proved that various tasks can be performed without updating the model weights at all, simply by including a few examples in the prompt.

4.2 In-context Learning Paradigm

The GPT-3 paper systematically compared three evaluation conditions.

Zero-shot: Only a task description provided in natural language

Translate English to French:
cheese =>

One-shot: Task description + 1 example provided

Translate English to French:
sea otter => loutre de mer
cheese =>

Few-shot: Task description + 10-100 examples provided (within the context window limit)

Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe => girafe en peluche
cheese =>

All three conditions involve absolutely no gradient updates. The model performs tasks purely through forward passes. This is the decisive difference from fine-tuning.
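Since the three conditions differ only in how many demonstrations appear in the prompt, a single helper can build all of them. This is a minimal sketch with a hypothetical function name, `build_prompt`, and it produces only the text; sending it to a model is outside its scope.

```python
def build_prompt(task_description, examples, query, sep=" => "):
    """Assemble a zero-, one-, or few-shot prompt; len(examples) sets the number of shots."""
    lines = [task_description]
    lines += [f"{src}{sep}{tgt}" for src, tgt in examples]
    lines.append(f"{query}{sep}".rstrip())  # leave the answer slot open for the model
    return "\n".join(lines)


demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]
zero_shot = build_prompt("Translate English to French:", [], "cheese")
few_shot = build_prompt("Translate English to French:", demos, "cheese")
print(few_shot)
```

The model weights are identical in every case; only the conditioning text changes, which is exactly what separates in-context learning from fine-tuning.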

The paper's interpretation of why in-context learning works is that during pre-training, the model naturally learns various task patterns, and the examples in the prompt serve to "locate and activate" relevant abilities that already exist within the model.

4.3 Architecture Details

GPT-3 uses essentially the same architecture as GPT-2, but inspired by Sparse Transformer (Child et al., 2019), alternates between Dense and Locally Banded Sparse Attention patterns.

GPT-3 trained 8 model sizes to systematically analyze scaling effects.

Model Name	Parameters	Layers	$d_{model}$	Heads	$d_{head}$	Batch Size	Learning Rate
GPT-3 Small	125M	12	768	12	64	0.5M	$6.0 \times 10^{-4}$
GPT-3 Medium	350M	24	1,024	16	64	0.5M	$3.0 \times 10^{-4}$
GPT-3 Large	760M	24	1,536	16	96	0.5M	$2.5 \times 10^{-4}$
GPT-3 XL	1.3B	24	2,048	24	128	1M	$2.0 \times 10^{-4}$
GPT-3 2.7B	2.7B	32	2,560	32	80	1M	$1.6 \times 10^{-4}$
GPT-3 6.7B	6.7B	32	4,096	32	128	2M	$1.2 \times 10^{-4}$
GPT-3 13B	13.0B	40	5,140	40	128	2M	$1.0 \times 10^{-4}$
GPT-3 175B	175.0B	96	12,288	96	128	3.2M	$0.6 \times 10^{-4}$

All models use a 2,048 token context window and were trained on a total of 300B (300 billion) tokens. A consistent pattern of decreasing learning rate and increasing batch size with larger models was applied.

4.4 Training Data Composition

GPT-3's training data is a mixture of multiple sources, with the notable characteristic of applying differential training weights based on each source's quality.

Dataset	Tokens (B)	Training Weight	Epoch
Common Crawl (filtered)	410	60%	0.44
WebText2	19	22%	2.9
Books1	12	8%	1.9
Books2	55	8%	0.43
Wikipedia	3	3%	3.4

A notable point is that while Common Crawl accounts for most of the tokens, its training weight is limited to 60%. In contrast, the high-quality WebText2 with only 19B tokens is given a high weight of 22%. This reflects the judgment that data quality is more important than quantity.
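The gap between a source's share of available tokens and its sampling weight can be checked numerically. The sketch below restates the table, computes each source's token share, and draws sources according to the mixture weights. The sampler is an illustrative assumption about how such weights might be applied, not a description of OpenAI's data loader.

```python
import random

# (tokens in billions, sampling weight) from the table above.
SOURCES = {
    "Common Crawl (filtered)": (410, 0.60),
    "WebText2": (19, 0.22),
    "Books1": (12, 0.08),
    "Books2": (55, 0.08),
    "Wikipedia": (3, 0.03),
}


def token_share(name):
    """Fraction of all available tokens that a source contributes."""
    total = sum(tokens for tokens, _ in SOURCES.values())
    return SOURCES[name][0] / total


def sample_sources(n, seed=0):
    """Draw the source of n training documents in proportion to the mixture weights."""
    rng = random.Random(seed)
    names = list(SOURCES)
    weights = [w for _, w in SOURCES.values()]  # rounded weights; choices() renormalizes
    return rng.choices(names, weights=weights, k=n)


draws = sample_sources(10_000)
webtext_rate = draws.count("WebText2") / len(draws)
print(f"WebText2: {token_share('WebText2'):.1%} of tokens, {webtext_rate:.1%} of samples")
```

WebText2 makes up under 4% of the available tokens but is drawn roughly one time in five, so it is seen several times during training while most of Common Crawl is seen less than once.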

Common Crawl Filtering Process:

Document filtering based on similarity with high-quality reference corpora (WebText, Books, Wikipedia)
Fuzzy deduplication between documents
Adding reference corpora to the training data for the final composition

4.5 Benchmark Performance

GPT-3 175B's few-shot performance was impressive across various benchmarks.

Language Modeling:

PTB (Penn Treebank): 20.50 Perplexity (Zero-shot SOTA)

Question Answering:

TriviaQA: 71.2% accuracy (Few-shot, competitive with Fine-tuned SOTA)
NaturalQuestions: 29.9% accuracy (Few-shot)
WebQuestions: 41.5% accuracy (Few-shot)

Translation:

WMT14 En to Fr: 32.6 BLEU (Few-shot; 25.2 Zero-shot)
WMT14 Fr to En: 39.2 BLEU (Few-shot)
WMT16 En to De: 29.7 BLEU (Few-shot)

SuperGLUE:

Achieved 71.8 points with Few-shot (surpassing Fine-tuned BERT-Large at 69.0)
However, did not reach Fine-tuned SOTA (89.0 points)

Arithmetic Reasoning:

2-digit addition: 100% accuracy
3-digit addition: 80.4% accuracy
4-5 digit addition: rapid decline

These results demonstrated a clear scaling effect where performance improves with larger model size and more provided examples.

4.6 GPT-3's Recognized Limitations

The paper also candidly described GPT-3's limitations.

Text Generation Quality: Issues with repetition, loss of coherence, and illogical statements during long document generation
Limitations of Few-shot: Underperforming fine-tuning-based models on natural language inference (NLI) and some reading comprehension tasks
Absence of Bidirectional Context: An inherent limitation of auto-regressive models, with tasks where bidirectional models like BERT have advantages
Sample Efficiency: While humans learn new tasks from one or two examples, GPT-3 requires tens to hundreds of examples
Lack of Interpretability: Difficulty understanding the model's decision-making process, and the exact mechanism of in-context learning remains unclear

5. InstructGPT / ChatGPT (2022): Aligning with Human Intent

5.1 Paper Overview

Paper: "Training Language Models to Follow Instructions with Human Feedback" Authors: Long Ouyang, Jeff Wu, Xu Jiang and many others (OpenAI) Released: March 2022 (NeurIPS 2022)

Language models up to GPT-3 had a fundamental problem: the training objective of "next token prediction" did not align with the actual use purpose of "following user instructions usefully and safely." No matter how capable a large language model was, it frequently gave irrelevant answers to questions, generated harmful content, or confidently stated inaccurate information.

InstructGPT is a groundbreaking study that solved this Alignment Problem with RLHF (Reinforcement Learning from Human Feedback). And this technology became the foundation of ChatGPT.

5.2 Definition of the Alignment Problem

The paper classified the problems of existing language models into three categories:

Lack of Helpfulness: Not following user instructions and generating irrelevant text
Lack of Truthfulness: Generating factually incorrect information (Hallucination)
Lack of Harmlessness: Generating harmful or biased content

These three combined form the HHH (Helpful, Honest, Harmless) criteria, and InstructGPT aimed to align the model to these criteria using human feedback.

5.3 RLHF 3-Stage Pipeline

InstructGPT's RLHF pipeline consists of three stages.

Step 1: Supervised Fine-Tuning (SFT)

The first stage is traditional supervised learning. Human labelers directly write ideal responses to prompts, and GPT-3 is fine-tuned with this data.

Data: Approximately 13,000 (prompt, ideal response) pairs
Prompt Sources: Prompts written by labelers + prompts submitted by OpenAI API users
Training: 16 epochs, Cosine Learning Rate Decay

The SFT model provides basic instruction-following capability, but it is not yet complete. The next stage learns human preferences.

Step 2: Reward Model (RM) Training

In the second stage, a Reward Model that quantifies human preferences is trained.

Data Collection Process:

Generate $K$ different responses for one prompt using the SFT model ( $K$ ranges from 4 to 9)
Human labelers rank the $K$ responses by preference
Generate $\binom{K}{2}$ comparison pairs

Reward Model Loss Function:

\text{loss}(\theta) = -\frac{1}{\binom{K}{2}} E_{(x, y_w, y_l) \sim D} \left[ \log \sigma(r_\theta(x, y_w) - r_\theta(x, y_l)) \right]

Here, $r_\theta(x, y)$ is the scalar output of the Reward Model for prompt $x$ and response $y$ , $y_w$ is the preferred response, $y_l$ is the non-preferred response, and $\sigma$ is the Sigmoid function.

This loss function is based on the Bradley-Terry model, training so that the reward of the preferred response is higher than that of the non-preferred response. All $\binom{K}{2}$ comparison pairs from a single prompt are treated as one batch element, so the Reward Model needs only one forward pass per response rather than one per pair, which is cheaper and also reduces overfitting to highly correlated pairs.

Data Scale: Comparison data collected from approximately 33,000 prompts
Model Size: 6B parameters (removing the final unembedding layer from the SFT model and adding a scalar output head)
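For a single prompt, the Reward Model loss above reduces to an average of log-sigmoid terms over all ranked pairs. The sketch below computes it from a list of scalar rewards already sorted by the labeler's ranking. The function name and the input convention (best response first) are assumptions made for illustration.

```python
import math
from itertools import combinations


def reward_model_loss(rewards_ranked):
    """Pairwise ranking loss for one prompt.

    rewards_ranked holds the scalar rewards r(x, y) of K responses,
    ordered from most preferred to least preferred by the labeler.
    """
    pairs = list(combinations(range(len(rewards_ranked)), 2))  # C(K, 2) pairs
    total = 0.0
    for w, l in pairs:  # index w is ranked above index l
        diff = rewards_ranked[w] - rewards_ranked[l]
        total += math.log(1.0 / (1.0 + math.exp(-diff)))  # log sigmoid(diff)
    return -total / len(pairs)


print(reward_model_loss([0.0, 0.0, 0.0, 0.0]))  # uninformative rewards
print(reward_model_loss([3.0, 2.0, 1.0, 0.0]))  # rewards agree with the ranking
```

When all rewards are equal the loss is $\log 2$; it falls as the reward gaps grow in the direction of the human ranking and rises when the rewards contradict it.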

Step 3: Reinforcement Learning with PPO

In the third stage, the SFT model is optimized using the PPO (Proximal Policy Optimization) algorithm with the trained Reward Model as the reward signal.

PPO Optimization Objective:

\text{objective}(\phi) = E_{(x, y) \sim D_{\pi_\phi^{RL}}} \left[ r_\theta(x, y) - \beta \cdot D_{KL}(\pi_\phi^{RL}(y \mid x) \| \pi^{SFT}(y \mid x)) \right]

Where:

$\pi_\phi^{RL}$ : The RL policy being trained (language model)
$\pi^{SFT}$ : Reference policy from the SFT stage
$r_\theta(x, y)$ : Reward Model output
$\beta$ : KL Penalty coefficient
$D_{KL}$ : KL Divergence

Role of KL Divergence Penalty:

The KL Divergence term prevents the model from straying too far from the SFT model during RL training. Without this constraint, the model can exploit loopholes in the Reward Model to obtain high rewards while actually generating meaningless text -- a phenomenon known as Reward Hacking.

The exact form of the KL Divergence is:

D_{KL}(\pi_\phi^{RL}(\cdot \mid x) \| \pi^{SFT}(\cdot \mid x)) = \sum_y \pi_\phi^{RL}(y \mid x) \log \frac{\pi_\phi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)}

In practice, this KL Divergence is applied by directly subtracting it from the reward. That is, the modified reward is:

R(x, y) = r_\theta(x, y) - \beta \cdot \log \frac{\pi_\phi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)}
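The modified reward is straightforward to compute once per-token log-probabilities from both policies are available. In the sketch below, the sequence log-probability is the sum of the token log-probabilities, and `beta=0.02` is only a placeholder value, not a setting taken from the text.

```python
def kl_penalized_reward(rm_score, logp_rl_tokens, logp_sft_tokens, beta=0.02):
    """R(x, y) = r(x, y) - beta * log(pi_RL(y|x) / pi_SFT(y|x)).

    Each list holds per-token log-probabilities of the same sampled response y
    under the RL policy and the SFT policy. beta=0.02 is a placeholder value.
    """
    log_ratio = sum(logp_rl_tokens) - sum(logp_sft_tokens)
    return rm_score - beta * log_ratio


# The RL policy assigns this response much higher probability than the SFT model does.
print(kl_penalized_reward(1.0, [-0.5, -0.5], [-2.0, -2.0]))
```

A response that the RL policy likes far more than the SFT model does is penalized, which is what keeps the policy from drifting toward outputs that merely exploit the Reward Model.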

PPO-ptx: Pre-training Mix

InstructGPT additionally proposed the PPO-ptx variant, which mixes the language modeling objective on the original pre-training data as an auxiliary loss during RL training.

\text{objective}(\phi) = E_{(x, y) \sim D_{\pi_\phi^{RL}}} \left[ r_\theta(x, y) - \beta \cdot D_{KL}(\pi_\phi^{RL} \| \pi^{SFT}) \right] + \gamma \cdot E_{x \sim D_{\text{pretrain}}} \left[ \log \pi_\phi^{RL}(x) \right]

Here, $\gamma$ is the weight of the pre-training loss. This term mitigates the "Alignment Tax," that is, the degradation of the model's general language capabilities on public NLP benchmarks during RL training.

5.4 Remarkable Result: Small Model Beats Large Model

InstructGPT's most remarkable result is that 1.3B parameter InstructGPT was preferred over 175B parameter GPT-3 in human evaluations. A model with more than 100 times fewer parameters generated more useful, more truthful, and more harmless responses.

Key Experimental Results:

InstructGPT outputs overwhelmingly preferred over GPT-3 outputs in human evaluation
Similar or slightly lower performance compared to GPT-3 on public NLP benchmarks (Alignment Tax)
Significant improvement of PPO model over GPT-3 on TruthfulQA
Approximately 25% reduction in toxicity generation compared to GPT-3

This result showed that training methodology matters more than model size. "Making it bigger" is not the only answer -- "aligning it with human intent" is the key lesson.

5.5 From InstructGPT to ChatGPT

InstructGPT's technology became the core foundation of ChatGPT, released in November 2022. ChatGPT is a model that applied conversational RLHF to GPT-3.5 (an improved version of GPT-3).

ChatGPT's release was a turning point in AI history. Reaching 1 million users in 5 days and 100 million users in 2 months, it ushered in an era where AI directly reached the general public. Without InstructGPT's technical contributions, this revolution would have been impossible.

6. GPT-4 (2023): Multimodal and Predictable Scaling

6.1 Paper Overview

Paper: "GPT-4 Technical Report" Authors: OpenAI Released: March 2023 (arXiv: 2303.08774)

The GPT-4 Technical Report is fundamentally different from previous GPT papers. Most key information including architecture, model size, training data, and training costs is undisclosed. OpenAI cited "competitive landscape and safety considerations" as reasons for not disclosing this information. This was widely criticized for the disconnect with the "Open" in OpenAI.

Nevertheless, the paper contains several important technical contributions.

6.2 Multimodal Input

The most notable new capability of GPT-4 is that it can accept both images and text as input simultaneously. Output is still limited to text only.

Examples of Multimodal Capabilities:

Recognition and interpretation of text within images
Data analysis of charts and graphs
Description of humor images and interpretation of their humor
Interpretation of scientific diagrams and solving related problems

This multimodal capability later evolved into GPT-4V (Vision) and was applied to actual services.

6.3 Predictable Scaling

The most important technical contribution of the GPT-4 paper is the Predictable Scaling methodology.

The core idea is that the performance of a large model can be accurately predicted from the performance of small models. OpenAI measured the performance of smaller models trained with the same methodology as GPT-4, predicted GPT-4's final performance from this, and compared it with actual training results.

Loss Prediction: GPT-4's final loss was predicted by fitting a Power Law to models trained with the same methodology but at most 10,000x less compute. The actual training result was very close to the prediction.

HumanEval Coding Performance Prediction: The pass rate on a subset of the HumanEval coding benchmark could also be predicted, by extrapolating from models trained with at most 1,000x less compute. This suggests that not only loss but specific task performance is predictable.

The practical value of this Predictable Scaling methodology is immense. Before committing to large-scale model training costing tens of millions to hundreds of millions of dollars, small-scale experiments can predict the final performance to evaluate return on investment in advance.
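The basic mechanics of such a prediction can be sketched as a straight-line fit in log-log space followed by extrapolation. This is a simplified illustration and not OpenAI's procedure: the data points are synthetic, and the fit omits the irreducible-loss constant that a more faithful fit would include.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - b * log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - slope * mx)
    return a, -slope  # loss ~= a * compute ** (-b)


def predict_loss(a, b, compute):
    return a * compute ** (-b)


# Synthetic small-run measurements (made up for illustration): loss = 10 * C^-0.05.
small_compute = [1e2, 1e3, 1e4, 1e5]
small_loss = [10 * c ** -0.05 for c in small_compute]

a, b = fit_power_law(small_compute, small_loss)
big_run_prediction = predict_loss(a, b, 1e9)  # 10,000x beyond the largest small run
print(a, b, big_run_prediction)
```

With real measurements the points scatter around the line, so the quality of the extrapolation depends on how cleanly the small runs follow the trend and on how far beyond them the target lies.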

However, the paper acknowledged that phenomena such as inverse scaling and sudden emergent abilities are hard to predict. In particular, emergent abilities -- where specific capabilities suddenly appear at a certain scale -- are a major exception to Predictable Scaling.

6.4 Professional Exam Performance

GPT-4 demonstrated impressive performance on various professional exams designed for humans. The model received no specific training for these exams.

Exam	GPT-4 Score/Percentile	GPT-3.5 Score/Percentile	Note
Uniform Bar Exam (MBE+MEE+MPT)	~298/400 (top 10%)	~213/400 (bottom 10%)	US Bar Exam
LSAT	163 (top 12%)	149 (bottom 40%)	Law School Admission
SAT Evidence-Based R&W	710/800 (93rd)	670/800 (87th)	US College Admission
SAT Math	700/800 (89th)	590/800 (70th)	US College Admission
GRE Quantitative	163/170 (80th)	147/170 (25th)	Graduate Admission
GRE Verbal	169/170 (99th)	154/170 (63rd)	Graduate Admission
AP Biology	5 (85~100th)	4 (62~85th)	AP Biology
AP Chemistry	4 (71~88th)	2 (22~46th)	AP Chemistry
AP Calculus BC	4 (43~59th)	1 (0~7th)	AP Calculus
AP English Literature	2 (8~22nd)	2 (8~22nd)	AP English Literature

Notable patterns:

Dramatic performance improvement over GPT-3.5 in law, science, and mathematics (Bar Exam: bottom 10% to top 10%)
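Relatively weak performance on essay-based literature exams (AP English Literature: 8~22nd percentile), even though verbal multiple-choice scores are very high (GRE Verbal: 99th percentile)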
Mathematical reasoning improved but still not top-tier (AP Calculus BC: 43~59th percentile)

6.5 Safety and Alignment Improvements

GPT-4 was significantly improved in safety compared to GPT-3.5.

RLHF-based Safety Training:

Introduced additional safety reward signals in the training process
Used GPT-4 Zero-shot Classifier to judge safety boundaries and response styles
Applied safety rewards to both allowed/disallowed categories to prevent over-refusal of valid requests

Quantitative Improvements:

82% reduction in response rate to disallowed content requests compared to GPT-3.5
29% improvement in policy compliance for sensitive requests (medical advice, self-harm, etc.)
40% higher score on internal adversarial factuality evaluation compared to GPT-3.5
Improvement from approximately 60% to 80% on TruthfulQA after RLHF

Expert Red-teaming:

Over 50 domain experts (AI safety, cybersecurity, biological risks, international security, etc.) participated in adversarial testing
Evaluation of high-risk scenarios (autonomous replication, chemical/biological weapons information, etc.)

6.6 GPT-4's Limitations

The limitations explicitly acknowledged in the paper are:

Hallucination: Can still "confidently" generate factually incorrect information. Greatly improved by RLHF but not fully resolved.
Context Window Limitation: Limited to 8K or 32K tokens depending on the model variant, which constrains very long document processing.
Training Data Cutoff: Does not know information after the training data cutoff (trained on data up to September 2021).
Incomplete Reasoning: Can make mistakes in complex multi-step reasoning, especially in mathematical proofs and subtle code bugs.
Bias and Calibration: Social biases have not been fully removed, and the model's confidence does not necessarily match actual accuracy.

7. In-depth Analysis of Scaling Laws

7.1 Kaplan Scaling Laws (2020)

"Scaling Laws for Neural Language Models" published by Jared Kaplan and others at OpenAI contemporaneously with GPT-3 provided the theoretical foundation for large language model research.

Key Finding -- Power Law Relationships:

The cross-entropy loss $L$ of a language model has a Power Law relationship with the number of model parameters $N$ , dataset size $D$ , and compute $C$ used for training.

L(N) \propto N^{-\alpha_N}, \quad \alpha_N \approx 0.076

L(D) \propto D^{-\alpha_D}, \quad \alpha_D \approx 0.095

L(C) \propto C^{-\alpha_C}, \quad \alpha_C \approx 0.050

These relationships hold over more than 7 orders of magnitude and show very stable trend lines.

Compute-optimal Allocation (Kaplan Version):

To minimize loss with a fixed compute budget $C$ , the conclusion was that it is optimal to increase model size while using relatively less data. Specifically, when compute increases 10x, it is most efficient to increase model size by 5.5x and data by only 1.8x.

N_{\text{opt}} \propto C^{0.73}, \quad D_{\text{opt}} \propto C^{0.27}

This result led to the interpretation that "increasing model size is more efficient than increasing data," and served as justification for GPT-3's 175B parameter scale.
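The 10x example follows directly from the two exponents. The sketch below evaluates it; the function name is hypothetical, and because the exponents are rounded, the result (about 5.4x and 1.9x) differs slightly from the commonly quoted 5.5x and 1.8x.

```python
def kaplan_allocation(compute_multiplier, n_exp=0.73, d_exp=0.27):
    """Growth factors for model size and data when compute grows by compute_multiplier."""
    return compute_multiplier ** n_exp, compute_multiplier ** d_exp


model_growth, data_growth = kaplan_allocation(10)
print(f"model x{model_growth:.2f}, data x{data_growth:.2f}")
```

Since the exponents sum to 1, the two growth factors always multiply back to the compute multiplier, which reflects the approximation that compute scales with the product of parameters and tokens.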

7.2 Chinchilla Scaling Laws (2022)

An important correction to Kaplan's Scaling Laws was presented in DeepMind's 2022 "Training Compute-Optimal Large Language Models" (known as the Chinchilla paper).

Key Finding: Existing models are under-trained.

Unlike Kaplan's conclusion, the Chinchilla paper argued that model size and training data should be increased at nearly equal rates. Specifically, approximately 20 training tokens per parameter is compute-optimal.

N_{\text{opt}} \propto C^{0.50}, \quad D_{\text{opt}} \propto C^{0.50}

By this criterion, GPT-3 (175B parameters, 300B tokens) was data-starved. Compute-optimal training would have required approximately 3.5T (3.5 trillion) tokens.

Chinchilla vs. GPT-3:

Item	GPT-3	Chinchilla
Parameters	175B	70B
Training Tokens	300B	1.4T
Token/Parameter Ratio	1.7	20
MMLU Performance (5-shot)	43.9%	67.5%
Compute	~3,640 PF-days	~6,700 PF-days (5.76e23 FLOPs)

Chinchilla is a 2.5x smaller model than GPT-3 but achieved higher performance by training on 4.7x more data. This result fundamentally influenced the direction of subsequent large-scale model training.
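The ratios in this comparison can be reproduced with a few lines of arithmetic. The sketch below also estimates training compute using the common approximation $C \approx 6ND$, which is an assumption introduced here for illustration and yields only rough figures. All function names are hypothetical.

```python
PF_DAY_FLOPS = 1e15 * 86_400  # one petaflop/s sustained for one day


def tokens_per_param(n_params, n_tokens):
    return n_tokens / n_params


def chinchilla_optimal_tokens(n_params, ratio=20):
    """Rule-of-thumb token budget: about 20 training tokens per parameter."""
    return ratio * n_params


def train_compute_pf_days(n_params, n_tokens):
    """Approximate training compute with the C ~= 6 * N * D estimate."""
    return 6 * n_params * n_tokens / PF_DAY_FLOPS


gpt3 = (175e9, 300e9)        # (parameters, training tokens)
chinchilla = (70e9, 1.4e12)

print(tokens_per_param(*gpt3), tokens_per_param(*chinchilla))
print(chinchilla_optimal_tokens(gpt3[0]))  # tokens GPT-3 would have needed
print(train_compute_pf_days(*gpt3), train_compute_pf_days(*chinchilla))
```

The calculation makes the under-training argument concrete: at 20 tokens per parameter, a 175B model calls for about 3.5T tokens, more than ten times what GPT-3 actually saw.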

7.3 Impact of Scaling Laws on GPT-4

GPT-4's Predictable Scaling is a direct application of this Scaling Laws research. If the loss of small models follows a Power Law, then the trend line can be extrapolated to predict the loss of large models.

What the GPT-4 paper showed is that this prediction is surprisingly accurate. This suggests that Scaling Laws are not merely empirical observations but reflect deep structural properties of the language model training process.

However, there are important limitations to this predictability:

Loss is not equal to Capability: Reduction in overall loss may not directly translate to improvement in specific abilities
Emergent Abilities: Abilities that suddenly appear at a certain scale are difficult to predict with Power Laws
Inverse Scaling: In some tasks, performance decreases as the model grows larger
Task-specific Variability: Scaling efficiency varies significantly across tasks

8. Overall Architecture Comparison

8.1 Generation-by-Generation Architecture Comparison Table

Item	GPT-1	GPT-2 (XL)	GPT-3 (175B)	InstructGPT	GPT-4
Release Date	2018.06	2019.02	2020.05	2022.03	2023.03
Parameters	117M	1,542M	175,000M	1,300M~175,000M	Undisclosed
Layers	12	48	96	96 (175B basis)	Undisclosed
Hidden Dim	768	1,600	12,288	12,288 (175B basis)	Undisclosed
Attention Heads	12	25	96	96 (175B basis)	Undisclosed
Head Dimension	64	64	128	128 (175B basis)	Undisclosed
Context Window	512	1,024	2,048	2,048	8,192 / 32,768
Vocabulary Size	40,000	50,257	50,257	50,257	~100,000 (est.)
Training Data	BooksCorpus (5GB)	WebText (40GB)	Mixed (570GB)	GPT-3 + Human Feedback	Undisclosed
Training Tokens	~1B (est.)	~10B (est.)	300B	300B + RLHF	Undisclosed
Tokenization	BPE (40K merges)	Byte-level BPE	Byte-level BPE	Byte-level BPE	Undisclosed
Positional Enc.	Learned	Learned	Learned	Learned	Undisclosed
Activation	GELU	GELU	GELU	GELU	Undisclosed
LayerNorm	Post-norm	Pre-norm	Pre-norm	Pre-norm	Undisclosed
Training Method	LM + Fine-tuning	LM only	LM only	LM + SFT + RLHF	LM + SFT + RLHF
Multimodal	No	No	No	No	Yes (Image Input)
Sparse Attention	No	No	Yes (partial)	Yes (partial)	Undisclosed
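Several rows of the table can be sanity-checked against each other. The sketch below derives the head dimension from hidden size and head count, and estimates parameter counts with the common approximation of about 12 · layers · d_model² for the Transformer blocks plus the embedding matrices. The approximation ignores biases and LayerNorm weights, so the results are only expected to land near the published figures.

```python
# Consistency check for the public configurations in the table above.
CONFIGS = {
    "GPT-1": dict(layers=12, d_model=768, heads=12, vocab=40_000, context=512),
    "GPT-2 XL": dict(layers=48, d_model=1_600, heads=25, vocab=50_257, context=1_024),
    "GPT-3 175B": dict(layers=96, d_model=12_288, heads=96, vocab=50_257, context=2_048),
}


def head_dim(cfg):
    """Per-head dimension: hidden size divided evenly across attention heads."""
    return cfg["d_model"] // cfg["heads"]


def approx_params(cfg):
    """Approximate parameter count: ~12 * L * d^2 for blocks, plus embeddings."""
    blocks = 12 * cfg["layers"] * cfg["d_model"] ** 2
    embeddings = (cfg["vocab"] + cfg["context"]) * cfg["d_model"]
    return blocks + embeddings


for name, cfg in CONFIGS.items():
    print(f"{name}: head_dim={head_dim(cfg)}, ~{approx_params(cfg) / 1e6:,.0f}M params")
```

The estimates come out within a few percent of 117M, 1,542M, and 175,000M, which shows that nearly all parameters sit in the attention and feed-forward weights rather than the embeddings once models grow large.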

8.2 Evolution of Paradigms

More important than the architecture itself is the evolution of paradigms.

GPT-1: Pre-train -> Fine-tune (fine-tuning required for each task)
         |
GPT-2: Pre-train -> Zero-shot (direct use without fine-tuning)
         |
GPT-3: Pre-train -> In-context Learning (task performance with examples only)
         |
InstructGPT: Pre-train -> SFT -> RLHF (alignment with human feedback)
         |
GPT-4: Pre-train -> SFT -> RLHF + Multimodal (multimodal + enhanced safety)

The consistent direction of this evolution is reducing user intervention. GPT-1 required training data and fine-tuning for each task, but by GPT-4, nearly all tasks can be performed with natural language instructions alone.
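The difference between the zero-shot and in-context learning paradigms is visible in the prompt alone. The helper below is a hypothetical sketch (the `Input:`/`Output:` layout is an assumption, not a format prescribed by any of the papers); the translation pairs follow the style of the example in the GPT-3 paper.

```python
def build_prompt(instruction, query, examples=()):
    """Build a prompt; with no examples it is zero-shot, otherwise few-shot."""
    parts = [instruction]
    for source, target in examples:
        parts.append(f"Input: {source}\nOutput: {target}")
    parts.append(f"Input: {query}\nOutput:")  # the model continues from here
    return "\n\n".join(parts)


instruction = "Translate English to French."

# GPT-2 style: task description only, no demonstrations.
zero_shot = build_prompt(instruction, "cheese")

# GPT-3 style: a few demonstrations placed in the context window, no weight updates.
few_shot = build_prompt(
    instruction,
    "cheese",
    examples=[("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
)

print(few_shot)
```

In both cases the model weights stay fixed. Only the text in the context window changes, which is what distinguishes these paradigms from GPT-1's per-task fine-tuning.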

9. GPT's Impact: Transformation of the AI Ecosystem

9.1 ChatGPT and AI Democratization

The most direct impact of the GPT series is the democratization of AI through ChatGPT.

ChatGPT Growth Metrics:

Released November 30, 2022
1 million users in 5 days
100 million users in 2 months (fastest record ever, surpassing TikTok's 9 months)
Over 300 million weekly active users by end of 2024 (as reported by OpenAI)

ChatGPT transformed the concept of "AI" from an exclusive domain of researchers and developers to an everyday tool for the general public. This transformation owed much to the RLHF techniques introduced with InstructGPT, which made the model's responses usable by non-experts.

9.2 API Economy and AI-native Services

GPT-3's API release (June 2020) marked the beginning of the AI API Economy.

New Business Models:

Wrapper Services: Building specialized UX on top of the GPT API (Jasper, Copy.ai, etc.)
Vertical AI: AI solutions optimized for specific domains (Harvey for Law, Hippocratic AI for Healthcare)
AI-augmented SaaS: Integrating AI features into existing SaaS (Notion AI, GitHub Copilot, etc.)
Agent Frameworks: Autonomous agents using GPT as the core reasoning engine (AutoGPT, LangChain, etc.)

9.3 Academic Impact

The GPT series also had a fundamental impact on the direction of academic research.

Birth of New Research Fields:

Prompt Engineering: Research on prompt design to maximize the effectiveness of in-context learning
Alignment Research: Various alignment techniques beyond RLHF (DPO, ORPO, Constitutional AI, etc.)
Mechanistic Interpretability: Research to understand the internal workings of large models
Scaling Laws: Quantitative analysis of the relationship between model performance and resources
Evaluation: Recognizing the limitations of existing benchmarks and developing new evaluation methodologies

Changes in Research Methodology:

Shift in research focus from "model architecture innovation" to "data, training methods, alignment"
Growing gap between academic and industrial research due to increasing compute requirements
Partial recovery of academic accessibility through open-source models (LLaMA, Mistral, etc.)

9.4 Impact on Industry and Society

Education: AI tutors, automated grading, personalized learning content generation
Healthcare: Medical document writing assistance, diagnostic support, drug interaction analysis
Law: Case search, contract analysis, legal advice drafting
Software Development: Code generation, debugging, documentation (GitHub Copilot)
Content Creation: Writing assistance, translation, summarization, idea generation

10. Limitations and Criticisms

10.1 Hallucination

The most serious limitation of the GPT series is the Hallucination problem -- confidently generating information that is factually incorrect.

Types of Hallucination:

Factual Errors: Non-existent citations, incorrect statistics, fabricated historical facts
Logical Leaps: Jumping from premises to conclusions without valid reasoning
Self-contradiction: Making contradictory claims within the same conversation

Root Causes:

Auto-regressive models simply generate "plausible next tokens" without verifying factual accuracy
Training data contains errors, and the model cannot distinguish them
RLHF may encourage confident errors by rewarding "speaking confidently"
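The first two causes can be illustrated with a deliberately tiny toy model. The bigram counter below is nothing like a GPT model in scale or architecture; it only shows that a next-token predictor trained on frequency reproduces whatever continuation is most common in its data, whether or not that continuation is true. The corpus is invented for the example.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def greedy_next(counts, prev):
    """Pick the most frequent continuation; no notion of truth is involved."""
    return counts[prev].most_common(1)[0][0]


# Toy training data in which a common misconception outnumbers the correct fact.
corpus = [
    "the capital of australia is sydney",
    "the capital of australia is sydney",
    "the capital of australia is canberra",
]

counts = train_bigram(corpus)
prediction = greedy_next(counts, "is")
print(f"the capital of australia is {prediction}")
```

The model outputs the frequent but incorrect answer because its objective rewards plausibility under the training distribution, not factual accuracy.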

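OpenAI reported that GPT-4 scores about 40% higher than GPT-3.5 on its internal adversarial factuality evaluations (a relative improvement on OpenAI's own benchmark, not a measured 40% drop in hallucination rate), but complete resolution remains elusive. Hallucination is still one of the most active areas of LLM research.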

10.2 Bias

Large language models reflect and sometimes amplify social biases inherent in their training data.

Types of Bias:

Gender Bias: Reflection of stereotypes in occupations, personality traits, etc.
Racial/Ethnic Bias: Negative associations with specific races
Cultural Bias: English-speaking, particularly US-centric worldview
Socioeconomic Bias: Overrepresentation of certain class perspectives

The GPT-3 paper explicitly acknowledged this and included bias analysis related to Gender, Race, and Religion. InstructGPT and GPT-4 attempted to reduce bias through RLHF, but completely eliminating bias inherent in training data remains a fundamentally challenging problem.

10.3 Environmental Cost

The environmental cost of large-scale model training is becoming an increasingly significant concern.

Estimated Training Carbon Emissions:

GPT-3: Approximately 552 tons CO2e (equivalent to the annual emissions of about 120 average US cars)
GPT-4: Estimated at approximately 15,000 tons CO2e (unofficial estimate, about 27x GPT-3)
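The equivalences quoted above follow from simple division. The sketch assumes about 4.6 metric tons of CO2 per year for a typical US passenger vehicle (a commonly cited EPA figure, used here as an assumption) and takes the GPT-4 number at face value even though it is unofficial.

```python
# Back-of-the-envelope check of the emission comparisons in the text.
GPT3_TCO2E = 552.0                # estimate for GPT-3 training (tons CO2e)
GPT4_TCO2E_UNOFFICIAL = 15_000.0  # unofficial estimate quoted in the text
CAR_TCO2E_PER_YEAR = 4.6          # assumption: typical US passenger vehicle per year

car_equivalents = GPT3_TCO2E / CAR_TCO2E_PER_YEAR
gpt4_vs_gpt3 = GPT4_TCO2E_UNOFFICIAL / GPT3_TCO2E

print(f"GPT-3 training ~ {car_equivalents:.0f} car-years of emissions")
print(f"GPT-4 estimate ~ {gpt4_vs_gpt3:.0f}x GPT-3")
```

Both ratios depend entirely on the input estimates, so they should be read as rough orders of magnitude rather than measurements.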

Water Consumption:

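An academic estimate from 2023 put the freshwater directly consumed for data center cooling during GPT-3 training in Microsoft's US data centers at approximately 700,000 liters (this is a researcher estimate, not a figure disclosed by Microsoft)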

Criticism and Counterarguments:

While the cost of a single training run is large, the trained model is used by hundreds of millions, so the per-person cost is negligible
Model efficiency improvements (Distillation, Quantization, Pruning) and hardware advances are reducing costs
However, concerns about Jevons Paradox (where efficiency improvements actually increase total consumption) also exist

10.4 Transparency and Reproducibility

One of the most persistent criticisms of the GPT series is lack of transparency.

GPT-1: Paper, code, and model released (relatively open)
GPT-2: Paper released, model released in stages ("too dangerous" controversy)
GPT-3: Paper released, model accessible only via API
GPT-4: Architecture, data, training cost, and other key information undisclosed

This trend has deepened the disconnect with the organization's name "Open" AI and seriously undermined academic reproducibility. In response, the importance of open models such as Meta's LLaMA and Mistral AI's Mistral/Mixtral has become more prominent.

10.5 Economic Inequality and Compute Divide

The concentration of resources needed for large-scale model training exacerbates economic inequality in AI research.

GPT-3 training cost: Approximately $4.6 million (estimated)
GPT-4 training cost: Over approximately $100 million (estimated)
Investments of this scale are possible only for a few large corporations, structurally excluding universities and small research labs

11. Summary: The Legacy of GPT

The key insights running through the five papers of the GPT series can be summarized as follows:

1. Scale is (almost) all you need

The scaling from GPT-1 (117M) to GPT-2 (1.5B) to GPT-3 (175B) was not simply "the same thing but bigger" -- it led to qualitatively new capabilities. Zero-shot transfer, in-context learning, and complex reasoning are Emergent Abilities that appeared only at sufficient scale.

2. Alignment changes everything

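InstructGPT showed that training methodology can matter more than model size. In human preference evaluations, labelers preferred outputs from the 1.3B InstructGPT over those from the 175B GPT-3, even though it has over 100x fewer parameters. This demonstrated that there is a large gap between raw capability and usefulness, and that RLHF can bridge that gap.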

3. The bitter lesson revisited

Rich Sutton's "The Bitter Lesson" -- general methods + more compute beat specialized methods -- was repeatedly confirmed in the GPT series. General-purpose Transformer + large-scale pre-training was overwhelmingly more effective than task-specific architectures.

4. Data is the new bottleneck

After Chinchilla's lesson, the quantity and quality of training data emerged as a key bottleneck alongside model size. High-quality text on the internet is finite, and Synthetic Data generation is emerging as a new research direction.

5. Safety is not optional

From GPT-2's "too dangerous to release" controversy to GPT-4's red-teaming, safety has become mandatory, not optional. As AI models become more powerful, the importance of safe and responsible development grows proportionally.

The GPT series is not yet over. What capabilities GPT-5 and beyond will show remains unknown, but one thing is certain: the paradigm of "large-scale pre-training + human feedback alignment" established by the GPT series has become the foundation of modern AI, and understanding it is essential for understanding the future of AI.

12. References

GPT-1: Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). "Improving Language Understanding by Generative Pre-Training." OpenAI Paper
GPT-2: Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). "Language Models are Unsupervised Multitask Learners." OpenAI Paper
GPT-3: Brown, T. B., Mann, B., Ryder, N., et al. (2020). "Language Models are Few-Shot Learners." NeurIPS 2020. arXiv:2005.14165
InstructGPT: Ouyang, L., Wu, J., Jiang, X., et al. (2022). "Training Language Models to Follow Instructions with Human Feedback." NeurIPS 2022. arXiv:2203.02155
GPT-4: OpenAI. (2023). "GPT-4 Technical Report." arXiv:2303.08774
Scaling Laws: Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361
Chinchilla: Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). "Training Compute-Optimal Large Language Models." arXiv:2203.15556
Transformer: Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need." NeurIPS 2017. arXiv:1706.03762
PPO: Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). "Proximal Policy Optimization Algorithms." arXiv:1707.06347
RLHF: Christiano, P. F., Leike, J., Brown, T., et al. (2017). "Deep Reinforcement Learning from Human Preferences." NeurIPS 2017. arXiv:1706.03741
Sparse Transformer: Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). "Generating Long Sequences with Sparse Transformers." arXiv:1904.10509
BPE: Sennrich, R., Haddow, B., & Birch, A. (2016). "Neural Machine Translation of Rare Words with Subword Units." ACL 2016. arXiv:1508.07909
Carbon Footprint: Patterson, D., Gonzalez, J., Le, Q., et al. (2021). "Carbon Emissions and Large Neural Network Training." arXiv:2104.10350

Related Series & Recommended Posts

Build Your Own GPT -- Training from Scratch with nanoGPT -- Code a GPT yourself
Complete Math Guide for AI -- Math needed to understand Transformers
Attention Is All You Need Analysis -- The original Transformer paper
BERT Analysis -- The Encoder model rivaling GPT
RWKV: Reinventing RNNs -- Alternative architecture to Transformers
vLLM Inference Optimization -- Serving GPT models
LLM Quantization GPTQ/AWQ/GGUF -- Making large models lightweight

GitHub

nanoGPT -- Andrej Karpathy
ai-model-analysis -- Code-level analysis collection from this blog