Complete Analysis of the GPT Series Papers: The Journey from GPT-1 to GPT-4, How Language Models Changed the World


1. GPT Series Overview and Timeline

GPT (Generative Pre-trained Transformer) is a series of Large Language Models (LLMs) published by OpenAI since 2018. True to its name "Generative Pre-trained Transformer," it established the paradigm of performing unsupervised pre-training on large-scale text data based on the Transformer Decoder architecture, then applying it to various downstream tasks.

The GPT series did not simply grow in model size -- each generation redefined how language models are utilized. The journey in chronological order is as follows:

| Generation | Release | Paper Title | Key Keywords | Parameters |
| --- | --- | --- | --- | --- |
| GPT-1 | 2018.06 | Improving Language Understanding by Generative Pre-Training | Unsupervised Pre-training + Supervised Fine-tuning | 117M |
| GPT-2 | 2019.02 | Language Models are Unsupervised Multitask Learners | Zero-shot Transfer, WebText | 1.5B |
| GPT-3 | 2020.05 | Language Models are Few-Shot Learners | In-context Learning, Scaling Laws | 175B |
| InstructGPT | 2022.03 | Training Language Models to Follow Instructions with Human Feedback | RLHF, Human Alignment | 1.3B~175B |
| GPT-4 | 2023.03 | GPT-4 Technical Report | Multimodal, Predictable Scaling | Undisclosed |

It is noteworthy that each generation's paper title carries its core message. GPT-1 declared "improving language understanding through generative pre-training," GPT-2 claimed "language models are unsupervised multitask learners," and GPT-3 went a step further with "language models are few-shot learners." InstructGPT presented a practical direction of "training to follow instructions with human feedback," and GPT-4 was simply published as a "technical report," hinting at its commercial transition.

In this article, we analyze each paper's key contributions, architecture details, training methodology, and impact on subsequent research, together with equations.


2. GPT-1 (2018): The Beginning of Generative Pre-Training

2.1 Paper Overview

Paper: "Improving Language Understanding by Generative Pre-Training"
Authors: Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever (OpenAI)
Released: June 2018

The core idea of GPT-1 is surprisingly simple. Pre-train a language model on large-scale unlabeled text, then fine-tune it on a specific task with a small amount of labeled data. This two-stage approach (Semi-supervised Learning) transformed the NLP landscape at the time.

In 2018, NLP was dominated by task-specific architectures. The standard practice was to design a separate model for each task (sentiment analysis, question answering, textual entailment) and train it only on that task's labeled data. To this paradigm, GPT-1 proposed a new path: general-purpose pre-training.

2.2 Architecture Details

GPT-1 adopted an architecture using only the Decoder blocks of the Transformer. While the original Transformer (Vaswani et al., 2017) had an Encoder-Decoder structure, GPT-1 chose a Decoder-only structure suitable for auto-regressive language modeling.

Model Configuration:

  • Number of Layers: 12 Transformer Decoder blocks
  • Hidden Dimension: 768
  • Number of Attention Heads: 12 (64 dimensions each)
  • Feed-Forward Dimension: 3,072 ($768 \times 4$)
  • Context Window: 512 tokens
  • Total Parameters: Approximately 117M (117 million)
  • Activation Function: GELU (Gaussian Error Linear Unit)
  • Positional Encoding: Learned Positional Embedding

Instead of the fixed Sinusoidal Positional Encoding used in the original Transformer, GPT-1 adopted learned positional embeddings. This allowed the model to learn positional information directly from data, enabling more flexible adaptation to various tasks.

2.3 Stage 1: Unsupervised Pre-training

In the pre-training stage, the standard language modeling objective is optimized over a large-scale unlabeled text corpus $\mathcal{U} = \{u_1, u_2, ..., u_n\}$.

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, ..., u_{i-1}; \Theta)$$

Here, $k$ is the context window size and $\Theta$ represents the model parameters. This is a typical auto-regressive language modeling objective that maximizes the probability of the next token given the previous $k$ tokens.

Specifically, each token's representation is computed as follows:

$$h_0 = U W_e + W_p$$
$$h_l = \text{transformer\_block}(h_{l-1}), \quad l \in [1, n]$$
$$P(u) = \text{softmax}(h_n W_e^T)$$

Here, $U = (u_{-k}, ..., u_{-1})$ is the context token vector, $W_e$ is the token embedding matrix, and $W_p$ is the position embedding matrix. Output probabilities are computed by reusing the token embedding matrix $W_e$ (Weight Tying).
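The computation above can be sketched in a few lines of NumPy. This is a minimal illustration with made-up toy dimensions and a stubbed transformer block, not the paper's actual 12-layer, 117M-parameter implementation; it only shows the embedding lookup, the learned positional addition, and the weight-tied softmax output.

```python
import numpy as np

# Toy sketch of the GPT-1 forward pass (hypothetical sizes; the real model
# uses 12 masked self-attention + feed-forward blocks over 768 dimensions).
rng = np.random.default_rng(0)
vocab_size, d_model, n_ctx, n_layers = 100, 16, 8, 12

W_e = rng.normal(0.0, 0.02, (vocab_size, d_model))  # token embedding matrix
W_p = rng.normal(0.0, 0.02, (n_ctx, d_model))       # learned positional embeddings

def transformer_block(h):
    return h  # placeholder for masked self-attention + feed-forward

def next_token_probs(context_ids):
    # h_0 = U W_e + W_p: embedding lookup plus positional embeddings
    h = W_e[context_ids] + W_p[: len(context_ids)]
    for _ in range(n_layers):            # h_l = transformer_block(h_{l-1})
        h = transformer_block(h)
    logits = h[-1] @ W_e.T               # weight tying: reuse W_e for the output
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()
```

Calling `next_token_probs(np.arange(8))` returns a probability distribution over the toy vocabulary for the next token.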

Training Data: The BooksCorpus dataset was used, consisting of approximately 7,000 unpublished books containing about 5GB of text. The abundance of long-form text made it suitable for learning long-range dependencies.

Tokenization: BPE (Byte Pair Encoding) was used with 40,000 merges to construct the vocabulary.

Optimization: The Adam optimizer was used with a learning rate that increased linearly from 0 to $2.5 \times 10^{-4}$ over the first 2,000 updates (linear warmup), then decayed following a cosine schedule. The batch size was 64, and training ran for 100 epochs.
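The schedule can be sketched as follows. `total_steps` is a hypothetical value chosen for illustration; the paper specifies the warmup length and the schedule shape, not this exact step count.

```python
import math

def gpt1_lr(step, max_lr=2.5e-4, warmup_steps=2000, total_steps=100_000):
    """Linear warmup from 0 to max_lr, then cosine annealing down to 0."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps          # linear warmup phase
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
```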

2.4 Stage 2: Supervised Fine-tuning

To apply the pre-trained model to a specific task, it is fine-tuned with labeled data $\mathcal{C}$. Given an input token sequence $x_1, ..., x_m$ with corresponding label $y$, the following objective is optimized:

$$L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y \mid x_1, ..., x_m)$$

Here, $P(y \mid x_1, ..., x_m) = \text{softmax}(h_l^m W_y)$, where $h_l^m$ is the final Transformer block's output at the last token, and $W_y$ is the weight of the task-specific linear head.

Key Technique -- Auxiliary Language Modeling Objective: GPT-1 also used the original language modeling objective as an auxiliary loss during fine-tuning. This had the effect of improving generalization performance and accelerating convergence.

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$

Here, $\lambda$ is the weight of the auxiliary loss; the paper used $\lambda = 0.5$.
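The combined objective reduces to a weighted sum of two negative log-likelihoods. A minimal sketch, with a hypothetical helper name and inputs given as per-example log-probabilities:

```python
def fine_tune_loss(task_log_probs, lm_log_probs, lam=0.5):
    """Combined objective L3 = L2 + λ·L1, written as losses (lower is better).

    task_log_probs: log P(y | x_1..x_m) for each labeled example (L2 term)
    lm_log_probs:   next-token log-probs on the same text (auxiliary L1 term)
    """
    L2 = -sum(task_log_probs)   # supervised task loss
    L1 = -sum(lm_log_probs)     # auxiliary language modeling loss
    return L2 + lam * L1
```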

2.5 Task-specific Input Transformation

Another important contribution of GPT-1 was presenting input transformation techniques to handle various tasks with a single Transformer architecture. Without changing the architecture itself, it adapted to multiple tasks by only changing the input format.

  • Text Classification: Input as [Start] text [Extract] and apply a Linear Layer to the last token's output
  • Textual Entailment: Connect two sentences as [Start] premise [Delimiter] hypothesis [Extract]
  • Semantic Similarity: Reverse the order of two sentences to create two inputs, and element-wise add their outputs
  • Multiple Choice: Individually concatenate each choice with context to create multiple sequences, and normalize with Softmax

This approach was very practical in that it could be applied to various tasks with minimal changes to the model architecture. The only additional parameters were the delimiter token embeddings and the final linear layer weights $W_y$.
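These transformations can be sketched as plain string templates. The token spellings below are hypothetical placeholders; in the paper the start, delimiter, and extract tokens are dedicated learned embeddings, not literal strings.

```python
# Hypothetical literal spellings for the learned special tokens.
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"

def classification_input(text):
    return f"{START} {text} {EXTRACT}"

def entailment_input(premise, hypothesis):
    return f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"

def similarity_inputs(a, b):
    # Both orderings are encoded; their final representations are added element-wise.
    return [entailment_input(a, b), entailment_input(b, a)]

def multiple_choice_inputs(context, choices):
    # One sequence per choice; per-choice scores are then normalized with softmax.
    return [f"{START} {context} {DELIM} {c} {EXTRACT}" for c in choices]
```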

2.6 Experimental Results and Significance

GPT-1 achieved state-of-the-art results on 9 out of 12 NLP benchmarks. In particular, it significantly outperformed existing models in Commonsense Reasoning (86.5% accuracy on Stories Cloze Test), Semantic Similarity (70.3 F1 on QQP), and Question Answering (59.0% accuracy on RACE).

However, the true significance of GPT-1 lies not in individual benchmark performance but in establishing the paradigm of "large-scale unsupervised pre-training + small-scale supervised fine-tuning." This paradigm continued with BERT, RoBERTa, T5, and others, becoming the standard in NLP.


3. GPT-2 (2019): The Possibility of Zero-shot Learning

3.1 Paper Overview

Paper: "Language Models are Unsupervised Multitask Learners"
Authors: Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever (OpenAI)
Released: February 2019

The paper title of GPT-2 carries a bold claim: "Language models are unsupervised multitask learners." That is, despite being trained with a single objective of language modeling, the model can perform multiple tasks without separate fine-tuning.

While GPT-1 required two stages of "pre-training then fine-tuning," GPT-2 demonstrated that tasks can be performed zero-shot without fine-tuning. This was a fundamental paradigm shift.

3.2 Core Idea: Task as Language Modeling

The core insight of GPT-2 is that all NLP tasks can be reformulated as conditional language modeling.

Traditional supervised learning learns the conditional probability $P(\text{output} \mid \text{input})$. GPT-2 extends this to the form $P(\text{output} \mid \text{input}, \text{task})$, providing task information in natural language.

For example:

  • Translation: Expressing sequences of the form (translate to french, english text, french text) as natural text
  • Summarization: Appending TL;DR: after text to elicit a summary
  • Question Answering: Providing context and questions in natural language to generate answers

The key to this idea is that if a sufficiently large language model learns sufficiently diverse text, task performance capabilities naturally emerge.

3.3 Architecture Details

GPT-2 is based on the GPT-1 architecture with several important modifications.

Key Changes:

  • Layer Normalization Position Change: Moved to the input side of each sub-block (Pre-norm)
  • Additional Layer Normalization: Added after the final Self-attention block
  • Residual Weight Initialization: Residual path weights scaled by $1/\sqrt{N}$ ($N$ is the number of residual layers)
  • Context Window Expansion: 512 to 1,024 tokens
  • Vocabulary Size Expansion: 40,000 to 50,257 (Byte-level BPE)
  • Batch Size Expansion: 64 to 512

GPT-2 trained four model sizes:

| Model | Parameters | Layers | Hidden Dim | Heads | Head Dim |
| --- | --- | --- | --- | --- | --- |
| Small | 117M | 12 | 768 | 12 | 64 |
| Medium | 345M | 24 | 1,024 | 16 | 64 |
| Large | 762M | 36 | 1,280 | 20 | 64 |
| XL | 1,542M | 48 | 1,600 | 25 | 64 |

The head dimension is fixed at 64 across all models, and the feed-forward dimension is always 4 times the hidden dimension ($d_{ff} = 4 \times d_{model}$).

3.4 WebText Dataset

Another key contribution of GPT-2 is the WebText training dataset.

Data Construction Method:

  1. Collected external links with 3+ Karma on Reddit (effectively human-vetted quality)
  2. Collected approximately 45 million links
  3. Extracted text from HTML using Dragnet and Newspaper libraries
  4. Deduplication and heuristic-based cleaning

Dataset Characteristics:

  • Approximately 8 million documents
  • Approximately 40GB of text
  • Wikipedia was intentionally excluded (to prevent data leakage with evaluation datasets)

The design philosophy of WebText was "leverage human curation while avoiding explicit labeling costs." The idea of using Reddit's Karma system as a quality filter inspired many subsequent dataset constructions.

3.5 Byte-level BPE

GPT-2 also introduced an important innovation in tokenization. While existing BPE operates at the Unicode character level, GPT-2 applied BPE at the byte level.

Advantages of this approach:

  • Complete Coverage: Since any byte sequence can be encoded, the OOV (Out-of-Vocabulary) problem is fundamentally eliminated
  • Multilingual Support: Various languages and special characters can be processed without separate preprocessing
  • Base Vocabulary Size: 256 (number of bytes) + special tokens

However, since naive byte-level BPE generates many inefficient merges, GPT-2 added rules to prevent merging characters of different categories. The final vocabulary size is 50,257.
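The coverage guarantee is easy to see in code. The sketch below shows only the byte-level base vocabulary; the learned BPE merges that build up the full 50,257-entry vocabulary are omitted.

```python
def to_byte_tokens(text):
    """Base byte-level vocabulary: any string decomposes into 0-255 byte IDs,
    so no input is ever out-of-vocabulary. BPE merges (omitted here) then
    combine frequent byte sequences into larger vocabulary entries."""
    return list(text.encode("utf-8"))
```

`to_byte_tokens("café")` yields five IDs because 'é' occupies two UTF-8 bytes; even emoji and CJK text stay within the 256-ID base range.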

3.6 Zero-shot Performance and Scaling

GPT-2's zero-shot performance consistently improved with model size. This was a precursor to the later Scaling Laws research.

Key Zero-shot Results:

  • Language Modeling: State-of-the-art on 7 out of 8 Language Modeling benchmarks (including domains not in WebText training)
  • Children's Book Test (Named Entity): 93.3% accuracy (+7% over previous SOTA)
  • LAMBADA: Perplexity 8.6 (drastically improved from previous SOTA of 99.8)
  • Reading Comprehension (CoQA): 55.0 F1 (surpassing 3 out of 4 existing models trained with 127,000 examples)
  • Translation (WMT14 Fr-En): 11.5 BLEU zero-shot (outperforming several unsupervised translation baselines)
  • Summarization (CNN/Daily Mail): Elicited with TL;DR prompt, qualitatively meaningful results

3.7 "Too Dangerous to Release" Controversy

GPT-2 received as much attention for its release policy as for its technical achievements. OpenAI initially decided not to release the 1.5B parameter model, releasing only the smallest 117M model. The reason was "the risk of malicious use (fake news, spam, etc.) is significant."

This decision sparked intense debate in the AI community.

Supporting Arguments:

  • Unrestricted release of powerful text generation models could be exploited for mass production of disinformation
  • A precedent for Responsible Disclosure considering societal impact was needed

Critical Arguments:

  • The danger of the 1.5B parameter model was exaggerated
  • It hinders reproducibility in the academic community
  • Suspicions of marketing-driven exaggeration

Eventually, OpenAI released the full model in November 2019, and the feared large-scale misuse did not materialize. However, this debate became an important catalyst for subsequent AI Safety and Responsible AI discussions.


4. GPT-3 (2020): The Power of In-context Learning and Scaling

4.1 Paper Overview

Paper: "Language Models are Few-Shot Learners"
Authors: Tom B. Brown, Benjamin Mann, Nick Ryder, and many others (OpenAI)
Released: May 2020 (NeurIPS 2020)

GPT-3 is a language model of unprecedented scale with 175 billion (175B) parameters. However, the true innovation of GPT-3 is not its size but establishing the new paradigm of In-context Learning. It proved that various tasks can be performed without updating the model weights at all, simply by including a few examples in the prompt.

4.2 In-context Learning Paradigm

The GPT-3 paper systematically compared three evaluation conditions.

Zero-shot: Only a task description provided in natural language

Translate English to French:
cheese =>

One-shot: Task description + 1 example provided

Translate English to French:
sea otter => loutre de mer
cheese =>

Few-shot: Task description + 10-100 examples provided (within the context window limit)

Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe => girafe en peluche
cheese =>

All three conditions involve absolutely no gradient updates. The model performs tasks purely through forward passes. This is the decisive difference from fine-tuning.

The paper's interpretation of why in-context learning works is that during pre-training, the model naturally learns various task patterns, and the examples in the prompt serve to "locate and activate" relevant abilities that already exist within the model.
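The three conditions differ only in how many examples are packed into the context. A small helper (a hypothetical name, mirroring the translation prompts shown above) makes that concrete; the examples are pure context and no gradient update occurs:

```python
def make_prompt(task_description, examples, query):
    """Assemble a zero-/one-/few-shot prompt: 0, 1, or many (src, tgt) pairs."""
    lines = [task_description]
    lines += [f"{src} => {tgt}" for src, tgt in examples]
    lines.append(f"{query} =>")        # the model completes after "=>"
    return "\n".join(lines)

few_shot = make_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
```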

4.3 Architecture Details

GPT-3 uses essentially the same architecture as GPT-2, but inspired by Sparse Transformer (Child et al., 2019), alternates between Dense and Locally Banded Sparse Attention patterns.

GPT-3 trained 8 model sizes to systematically analyze scaling effects.

| Model Name | Parameters | Layers | $d_{model}$ | Heads | $d_{head}$ | Batch Size | Learning Rate |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3 Small | 125M | 12 | 768 | 12 | 64 | 0.5M | $6.0 \times 10^{-4}$ |
| GPT-3 Medium | 350M | 24 | 1,024 | 16 | 64 | 0.5M | $3.0 \times 10^{-4}$ |
| GPT-3 Large | 760M | 24 | 1,536 | 16 | 96 | 0.5M | $2.5 \times 10^{-4}$ |
| GPT-3 XL | 1.3B | 24 | 2,048 | 24 | 128 | 1M | $2.0 \times 10^{-4}$ |
| GPT-3 2.7B | 2.7B | 32 | 2,560 | 32 | 80 | 1M | $1.6 \times 10^{-4}$ |
| GPT-3 6.7B | 6.7B | 32 | 4,096 | 32 | 128 | 2M | $1.2 \times 10^{-4}$ |
| GPT-3 13B | 13.0B | 40 | 5,140 | 40 | 128 | 2M | $1.0 \times 10^{-4}$ |
| GPT-3 175B | 175.0B | 96 | 12,288 | 96 | 128 | 3.2M | $0.6 \times 10^{-4}$ |

All models use a 2,048 token context window and were trained on a total of 300B (300 billion) tokens. A consistent pattern of decreasing learning rate and increasing batch size with larger models was applied.

4.4 Training Data Composition

GPT-3's training data is a mixture of multiple sources, with the notable characteristic of applying differential training weights based on each source's quality.

| Dataset | Tokens (B) | Training Weight | Epochs |
| --- | --- | --- | --- |
| Common Crawl (filtered) | 410 | 60% | 0.44 |
| WebText2 | 19 | 22% | 2.9 |
| Books1 | 12 | 8% | 1.9 |
| Books2 | 55 | 8% | 0.43 |
| Wikipedia | 3 | 3% | 3.4 |

A notable point is that while Common Crawl accounts for most of the tokens, its training weight is limited to 60%. In contrast, the high-quality WebText2 with only 19B tokens is given a high weight of 22%. This reflects the judgment that data quality is more important than quantity.

Common Crawl Filtering Process:

  1. Document filtering based on similarity with high-quality reference corpora (WebText, Books, Wikipedia)
  2. Fuzzy deduplication between documents
  3. Adding reference corpora to the training data for the final composition

4.5 Benchmark Performance

GPT-3 175B's few-shot performance was impressive across various benchmarks.

Language Modeling:

  • PTB (Penn Treebank): 20.50 Perplexity (Zero-shot SOTA)

Question Answering:

  • TriviaQA: 71.2% accuracy (Few-shot, competitive with Fine-tuned SOTA)
  • NaturalQuestions: 29.9% accuracy (Few-shot)
  • WebQuestions: 41.5% accuracy (Few-shot)

Translation:

  • WMT14 En to Fr: 25.2 BLEU (Few-shot)
  • WMT14 Fr to En: 33.9 BLEU (Few-shot)
  • WMT16 En to De: 24.3 BLEU (Few-shot)

SuperGLUE:

  • Achieved 71.8 points with Few-shot (surpassing Fine-tuned BERT-Large at 69.0)
  • However, did not reach Fine-tuned SOTA (90.0 points)

Arithmetic Reasoning:

  • 2-digit addition: 100% accuracy
  • 3-digit addition: 80.4% accuracy
  • 4-5 digit addition: rapid decline

These results demonstrated a clear scaling effect where performance improves with larger model size and more provided examples.

4.6 GPT-3's Recognized Limitations

The paper also candidly described GPT-3's limitations.

  • Text Generation Quality: Issues with repetition, loss of coherence, and illogical statements during long document generation
  • Limitations of Few-shot: Underperforms fine-tuning-based models on natural language inference (NLI) and some reading comprehension tasks
  • Absence of Bidirectional Context: An inherent limitation of auto-regressive models; bidirectional models like BERT retain an advantage on some tasks
  • Sample Efficiency: While humans learn new tasks from one or two examples, GPT-3 requires tens to hundreds of examples
  • Lack of Interpretability: The model's decision-making process is difficult to understand, and the exact mechanism of in-context learning remains unclear


5. InstructGPT / ChatGPT (2022): Aligning with Human Intent

5.1 Paper Overview

Paper: "Training Language Models to Follow Instructions with Human Feedback"
Authors: Long Ouyang, Jeff Wu, Xu Jiang, and many others (OpenAI)
Released: March 2022 (NeurIPS 2022)

Language models up to GPT-3 had a fundamental problem: the training objective of "next token prediction" did not align with the actual use purpose of "following user instructions usefully and safely." No matter how capable a large language model was, it frequently gave irrelevant answers to questions, generated harmful content, or confidently stated inaccurate information.

InstructGPT is a groundbreaking study that solved this Alignment Problem with RLHF (Reinforcement Learning from Human Feedback). And this technology became the foundation of ChatGPT.

5.2 Definition of the Alignment Problem

The paper classified the problems of existing language models into three categories:

  1. Lack of Helpfulness: Not following user instructions and generating irrelevant text
  2. Lack of Truthfulness: Generating factually incorrect information (Hallucination)
  3. Lack of Harmlessness: Generating harmful or biased content

These three combined form the HHH (Helpful, Honest, Harmless) criteria, and InstructGPT aimed to align the model to these criteria using human feedback.

5.3 RLHF 3-Stage Pipeline

InstructGPT's RLHF pipeline consists of three stages.

Step 1: Supervised Fine-Tuning (SFT)

The first stage is traditional supervised learning. Human labelers directly write ideal responses to prompts, and GPT-3 is fine-tuned with this data.

  • Data: Approximately 13,000 (prompt, ideal response) pairs
  • Prompt Sources: Prompts written by labelers + prompts submitted by OpenAI API users
  • Training: 16 epochs, Cosine Learning Rate Decay

The SFT model provides basic instruction-following capability, but it is not yet complete. The next stage learns human preferences.

Step 2: Reward Model (RM) Training

In the second stage, a Reward Model that quantifies human preferences is trained.

Data Collection Process:

  1. Generate $K$ different responses for one prompt using the SFT model ($K$ ranges from 4 to 9)
  2. Human labelers rank the $K$ responses by preference
  3. Generate $\binom{K}{2}$ comparison pairs

Reward Model Loss Function:

$$\text{loss}(\theta) = -\frac{1}{\binom{K}{2}} E_{(x, y_w, y_l) \sim D} \left[ \log \sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right) \right]$$

Here, $r_\theta(x, y)$ is the scalar output of the Reward Model for prompt $x$ and response $y$, $y_w$ is the preferred response, $y_l$ is the non-preferred response, and $\sigma$ is the sigmoid function.

This loss function is based on the Bradley-Terry model, training the reward of the preferred response to be higher than that of the non-preferred response. Efficiency was improved by creating all $\binom{K}{2}$ comparison pairs from a single prompt and computing them in a single forward pass.
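A minimal sketch of this pairwise loss for a single prompt, assuming the $K$ scalar rewards are already computed and supplied in preference order (most preferred first), so every earlier element plays $y_w$ against every later $y_l$:

```python
import numpy as np
from itertools import combinations

def reward_model_loss(rewards_ranked):
    """Mean pairwise Bradley-Terry loss over one prompt's ranked rewards."""
    pairs = list(combinations(rewards_ranked, 2))  # all C(K,2) (r_w, r_l) pairs
    # -log σ(r_w − r_l) = log(1 + exp(−(r_w − r_l))), via the stable log1p form
    losses = [np.log1p(np.exp(-(rw - rl))) for rw, rl in pairs]
    return float(np.mean(losses))
```

A well-separated ranking (rewards decreasing with preference rank) yields a low loss; a reversed ranking yields a high one.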

  • Data Scale: Comparison data collected from approximately 33,000 prompts
  • Model Size: 6B parameters (removing the final unembedding layer from the SFT model and adding a scalar output head)

Step 3: Reinforcement Learning with PPO

In the third stage, the SFT model is optimized using the PPO (Proximal Policy Optimization) algorithm with the trained Reward Model as the reward signal.

PPO Optimization Objective:

$$\text{objective}(\phi) = E_{(x, y) \sim D_{\pi_\phi^{RL}}} \left[ r_\theta(x, y) - \beta \cdot D_{KL}\left(\pi_\phi^{RL}(y \mid x) \,\|\, \pi^{SFT}(y \mid x)\right) \right]$$

Where:

  • $\pi_\phi^{RL}$: The RL policy being trained (language model)
  • $\pi^{SFT}$: Reference policy from the SFT stage
  • $r_\theta(x, y)$: Reward Model output
  • $\beta$: KL penalty coefficient
  • $D_{KL}$: KL divergence

Role of KL Divergence Penalty:

The KL Divergence term prevents the model from straying too far from the SFT model during RL training. Without this constraint, the model can exploit loopholes in the Reward Model to obtain high rewards while actually generating meaningless text -- a phenomenon known as Reward Hacking.

The exact form of the KL Divergence is:

$$D_{KL}\left(\pi_\phi^{RL}(\cdot \mid x) \,\|\, \pi^{SFT}(\cdot \mid x)\right) = \sum_y \pi_\phi^{RL}(y \mid x) \log \frac{\pi_\phi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)}$$

In practice, this KL Divergence is applied by directly subtracting it from the reward. That is, the modified reward is:

$$R(x, y) = r_\theta(x, y) - \beta \cdot \log \frac{\pi_\phi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)}$$
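This shaped reward is a one-line computation once the response's log-probabilities under both models are known. A minimal sketch; the default `beta` is a hypothetical value, since the paper tunes the KL coefficient:

```python
def kl_shaped_reward(r_theta, logp_rl, logp_sft, beta=0.02):
    """Shaped reward R(x, y) = r_θ(x, y) − β·log(π_RL(y|x) / π_SFT(y|x)).

    logp_rl / logp_sft: summed log-probabilities of response y under the
    current policy and the frozen SFT reference, respectively.
    """
    return r_theta - beta * (logp_rl - logp_sft)
```

If the policy assigns y a higher log-probability than the SFT reference (i.e. it has drifted toward y), the reward is docked; if it matches the reference, the shaped reward equals the raw RM score.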

PPO-ptx: Pre-training Mix

InstructGPT additionally proposed the PPO-ptx variant, which mixes the language modeling objective on the original pre-training data as an auxiliary loss during RL training.

$$\text{objective}(\phi) = E_{(x, y) \sim D_{\pi_\phi^{RL}}} \left[ r_\theta(x, y) - \beta \cdot D_{KL}\left(\pi_\phi^{RL} \,\|\, \pi^{SFT}\right) \right] + \gamma \cdot E_{x \sim D_{\text{pretrain}}} \left[ \log \pi_\phi^{RL}(x) \right]$$

Here, $\gamma$ is the weight of the pre-training loss. This term prevents the degradation of the model's general language capabilities during RL training (the "Alignment Tax").

5.4 Remarkable Result: Small Model Beats Large Model

InstructGPT's most remarkable result is that 1.3B parameter InstructGPT was preferred over 175B parameter GPT-3 in human evaluations. A model with more than 100 times fewer parameters generated more useful, more truthful, and more harmless responses.

Key Experimental Results:

  • InstructGPT outputs overwhelmingly preferred over GPT-3 outputs in human evaluation
  • Similar or slightly lower performance compared to GPT-3 on public NLP benchmarks (Alignment Tax)
  • Significant improvement of PPO model over GPT-3 on TruthfulQA
  • Approximately 25% reduction in toxicity generation compared to GPT-3

This result showed that training methodology matters more than model size. "Making it bigger" is not the only answer -- "aligning it with human intent" is the key lesson.

5.5 From InstructGPT to ChatGPT

InstructGPT's technology became the core foundation of ChatGPT, released in November 2022. ChatGPT is a model that applied conversational RLHF to GPT-3.5 (an improved version of GPT-3).

ChatGPT's release was a turning point in AI history. Reaching 1 million users in 5 days and 100 million users in 2 months, it ushered in an era where AI directly reached the general public. Without InstructGPT's technical contributions, this revolution would have been impossible.


6. GPT-4 (2023): Multimodal and Predictable Scaling

6.1 Paper Overview

Paper: "GPT-4 Technical Report"
Authors: OpenAI
Released: March 2023 (arXiv: 2303.08774)

The GPT-4 Technical Report is fundamentally different from previous GPT papers. Most key information including architecture, model size, training data, and training costs is undisclosed. OpenAI cited "competitive landscape and safety considerations" as reasons for not disclosing this information. This was widely criticized for the disconnect with the "Open" in OpenAI.

Nevertheless, the paper contains several important technical contributions.

6.2 Multimodal Input

The most notable new capability of GPT-4 is that it can accept both images and text as input simultaneously. Output is still limited to text only.

Examples of Multimodal Capabilities:

  • Recognition and interpretation of text within images
  • Data analysis of charts and graphs
  • Description of humor images and interpretation of their humor
  • Interpretation of scientific diagrams and solving related problems

This multimodal capability later evolved into GPT-4V (Vision) and was applied to actual services.

6.3 Predictable Scaling

The most important technical contribution of the GPT-4 paper is the Predictable Scaling methodology.

The core idea is that the performance of a large model can be accurately predicted from the performance of small models. OpenAI measured the performance of smaller models trained with the same methodology as GPT-4, predicted GPT-4's final performance from this, and compared it with actual training results.

Loss Prediction: From the training of models using 1,000x to 10,000x less compute, GPT-4's final loss was predicted using a Power Law. The actual training result was very close to the prediction.

HumanEval Coding Performance Prediction: The pass rate on a coding benchmark could also be predicted from smaller model results. This suggests that not only loss but specific task performance is predictable.

The practical value of this Predictable Scaling methodology is immense. Before committing to large-scale model training costing tens of millions to hundreds of millions of dollars, small-scale experiments can predict the final performance to evaluate return on investment in advance.
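The idea can be illustrated with a synthetic power-law fit: fit $L(C) = a \cdot C^{-\alpha}$ on cheap small-compute runs, then extrapolate to a large run. The constants below are made up for illustration and are not OpenAI's actual values.

```python
import numpy as np

# Synthetic, noise-free "small run" losses following a power law.
alpha_true, a_true = 0.05, 4.0
compute_small = np.array([1e3, 1e4, 1e5, 1e6])          # small training runs
loss_small = a_true * compute_small ** (-alpha_true)

# A power law is a straight line in log-log space: log L = log a − α·log C.
slope, intercept = np.polyfit(np.log(compute_small), np.log(loss_small), 1)

def predict_loss(C):
    return np.exp(intercept) * C ** slope

big_pred = predict_loss(1e10)   # 10,000x the compute of the largest fit point
```

On real training runs the fit points are noisy, but the GPT-4 report found the extrapolation nonetheless landed very close to the final measured loss.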

However, the paper acknowledged that phenomena such as inverse scaling and sudden emergent abilities are hard to predict. In particular, emergent abilities -- where specific capabilities suddenly appear at a certain scale -- are a major exception to Predictable Scaling.

6.4 Professional Exam Performance

GPT-4 demonstrated impressive performance on various professional exams designed for humans. The model received no specific training for these exams.

| Exam | GPT-4 Score (Percentile) | GPT-3.5 Score (Percentile) | Note |
| --- | --- | --- | --- |
| Uniform Bar Exam (MBE+MEE+MPT) | ~298/400 (top 10%) | ~213/400 (bottom 10%) | US Bar Exam |
| LSAT | 163 (top 12%) | 149 (bottom 40%) | Law School Admission |
| SAT Evidence-Based Reading & Writing | 710/800 (93rd) | 670/800 (87th) | US College Admission |
| SAT Math | 700/800 (89th) | 590/800 (70th) | US College Admission |
| GRE Quantitative | 163/170 (80th) | 157/170 (62nd) | Graduate Admission |
| GRE Verbal | 169/170 (99th) | 154/170 (63rd) | Graduate Admission |
| AP Biology | 5 (85~100th) | 4 (62~85th) | AP Biology |
| AP Chemistry | 4 (71~88th) | 2 (22~46th) | AP Chemistry |
| AP Calculus BC | 4 (43~59th) | 1 (0~7th) | AP Calculus |
| AP English Literature | 2 (8~22nd) | 2 (8~22nd) | AP English Literature |

Notable patterns:

  • Dramatic performance improvement over GPT-3.5 in law, science, and mathematics (Bar Exam: bottom 10% to top 10%)
  • Relatively weak performance in language/literature (AP English Literature: bottom 22%)
  • Mathematical reasoning improved but still not top-tier (AP Calculus BC: 43~59th percentile)

6.5 Safety and Alignment Improvements

GPT-4 was significantly improved in safety compared to GPT-3.5.

RLHF-based Safety Training:

  • Introduced additional safety reward signals in the training process
  • Used GPT-4 Zero-shot Classifier to judge safety boundaries and response styles
  • Applied safety rewards to both allowed/disallowed categories to prevent over-refusal of valid requests

Quantitative Improvements:

  • 82% reduction in response rate to disallowed content requests compared to GPT-3.5
  • 29% improvement in policy compliance for sensitive requests (medical advice, self-harm, etc.)
  • 40% higher score on internal adversarial factuality evaluation compared to GPT-3.5
  • Improvement from approximately 60% to 80% on TruthfulQA after RLHF

Expert Red-teaming:

  • Over 50 domain experts (AI safety, cybersecurity, biological risks, international security, etc.) participated in adversarial testing
  • Evaluation of high-risk scenarios (autonomous replication, chemical/biological weapons information, etc.)

6.6 GPT-4's Limitations

The limitations explicitly acknowledged in the paper are:

  • Hallucination: Can still "confidently" generate factually incorrect information. Greatly improved by RLHF but not fully resolved.
  • Context Window Limitation: Limited to 8K/32K tokens at training time, limiting very long document processing.
  • Training Data Cutoff: Does not know information after the training data cutoff (trained on data up to September 2021).
  • Incomplete Reasoning: Can make mistakes in complex multi-step reasoning, especially in mathematical proofs and subtle code bugs.
  • Bias and Calibration: Social biases have not been fully removed, and the model's confidence does not necessarily match actual accuracy.

7. In-depth Analysis of Scaling Laws

7.1 Kaplan Scaling Laws (2020)

"Scaling Laws for Neural Language Models," published by Jared Kaplan and others at OpenAI in January 2020, a few months before GPT-3, provided the theoretical foundation for large language model research.

Key Finding -- Power Law Relationships:

The cross-entropy loss $L$ of a language model has a power law relationship with the number of model parameters $N$, dataset size $D$, and compute $C$ used for training.

$$L(N) \propto N^{-\alpha_N}, \quad \alpha_N \approx 0.076$$
$$L(D) \propto D^{-\alpha_D}, \quad \alpha_D \approx 0.095$$
$$L(C) \propto C^{-\alpha_C}, \quad \alpha_C \approx 0.050$$

These relationships hold over more than 7 orders of magnitude and show very stable trend lines.

Compute-optimal Allocation (Kaplan Version):

To minimize loss with a fixed compute budget $C$, the conclusion was that it is optimal to increase model size while using relatively less data. Specifically, when compute increases 10x, it is most efficient to increase model size by about 5.5x and data by only about 1.8x.

$$N_{\text{opt}} \propto C^{0.73}, \quad D_{\text{opt}} \propto C^{0.27}$$

This result led to the interpretation that "increasing model size is more efficient than increasing data," and served as justification for GPT-3's 175B parameter scale.
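The Kaplan allocation rule can be sketched in a few lines; the exponents come from the paper, while the proportionality constants are dropped so that only relative multipliers remain:

```python
# Kaplan-style compute-optimal allocation: N_opt ∝ C^0.73, D_opt ∝ C^0.27.
# Proportionality constants are omitted; we only track relative growth.
def kaplan_allocation(compute_multiplier):
    """Return (params multiplier, data multiplier) when compute grows
    by `compute_multiplier` under the Kaplan exponents."""
    return compute_multiplier ** 0.73, compute_multiplier ** 0.27

n_mult, d_mult = kaplan_allocation(10.0)
print(f"10x compute -> {n_mult:.1f}x params, {d_mult:.1f}x data")
```

Note that the two multipliers always combine back to the compute multiplier, since the exponents sum to 1.0.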

7.2 Chinchilla Scaling Laws (2022)

An important correction to Kaplan's Scaling Laws was presented in DeepMind's 2022 "Training Compute-Optimal Large Language Models" (known as the Chinchilla paper).

Key Finding: Existing models are under-trained.

Unlike Kaplan's conclusion, the Chinchilla paper argued that model size and training data should be increased at nearly equal rates. Specifically, approximately 20 training tokens per parameter is compute-optimal.

$$N_{\text{opt}} \propto C^{0.50}, \quad D_{\text{opt}} \propto C^{0.50}$$

By this criterion, GPT-3 (175B parameters, 300B tokens) was data-starved. Compute-optimal training would have required approximately 3.5T (3.5 trillion) tokens.
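The 20-tokens-per-parameter rule of thumb makes the "data-starved" claim easy to verify in code:

```python
# Chinchilla rule of thumb: ~20 training tokens per parameter is
# compute-optimal. Compare against GPT-3's actual data budget.
TOKENS_PER_PARAM = 20

def chinchilla_optimal_tokens(n_params):
    return TOKENS_PER_PARAM * n_params

gpt3_params = 175e9
gpt3_tokens = 300e9
optimal = chinchilla_optimal_tokens(gpt3_params)   # 3.5e12 = 3.5T tokens
print(f"optimal: {optimal / 1e12:.1f}T tokens, actual: {gpt3_tokens / 1e12:.1f}T")
print(f"GPT-3 saw {gpt3_tokens / optimal:.0%} of its compute-optimal data budget")
```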

Chinchilla vs. GPT-3:

| Item | GPT-3 | Chinchilla |
| --- | --- | --- |
| Parameters | 175B | 70B |
| Training Tokens | 300B | 1.4T |
| Token/Parameter Ratio | ~1.7 | ~20 |
| MMLU (5-shot) | 43.9% | 67.5% |
| Compute | ~3,640 PF-days | ~5,200 PF-days |

Chinchilla is a 2.5x smaller model than GPT-3 but achieved higher performance by training on 4.7x more data. This result fundamentally influenced the direction of subsequent large-scale model training.

7.3 Impact of Scaling Laws on GPT-4

GPT-4's Predictable Scaling is a direct application of this Scaling Laws research. If the loss of small models follows a Power Law, then the trend line can be extrapolated to predict the loss of large models.

What the GPT-4 paper showed is that this prediction is surprisingly accurate. This suggests that Scaling Laws are not merely empirical observations but reflect deep structural properties of the language model training process.
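Predictable scaling can be demonstrated in miniature: fit a power law to the losses of "small" runs by linear regression in log-log space, then extrapolate. The data below is synthetic, generated from a known power law, so the fit recovers it exactly; real training curves would add noise around the trend line:

```python
# Fit L = a * C^(-alpha) to small-scale (compute, loss) pairs via
# least squares in log-log space, then extrapolate to large compute.
import math

# Synthetic data from a known power law (stand-in for small-model runs)
alpha_true, a_true = 0.05, 10.0
compute = [1e3, 1e4, 1e5, 1e6]
losses = [a_true * c ** (-alpha_true) for c in compute]

# log L = log a - alpha * log C  ->  ordinary least squares on logs
xs = [math.log(c) for c in compute]
ys = [math.log(v) for v in losses]
n = len(xs)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum((x - x_mean) ** 2 for x in xs)
alpha_fit = -slope
a_fit = math.exp(y_mean - slope * x_mean)

# Extrapolate four orders of magnitude beyond the largest "small" run
pred = a_fit * (1e10) ** (-alpha_fit)
print(f"fitted alpha = {alpha_fit:.3f}, predicted loss at C=1e10: {pred:.3f}")
```

This log-log regression is, in spirit, what GPT-4's predictable-scaling pipeline does with far more runs and far greater care.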

However, there are important limitations to this predictability:

  • Loss is not equal to Capability: Reduction in overall loss may not directly translate to improvement in specific abilities
  • Emergent Abilities: Abilities that suddenly appear at a certain scale are difficult to predict with Power Laws
  • Inverse Scaling: In some tasks, performance decreases as the model grows larger
  • Task-specific Variability: Scaling efficiency varies significantly across tasks

8. Overall Architecture Comparison

8.1 Generation-by-Generation Architecture Comparison Table

| Item | GPT-1 | GPT-2 (XL) | GPT-3 (175B) | InstructGPT | GPT-4 |
| --- | --- | --- | --- | --- | --- |
| Release Date | 2018.06 | 2019.02 | 2020.05 | 2022.03 | 2023.03 |
| Parameters | 117M | 1,542M | 175,000M | 1,300M~175,000M | Undisclosed |
| Layers | 12 | 48 | 96 | 96 (175B basis) | Undisclosed |
| Hidden Dim | 768 | 1,600 | 12,288 | 12,288 (175B basis) | Undisclosed |
| Attention Heads | 12 | 25 | 96 | 96 (175B basis) | Undisclosed |
| Head Dimension | 64 | 64 | 128 | 128 (175B basis) | Undisclosed |
| Context Window | 512 | 1,024 | 2,048 | 2,048 | 8,192 / 32,768 |
| Vocabulary Size | 40,000 | 50,257 | 50,257 | 50,257 | ~100,000 (est.) |
| Training Data | BooksCorpus (5GB) | WebText (40GB) | Mixed (570GB) | GPT-3 + Human Feedback | Undisclosed |
| Training Tokens | ~1B (est.) | ~10B (est.) | 300B | 300B + RLHF | Undisclosed |
| Tokenization | BPE (40K merges) | Byte-level BPE | Byte-level BPE | Byte-level BPE | Undisclosed |
| Positional Enc. | Learned | Learned | Learned | Learned | Undisclosed |
| Activation | GELU | GELU | GELU | GELU | Undisclosed |
| LayerNorm | Post-norm | Pre-norm | Pre-norm | Pre-norm | Undisclosed |
| Training Method | LM + Fine-tuning | LM only | LM only | LM + SFT + RLHF | LM + SFT + RLHF |
| Multimodal | No | No | No | No | Yes (Image Input) |
| Sparse Attention | No | No | Yes (partial) | Yes (partial) | Undisclosed |

8.2 Evolution of Paradigms

More important than the architecture itself is the evolution of paradigms.

GPT-1: Pre-train -> Fine-tune (fine-tuning required for each task)
         |
GPT-2: Pre-train -> Zero-shot (direct use without fine-tuning)
         |
GPT-3: Pre-train -> In-context Learning (task performance with examples only)
         |
InstructGPT: Pre-train -> SFT -> RLHF (alignment with human feedback)
         |
GPT-4: Pre-train -> SFT -> RLHF + Multimodal (multimodal + enhanced safety)

The consistent direction of this evolution is reducing user intervention. GPT-1 required training data and fine-tuning for each task, but by GPT-4, nearly all tasks can be performed with natural language instructions alone.
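This shift is visible purely in how a task is presented to the model. The prompts below, for a toy sentiment-classification task, are illustrative constructions, not examples drawn from the papers:

```python
# One task, three eras of prompting. All prompts are hypothetical.
task_input = "The movie was a complete waste of time."

# GPT-2 era: zero-shot -- rely on the raw LM to continue a bare pattern
zero_shot = f"Review: {task_input}\nSentiment:"

# GPT-3 era: few-shot in-context learning -- prepend worked examples
few_shot = (
    "Review: An absolute masterpiece.\nSentiment: positive\n\n"
    "Review: I fell asleep halfway through.\nSentiment: negative\n\n"
    f"Review: {task_input}\nSentiment:"
)

# InstructGPT/GPT-4 era: plain natural-language instruction
instruction = (
    "Classify the sentiment of the following review as positive or "
    f"negative.\n\nReview: {task_input}"
)

for name, prompt in [("zero-shot", zero_shot), ("few-shot", few_shot),
                     ("instruction", instruction)]:
    print(f"--- {name} ---\n{prompt}\n")
```

The instruction form requires no pattern engineering at all, which is exactly the "reduced user intervention" the diagram above traces.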


9. GPT's Impact: Transformation of the AI Ecosystem

9.1 ChatGPT and AI Democratization

The most direct impact of the GPT series is the democratization of AI through ChatGPT.

ChatGPT Growth Metrics:

  • Released November 30, 2022
  • 1 million users in 5 days
  • 100 million users in 2 months (fastest record ever, surpassing TikTok's 9 months)
  • Roughly 300 million weekly active users reported by the end of 2024

ChatGPT transformed the concept of "AI" from an exclusive domain of researchers and developers to an everyday tool for the general public. This transformation would have been impossible without InstructGPT's RLHF technology.

9.2 API Economy and AI-native Services

GPT-3's API release (June 2020) marked the beginning of the AI API Economy.

New Business Models:

  • Wrapper Services: Building specialized UX on top of the GPT API (Jasper, Copy.ai, etc.)
  • Vertical AI: AI solutions optimized for specific domains (Harvey for Law, Hippocratic AI for Healthcare)
  • AI-augmented SaaS: Integrating AI features into existing SaaS (Notion AI, GitHub Copilot, etc.)
  • Agent Frameworks: Autonomous agents using GPT as the core reasoning engine (AutoGPT, LangChain, etc.)

9.3 Academic Impact

The GPT series also had a fundamental impact on the direction of academic research.

Birth of New Research Fields:

  • Prompt Engineering: Research on prompt design to maximize the effectiveness of in-context learning
  • Alignment Research: Various alignment techniques beyond RLHF (DPO, ORPO, Constitutional AI, etc.)
  • Mechanistic Interpretability: Research to understand the internal workings of large models
  • Scaling Laws: Quantitative analysis of the relationship between model performance and resources
  • Evaluation: Recognizing the limitations of existing benchmarks and developing new evaluation methodologies

Changes in Research Methodology:

  • Shift in research focus from "model architecture innovation" to "data, training methods, alignment"
  • Growing gap between academic and industrial research due to increasing compute requirements
  • Partial recovery of academic accessibility through open-source models (LLaMA, Mistral, etc.)

9.4 Impact on Industry and Society

  • Education: AI tutors, automated grading, personalized learning content generation
  • Healthcare: Medical document writing assistance, diagnostic support, drug interaction analysis
  • Law: Case search, contract analysis, legal advice drafting
  • Software Development: Code generation, debugging, documentation (GitHub Copilot)
  • Content Creation: Writing assistance, translation, summarization, idea generation

10. Limitations and Criticisms

10.1 Hallucination

The most serious limitation of the GPT series is the Hallucination problem -- confidently generating information that is factually incorrect.

Types of Hallucination:

  • Factual Errors: Non-existent citations, incorrect statistics, fabricated historical facts
  • Logical Leaps: Jumping from premises to conclusions without valid reasoning
  • Self-contradiction: Making contradictory claims within the same conversation

Root Causes:

  • Auto-regressive models simply generate "plausible next tokens" without verifying factual accuracy
  • Training data contains errors, and the model cannot distinguish them
  • RLHF may encourage confident errors by rewarding "speaking confidently"
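The first root cause can be made concrete: an autoregressive decoding step only samples from a probability distribution over next tokens, with no fact-checking anywhere in the loop. The tokens and logits below are invented for illustration:

```python
# "Plausible next token" in miniature: sampling from a softmax over
# hypothetical logits. Nothing in this step verifies factual accuracy.
import math
import random

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for continuations of "The capital of Australia is"
tokens = ["Canberra", "Sydney", "Melbourne"]
logits = [2.0, 1.6, 0.5]   # the wrong answer is nearly as "plausible"

probs = softmax(logits)
for t, p in zip(tokens, probs):
    print(f"{t}: {p:.2f}")

# Sampling can fluently emit "Sydney" -- a confident factual error
random.seed(0)
choice = random.choices(tokens, weights=probs)[0]
print("sampled:", choice)
```

Whenever the training distribution makes a falsehood nearly as likely as the truth, sampling will sometimes produce it with perfect fluency.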

Through RLHF and related mitigations, GPT-4 is reported to score roughly 40% higher than GPT-3.5 on OpenAI's internal factuality evaluations, but complete resolution remains elusive. This is one of the most active areas of current LLM research.

10.2 Bias

Large language models reflect and sometimes amplify social biases inherent in their training data.

Types of Bias:

  • Gender Bias: Reflection of stereotypes in occupations, personality traits, etc.
  • Racial/Ethnic Bias: Negative associations with specific races
  • Cultural Bias: English-speaking, particularly US-centric worldview
  • Socioeconomic Bias: Overrepresentation of certain class perspectives

The GPT-3 paper explicitly acknowledged this and included bias analysis related to Gender, Race, and Religion. InstructGPT and GPT-4 attempted to reduce bias through RLHF, but completely eliminating bias inherent in training data remains a fundamentally challenging problem.

10.3 Environmental Cost

The environmental cost of large-scale model training is becoming an increasingly significant concern.

Estimated Training Carbon Emissions:

  • GPT-3: Approximately 552 tons CO2e (equivalent to the annual emissions of about 120 average US cars)
  • GPT-4: Estimated at approximately 15,000 tons CO2e (unofficial estimate, about 27x GPT-3)

Water Consumption:

  • Microsoft reportedly used approximately 700,000 liters of freshwater for data center cooling during GPT-3 training

Criticism and Counterarguments:

  • While the cost of a single training run is large, the trained model is used by hundreds of millions, so the per-person cost is negligible
  • Model efficiency improvements (Distillation, Quantization, Pruning) and hardware advances are reducing costs
  • However, concerns about Jevons Paradox (where efficiency improvements actually increase total consumption) also exist

10.4 Transparency and Reproducibility

One of the most persistent criticisms of the GPT series is lack of transparency.

  • GPT-1: Paper, code, and model released (relatively open)
  • GPT-2: Paper released, model released in stages ("too dangerous" controversy)
  • GPT-3: Paper released, model accessible only via API
  • GPT-4: Architecture, data, training cost, and other key information undisclosed

This trend sits increasingly at odds with the "Open" in the organization's name and has seriously undermined academic reproducibility. In response, the importance of open models such as Meta's LLaMA and Mistral AI's Mistral/Mixtral has become more prominent.

10.5 Economic Inequality and Compute Divide

The concentration of resources needed for large-scale model training exacerbates economic inequality in AI research.

  • GPT-3 training cost: Approximately $4.6 million (estimated)
  • GPT-4 training cost: Over approximately $100 million (estimated)
  • Investments of this scale are possible only for a few large corporations, structurally excluding universities and small research labs
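These cost figures can be roughly reproduced with the common $C \approx 6ND$ approximation (6 FLOPs per parameter per token). The hardware throughput and price assumptions below are illustrative, idealized (100% utilization) values, not OpenAI's actual numbers:

```python
# Back-of-the-envelope training compute and cost via C ~= 6 * N * D.
def training_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

gpt3_flops = training_flops(175e9, 300e9)   # ~3.15e23 FLOPs
pf_days = gpt3_flops / (1e15 * 86400)       # 1 PF-day = 1e15 FLOP/s for a day
print(f"GPT-3: {gpt3_flops:.2e} FLOPs ~= {pf_days:,.0f} PF-days")

# Assumed hardware: V100-class GPU at ~28 TFLOPS peak, $1.50/GPU-hour
gpu_hours = gpt3_flops / (28e12 * 3600)
cost = gpu_hours * 1.50
print(f"~{gpu_hours / 1e6:.1f}M GPU-hours -> ~${cost / 1e6:.1f}M")
```

Under these assumptions the arithmetic lands near the ~3,640 PF-days and ~$4.6M estimates cited for GPT-3; real-world utilization, engineering, and failed runs push actual costs higher.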

11. Summary: The Legacy of GPT

The key insights running through the five papers of the GPT series can be summarized as follows:

1. Scale is (almost) all you need

The scaling from GPT-1 (117M) to GPT-2 (1.5B) to GPT-3 (175B) was not simply "the same thing but bigger" -- it led to qualitatively new emergent abilities. Zero-shot, in-context learning, and complex reasoning were Emergent Abilities that appear only at sufficient scale.

2. Alignment changes everything

InstructGPT showed that training methodology can matter more than model size. The 1.3B InstructGPT beating 175B GPT-3 demonstrated that there is a large gap between raw capability and usefulness, and RLHF can bridge that gap.

3. The bitter lesson revisited

Rich Sutton's "The Bitter Lesson" -- general methods + more compute beat specialized methods -- was repeatedly confirmed in the GPT series. General-purpose Transformer + large-scale pre-training was overwhelmingly more effective than task-specific architectures.

4. Data is the new bottleneck

After Chinchilla's lesson, the quantity and quality of training data emerged as a key bottleneck alongside model size. High-quality text on the internet is finite, and Synthetic Data generation is emerging as a new research direction.

5. Safety is not optional

From GPT-2's "too dangerous to release" controversy to GPT-4's red-teaming, safety has become mandatory, not optional. As AI models become more powerful, the importance of safe and responsible development grows proportionally.

The GPT series is not yet over. What capabilities GPT-5 and beyond will show remains unknown, but one thing is certain: the paradigm of "large-scale pre-training + human feedback alignment" established by the GPT series has become the foundation of modern AI, and understanding it is essential for understanding the future of AI.


12. References

  1. GPT-1: Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). "Improving Language Understanding by Generative Pre-Training." OpenAI Paper

  2. GPT-2: Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). "Language Models are Unsupervised Multitask Learners." OpenAI Paper

  3. GPT-3: Brown, T. B., Mann, B., Ryder, N., et al. (2020). "Language Models are Few-Shot Learners." NeurIPS 2020. arXiv:2005.14165

  4. InstructGPT: Ouyang, L., Wu, J., Jiang, X., et al. (2022). "Training Language Models to Follow Instructions with Human Feedback." NeurIPS 2022. arXiv:2203.02155

  5. GPT-4: OpenAI. (2023). "GPT-4 Technical Report." arXiv:2303.08774

  6. Scaling Laws: Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361

  7. Chinchilla: Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). "Training Compute-Optimal Large Language Models." arXiv:2203.15556

  8. Transformer: Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need." NeurIPS 2017. arXiv:1706.03762

  9. PPO: Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). "Proximal Policy Optimization Algorithms." arXiv:1707.06347

  10. RLHF: Christiano, P. F., Leike, J., Brown, T., et al. (2017). "Deep Reinforcement Learning from Human Preferences." NeurIPS 2017. arXiv:1706.03741

  11. Sparse Transformer: Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). "Generating Long Sequences with Sparse Transformers." arXiv:1904.10509

  12. BPE: Sennrich, R., Haddow, B., & Birch, A. (2016). "Neural Machine Translation of Rare Words with Subword Units." ACL 2016. arXiv:1508.07909

  13. Carbon Footprint: Patterson, D., Gonzalez, J., Le, Q., et al. (2021). "Carbon Emissions and Large Neural Network Training." arXiv:2104.10350
