- 1. GPT Series Overview and Timeline
- 2. GPT-1 (2018): The Beginning of Generative Pre-Training
- 3. GPT-2 (2019): The Possibility of Zero-shot Learning
- 4. GPT-3 (2020): The Power of In-context Learning and Scaling
- 5. InstructGPT / ChatGPT (2022): Aligning with Human Intent
- 6. GPT-4 (2023): Multimodal and Predictable Scaling
- 7. In-depth Analysis of Scaling Laws
- 8. Overall Architecture Comparison
- 9. GPT's Impact: Transformation of the AI Ecosystem
- 10. Limitations and Criticisms
- 11. Summary: The Legacy of GPT
- 12. References
- Related Series and Recommended Posts
1. GPT Series Overview and Timeline
GPT (Generative Pre-trained Transformer) is a series of Large Language Models (LLMs) published by OpenAI since 2018. True to its name "Generative Pre-trained Transformer," it established the paradigm of performing unsupervised pre-training on large-scale text data based on the Transformer Decoder architecture, then applying it to various downstream tasks.
The GPT series did not simply grow in model size -- each generation redefined how language models are utilized. The journey in chronological order is as follows:
| Generation | Release | Paper Title | Key Keywords | Parameters |
|---|---|---|---|---|
| GPT-1 | 2018.06 | Improving Language Understanding by Generative Pre-Training | Unsupervised Pre-training + Supervised Fine-tuning | 117M |
| GPT-2 | 2019.02 | Language Models are Unsupervised Multitask Learners | Zero-shot Transfer, WebText | 1.5B |
| GPT-3 | 2020.05 | Language Models are Few-Shot Learners | In-context Learning, Scaling Laws | 175B |
| InstructGPT | 2022.03 | Training Language Models to Follow Instructions with Human Feedback | RLHF, Human Alignment | 1.3B~175B |
| GPT-4 | 2023.03 | GPT-4 Technical Report | Multimodal, Predictable Scaling | Undisclosed |
It is noteworthy that each generation's paper title carries its core message. GPT-1 declared "improving language understanding through generative pre-training," GPT-2 claimed "language models are unsupervised multitask learners," and GPT-3 went a step further with "language models are few-shot learners." InstructGPT presented a practical direction of "training to follow instructions with human feedback," and GPT-4 was simply published as a "technical report," hinting at its commercial transition.
In this article, we analyze each paper's key contributions, architecture details, training methodology, and impact on subsequent research, together with equations.
2. GPT-1 (2018): The Beginning of Generative Pre-Training
2.1 Paper Overview
Paper: "Improving Language Understanding by Generative Pre-Training"
Authors: Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever (OpenAI)
Released: June 2018
The core idea of GPT-1 is surprisingly simple. Pre-train a language model on large-scale unlabeled text, then fine-tune it on a specific task with a small amount of labeled data. This two-stage approach (Semi-supervised Learning) transformed the NLP landscape at the time.
In 2018, NLP was dominated by task-specific architectures. Designing separate models for each task such as sentiment analysis, question answering, and textual entailment, and training them only on task-specific labeled data was standard. GPT-1 proposed a new path of "general-purpose pre-training" to this paradigm.
2.2 Architecture Details
GPT-1 adopted an architecture using only the Decoder blocks of the Transformer. While the original Transformer (Vaswani et al., 2017) had an Encoder-Decoder structure, GPT-1 chose a Decoder-only structure suitable for auto-regressive language modeling.
Model Configuration:
- Number of Layers: 12 Transformer Decoder blocks
- Hidden Dimension: 768
- Number of Attention Heads: 12 (64 dimensions each)
- Feed-Forward Dimension: 3,072 ($4 \times 768$)
- Context Window: 512 tokens
- Total Parameters: Approximately 117M (117 million)
- Activation Function: GELU (Gaussian Error Linear Unit)
- Positional Encoding: Learned Positional Embedding
Instead of the fixed Sinusoidal Positional Encoding used in the original Transformer, GPT-1 adopted learned positional embeddings. This allowed the model to learn positional information directly from data, enabling more flexible adaptation to various tasks.
2.3 Stage 1: Unsupervised Pre-training
In the pre-training stage, the standard language modeling objective is optimized over a large-scale unlabeled token corpus $\mathcal{U} = \{u_1, \dots, u_n\}$:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$$

Here, $k$ is the context window size and $\Theta$ represents the model parameters. This is a typical Auto-regressive Language Modeling objective that maximizes the probability of the next token given the previous tokens.
Specifically, each token's representation is computed as follows:

$$h_0 = U W_e + W_p, \qquad h_l = \text{transformer\_block}(h_{l-1}) \;\; \forall l \in [1, n], \qquad P(u) = \text{softmax}(h_n W_e^\top)$$

Here, $U = (u_{-k}, \dots, u_{-1})$ is the context token vector, $W_e$ is the token embedding matrix, and $W_p$ is the position embedding matrix. Output probabilities are computed by reusing the token embedding matrix $W_e$ (Weight Tying).
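The embedding step and the weight-tied output projection described above can be sketched in a few lines of NumPy. This is a toy illustration with made-up dimensions (100-token vocabulary, 8-token context, 16-dim hidden state, not GPT-1's real sizes), and the 12 transformer blocks are elided:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, ctx_len, d_model = 100, 8, 16          # toy sizes, not GPT-1's
W_e = rng.normal(0, 0.02, (vocab_size, d_model))   # token embedding matrix
W_p = rng.normal(0, 0.02, (ctx_len, d_model))      # learned position embeddings

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

# h_0 = U W_e + W_p : embed the context tokens and add positions
tokens = rng.integers(0, vocab_size, ctx_len)
h = W_e[tokens] + W_p                              # (ctx_len, d_model)

# ... h would pass through the 12 transformer_block layers here ...

# Output distribution reuses W_e (weight tying): P(u) = softmax(h_n W_e^T)
probs = softmax(h @ W_e.T)                         # (ctx_len, vocab_size)
```

Weight tying keeps the input and output vocabularies in the same embedding space and saves a `vocab_size × d_model` output matrix.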
Training Data: The BooksCorpus dataset was used, consisting of approximately 7,000 unpublished books containing about 5GB of text. The abundance of long-form text made it suitable for learning long-range dependencies.
Tokenization: BPE (Byte Pair Encoding) was used with 40,000 merges to construct the vocabulary.
Optimization: The Adam Optimizer was used with a learning rate that linearly increased from 0 to $2.5 \times 10^{-4}$ during the first 2,000 steps (Linear Warmup), then decreased with Cosine Annealing. Batch Size was 64, trained for 100 epochs.
2.4 Stage 2: Supervised Fine-tuning
To apply the pre-trained model to a specific task, it is fine-tuned with labeled data $\mathcal{C}$. Given an input token sequence $x^1, \dots, x^m$ with corresponding label $y$, the following objective is optimized:

$$L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m)$$

Here, $P(y \mid x^1, \dots, x^m) = \text{softmax}(h_l^m W_y)$, where $h_l^m$ is the last token output of the final Transformer block, and $W_y$ is the weight of the task-specific Linear Head.
Key Technique -- Auxiliary Language Modeling Objective: GPT-1 also used the original language modeling objective as an auxiliary loss during fine-tuning. This had the effect of improving generalization performance and accelerating convergence:

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$

Here, $\lambda$ is the weight of the auxiliary loss, and the paper used $\lambda = 0.5$.
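The combined fine-tuning loss is just a weighted sum of two cross-entropies. A minimal sketch, with random logits standing in for model outputs (the `nll` helper and all dimensions are illustrative, not GPT-1's code):

```python
import numpy as np

def nll(logits, targets):
    """Mean negative log-likelihood from raw logits (softmax cross-entropy)."""
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
lm_logits  = rng.normal(size=(8, 100))   # next-token logits over a toy vocab
lm_targets = rng.integers(0, 100, 8)
clf_logits = rng.normal(size=(1, 3))     # task-head logits (e.g. 3 classes)
clf_target = np.array([1])

L2 = nll(clf_logits, clf_target)         # supervised task loss L_2
L1 = nll(lm_logits, lm_targets)          # auxiliary LM loss L_1 on the same input
lam = 0.5                                # lambda value used in the paper
L3 = L2 + lam * L1                       # L_3 = L_2 + lambda * L_1
```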
2.5 Task-specific Input Transformation
Another important contribution of GPT-1 was presenting input transformation techniques to handle various tasks with a single Transformer architecture. Without changing the architecture itself, it adapted to multiple tasks by only changing the input format.
- Text Classification: Input as `[Start] text [Extract]` and apply a Linear Layer to the last token's output
- Textual Entailment: Connect the two sentences as `[Start] premise [Delimiter] hypothesis [Extract]`
- Semantic Similarity: Reverse the order of the two sentences to create two inputs, and element-wise add their outputs
- Multiple Choice: Individually concatenate each choice with the context to create multiple sequences, and normalize the resulting scores with Softmax

This approach was very practical in that it could be applied to various tasks with minimal changes to the model architecture. The only additional parameters were the delimiter token embeddings and the final Linear Layer weights $W_y$.
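The four input transformations are purely about sequence construction. A sketch using hypothetical sentinel strings (`<s>`, `<$>`, `<e>` stand in for the Start, Delimiter, and Extract tokens; the real model uses learned embeddings, not strings):

```python
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"   # hypothetical sentinel names

def classification(text):
    # linear head is applied to the model output at the EXTRACT position
    return [START, text, EXTRACT]

def entailment(premise, hypothesis):
    return [START, premise, DELIM, hypothesis, EXTRACT]

def similarity(a, b):
    # both orderings are encoded; their EXTRACT outputs are added element-wise
    return [[START, a, DELIM, b, EXTRACT],
            [START, b, DELIM, a, EXTRACT]]

def multiple_choice(context, choices):
    # one sequence per choice; per-sequence scores are softmax-normalized
    return [[START, context, DELIM, c, EXTRACT] for c in choices]
```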
2.6 Experimental Results and Significance
GPT-1 achieved state-of-the-art results on 9 out of 12 NLP benchmarks. In particular, it significantly outperformed existing models in Commonsense Reasoning (86.5% accuracy on Stories Cloze Test), Semantic Similarity (70.3 F1 on QQP), and Question Answering (59.0% accuracy on RACE).
However, the true significance of GPT-1 lies not in individual benchmark performance but in establishing the paradigm of "large-scale unsupervised pre-training + small-scale supervised fine-tuning." This paradigm continued with BERT, RoBERTa, T5, and others, becoming the standard in NLP.
3. GPT-2 (2019): The Possibility of Zero-shot Learning
3.1 Paper Overview
Paper: "Language Models are Unsupervised Multitask Learners"
Authors: Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever (OpenAI)
Released: February 2019
The paper title of GPT-2 carries a bold claim: "Language models are unsupervised multitask learners." That is, despite being trained with a single objective of language modeling, the model can perform multiple tasks without separate fine-tuning.
While GPT-1 required two stages of "pre-training then fine-tuning," GPT-2 demonstrated that tasks can be performed zero-shot without fine-tuning. This was a fundamental paradigm shift.
3.2 Core Idea: Task as Language Modeling
The core insight of GPT-2 is that all NLP tasks can be reformulated as conditional language modeling.
Traditional supervised learning learns the conditional probability $P(\text{output} \mid \text{input})$. GPT-2 extends this to the form $P(\text{output} \mid \text{input}, \text{task})$, providing the task information in natural language.
For example:
- Translation: Expressing sequences of the form `(translate to french, english text, french text)` as natural text
- Summarization: Appending `TL;DR:` after the text to elicit a summary
- Question Answering: Providing context and questions in natural language to generate answers
The key to this idea is that if a sufficiently large language model learns sufficiently diverse text, task performance capabilities naturally emerge.
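The reformulation of $P(\text{output} \mid \text{input}, \text{task})$ as plain next-token prediction is just prompt construction. A sketch of the three examples above (the `translate` and `qa` phrasings are illustrative; `TL;DR:` is the prompt the paper actually used for summarization):

```python
def as_lm_input(task, **fields):
    """Reformulate p(output | input, task) as plain text for next-token prediction."""
    if task == "translate":
        return f"translate to french: {fields['source']} ="   # illustrative phrasing
    if task == "summarize":
        return f"{fields['document']}\nTL;DR:"                # prompt from the paper
    if task == "qa":
        return f"{fields['context']}\nQ: {fields['question']}\nA:"
    raise ValueError(f"unknown task: {task}")
```

The model then simply continues the text; no task-specific head or fine-tuning is involved.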
3.3 Architecture Details
GPT-2 is based on the GPT-1 architecture with several important modifications.
Key Changes:
- Layer Normalization Position Change: Moved to the input side of each sub-block (Pre-norm)
- Additional Layer Normalization: Added after the final Self-attention block
- Residual Weight Initialization: Residual path weights scaled by $1/\sqrt{N}$ ($N$ is the number of Residual Layers)
- Context Window Expansion: 512 to 1,024 tokens
- Vocabulary Size Expansion: 40,000 to 50,257 (Byte-level BPE)
- Batch Size Expansion: 64 to 512
GPT-2 trained four model sizes:
| Model | Parameters | Layers | Hidden Dim | Heads | Head Dim |
|---|---|---|---|---|---|
| Small | 117M | 12 | 768 | 12 | 64 |
| Medium | 345M | 24 | 1,024 | 16 | 64 |
| Large | 762M | 36 | 1,280 | 20 | 64 |
| XL | 1,542M | 48 | 1,600 | 25 | 64 |
The Head Dimension is fixed at 64 across all models, and the Feed-forward Layer dimension is always 4 times the Hidden Dimension ($d_{ff} = 4 \times d_{model}$).
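Given these ratios, parameter counts in the table can be roughly reproduced: each block contributes about $12 \, d_{model}^2$ weights (attention $4d^2$ plus the $4\times$-wide MLP $8d^2$), plus the embeddings. A back-of-the-envelope sketch (this is an approximation that ignores biases and LayerNorm parameters):

```python
def approx_params(n_layer, d_model, vocab_size=50257, ctx=1024):
    """Rough GPT-2-style parameter count.

    Per block: attention (4 d^2) + MLP (2 * 4 d^2 = 8 d^2) = 12 d^2,
    plus token and position embedding tables.
    """
    blocks = 12 * n_layer * d_model ** 2
    embeddings = vocab_size * d_model + ctx * d_model
    return blocks + embeddings

# XL config from the table: 48 layers, d_model = 1600 -> roughly 1.5B
xl = approx_params(48, 1600)
```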
3.4 WebText Dataset
Another key contribution of GPT-2 is the WebText training dataset.
Data Construction Method:
- Collected external links with 3+ Karma on Reddit (effectively human-vetted quality)
- Collected approximately 45 million links
- Extracted text from HTML using Dragnet and Newspaper libraries
- Deduplication and heuristic-based cleaning
Dataset Characteristics:
- Approximately 8 million documents
- Approximately 40GB of text
- Wikipedia was intentionally excluded (to prevent data leakage with evaluation datasets)
The design philosophy of WebText was "leverage human curation while avoiding explicit labeling costs." The idea of using Reddit's Karma system as a quality filter inspired many subsequent dataset constructions.
3.5 Byte-level BPE
GPT-2 also introduced an important innovation in tokenization. While existing BPE operates at the Unicode character level, GPT-2 applied BPE at the byte level.
Advantages of this approach:
- Complete Coverage: Since any byte sequence can be encoded, the OOV (Out-of-Vocabulary) problem is fundamentally eliminated
- Multilingual Support: Various languages and special characters can be processed without separate preprocessing
- Base Vocabulary Size: 256 (number of bytes) + special tokens
However, since naive byte-level BPE generates many inefficient merges, GPT-2 added rules to prevent merging characters from different character categories. The final vocabulary size is 50,257.
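The "complete coverage" claim is easy to verify: the base vocabulary is just the 256 possible byte values, so any string, in any script, maps to in-vocabulary IDs and round-trips losslessly. A minimal sketch of the base layer only (the learned merges on top of it are omitted):

```python
def to_byte_tokens(text):
    """Base byte-level vocabulary: every string maps to IDs in [0, 256) -- no OOV."""
    return list(text.encode("utf-8"))

# Mixed scripts and accents need no preprocessing
ids = to_byte_tokens("héllo 世界")
assert all(0 <= i < 256 for i in ids)
assert bytes(ids).decode("utf-8") == "héllo 世界"   # lossless round-trip
```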
3.6 Zero-shot Performance and Scaling
GPT-2's zero-shot performance consistently improved with model size. This was a precursor to the later Scaling Laws research.
Key Zero-shot Results:
- Language Modeling: State-of-the-art on 7 out of 8 Language Modeling benchmarks (including domains not in WebText training)
- Children's Book Test (Named Entity): 93.3% accuracy (+7% over previous SOTA)
- LAMBADA: Perplexity 8.6 (drastically improved from previous SOTA of 99.8)
- Reading Comprehension (CoQA): 55.0 F1 (surpassing 3 out of 4 existing models trained with 127,000 examples)
- Translation (WMT14 Fr-En): 11.5 BLEU zero-shot (outperforming several unsupervised translation baselines, though well below supervised systems)
- Summarization (CNN/Daily Mail): Elicited with TL;DR prompt, qualitatively meaningful results
3.7 "Too Dangerous to Release" Controversy
GPT-2 received as much attention for its release policy as for its technical achievements. OpenAI initially decided not to release the 1.5B parameter model, releasing only the smallest 117M model. The reason was "the risk of malicious use (fake news, spam, etc.) is significant."
This decision sparked intense debate in the AI community.
Supporting Arguments:
- Unrestricted release of powerful text generation models could be exploited for mass production of disinformation
- A precedent for Responsible Disclosure considering societal impact was needed
Critical Arguments:
- The danger of the 1.5B parameter model was exaggerated
- It hinders reproducibility in the academic community
- Suspicions of marketing-driven exaggeration
Eventually, OpenAI released the full model in November 2019, and the feared large-scale misuse did not materialize. However, this debate became an important catalyst for subsequent AI Safety and Responsible AI discussions.
4. GPT-3 (2020): The Power of In-context Learning and Scaling
4.1 Paper Overview
Paper: "Language Models are Few-Shot Learners"
Authors: Tom B. Brown, Benjamin Mann, Nick Ryder and many others (OpenAI)
Released: May 2020 (NeurIPS 2020)
GPT-3 is a language model of unprecedented scale with 175 billion (175B) parameters. However, the true innovation of GPT-3 is not its size but establishing the new paradigm of In-context Learning. It proved that various tasks can be performed without updating the model weights at all, simply by including a few examples in the prompt.
4.2 In-context Learning Paradigm
The GPT-3 paper systematically compared three evaluation conditions.
Zero-shot: Only a task description provided in natural language
Translate English to French:
cheese =>
One-shot: Task description + 1 example provided
Translate English to French:
sea otter => loutre de mer
cheese =>
Few-shot: Task description + 10-100 examples provided (within the context window limit)
Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe => girafe en peluche
cheese =>
All three conditions involve absolutely no gradient updates. The model performs tasks purely through forward passes. This is the decisive difference from fine-tuning.
The paper's interpretation of why in-context learning works is that during pre-training, the model naturally learns various task patterns, and the examples in the prompt serve to "locate and activate" relevant abilities that already exist within the model.
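The decisive point is that few-shot "learning" is nothing more than string construction followed by a forward pass. A sketch of building the few-shot translation prompt shown above (the `=>` formatting mirrors the paper's examples; the helper name is ours):

```python
def few_shot_prompt(instruction, examples, query):
    """Pure prompt construction -- the 'learning' happens in the forward pass."""
    lines = [instruction]
    lines += [f"{src} => {tgt}" for src, tgt in examples]
    lines.append(f"{query} =>")
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
# The model is asked only to continue this string; no weights are updated.
```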
4.3 Architecture Details
GPT-3 uses essentially the same architecture as GPT-2, but inspired by Sparse Transformer (Child et al., 2019), alternates between Dense and Locally Banded Sparse Attention patterns.
GPT-3 trained 8 model sizes to systematically analyze scaling effects.
| Model Name | Parameters | Layers | $d_{model}$ | Heads | Head Dim | Batch Size | Learning Rate |
|---|---|---|---|---|---|---|---|
| GPT-3 Small | 125M | 12 | 768 | 12 | 64 | 0.5M | $6.0 \times 10^{-4}$ |
| GPT-3 Medium | 350M | 24 | 1,024 | 16 | 64 | 0.5M | $3.0 \times 10^{-4}$ |
| GPT-3 Large | 760M | 24 | 1,536 | 16 | 96 | 0.5M | $2.5 \times 10^{-4}$ |
| GPT-3 XL | 1.3B | 24 | 2,048 | 24 | 128 | 1M | $2.0 \times 10^{-4}$ |
| GPT-3 2.7B | 2.7B | 32 | 2,560 | 32 | 80 | 1M | $1.6 \times 10^{-4}$ |
| GPT-3 6.7B | 6.7B | 32 | 4,096 | 32 | 128 | 2M | $1.2 \times 10^{-4}$ |
| GPT-3 13B | 13.0B | 40 | 5,140 | 40 | 128 | 2M | $1.0 \times 10^{-4}$ |
| GPT-3 175B | 175.0B | 96 | 12,288 | 96 | 128 | 3.2M | $0.6 \times 10^{-4}$ |
All models use a 2,048 token context window and were trained on a total of 300B (300 billion) tokens. A consistent pattern of decreasing learning rate and increasing batch size with larger models was applied.
4.4 Training Data Composition
GPT-3's training data is a mixture of multiple sources, with the notable characteristic of applying differential training weights based on each source's quality.
| Dataset | Tokens (B) | Training Weight | Epoch |
|---|---|---|---|
| Common Crawl (filtered) | 410 | 60% | 0.44 |
| WebText2 | 19 | 22% | 2.9 |
| Books1 | 12 | 8% | 1.9 |
| Books2 | 55 | 8% | 0.43 |
| Wikipedia | 3 | 3% | 3.4 |
A notable point is that while Common Crawl accounts for most of the tokens, its training weight is limited to 60%. In contrast, the high-quality WebText2 with only 19B tokens is given a high weight of 22%. This reflects the judgment that data quality is more important than quantity.
Common Crawl Filtering Process:
- Document filtering based on similarity with high-quality reference corpora (WebText, Books, Wikipedia)
- Fuzzy deduplication between documents
- Adding reference corpora to the training data for the final composition
4.5 Benchmark Performance
GPT-3 175B's few-shot performance was impressive across various benchmarks.
Language Modeling:
- PTB (Penn Treebank): 20.50 Perplexity (Zero-shot SOTA)
Question Answering:
- TriviaQA: 71.2% accuracy (Few-shot, competitive with Fine-tuned SOTA)
- NaturalQuestions: 29.9% accuracy (Few-shot)
- WebQuestions: 41.5% accuracy (Few-shot)
Translation:
- WMT14 En to Fr: 25.2 BLEU (Few-shot)
- WMT14 Fr to En: 33.9 BLEU (Few-shot)
- WMT16 En to De: 24.3 BLEU (Few-shot)
SuperGLUE:
- Achieved 71.8 points with Few-shot (surpassing Fine-tuned BERT-Large at 69.0)
- However, did not reach Fine-tuned SOTA (90.0 points)
Arithmetic Reasoning:
- 2-digit addition: 100% accuracy
- 3-digit addition: 80.4% accuracy
- 4-5 digit addition: rapid decline
These results demonstrated a clear scaling effect where performance improves with larger model size and more provided examples.
4.6 GPT-3's Recognized Limitations
The paper also candidly described GPT-3's limitations.
- Text Generation Quality: Issues with repetition, loss of coherence, and illogical statements during long document generation
- Limitations of Few-shot: Underperforming fine-tuning-based models on natural language inference (NLI) and some reading comprehension tasks
- Absence of Bidirectional Context: An inherent limitation of auto-regressive models; on some tasks, bidirectional models like BERT retain an advantage
- Sample Efficiency: While humans learn new tasks from one or two examples, GPT-3 requires tens to hundreds of examples
- Lack of Interpretability: Difficulty understanding the model's decision-making process; the exact mechanism of in-context learning remains unclear
5. InstructGPT / ChatGPT (2022): Aligning with Human Intent
5.1 Paper Overview
Paper: "Training Language Models to Follow Instructions with Human Feedback"
Authors: Long Ouyang, Jeff Wu, Xu Jiang and many others (OpenAI)
Released: March 2022 (NeurIPS 2022)
Language models up to GPT-3 had a fundamental problem: the training objective of "next token prediction" did not align with the actual use purpose of "following user instructions usefully and safely." No matter how capable a large language model was, it frequently gave irrelevant answers to questions, generated harmful content, or confidently stated inaccurate information.
InstructGPT is a groundbreaking study that solved this Alignment Problem with RLHF (Reinforcement Learning from Human Feedback). And this technology became the foundation of ChatGPT.
5.2 Definition of the Alignment Problem
The paper classified the problems of existing language models into three categories:
- Lack of Helpfulness: Not following user instructions and generating irrelevant text
- Lack of Truthfulness: Generating factually incorrect information (Hallucination)
- Lack of Harmlessness: Generating harmful or biased content
These three combined form the HHH (Helpful, Honest, Harmless) criteria, and InstructGPT aimed to align the model to these criteria using human feedback.
5.3 RLHF 3-Stage Pipeline
InstructGPT's RLHF pipeline consists of three stages.
Step 1: Supervised Fine-Tuning (SFT)
The first stage is traditional supervised learning. Human labelers directly write ideal responses to prompts, and GPT-3 is fine-tuned with this data.
- Data: Approximately 13,000 (prompt, ideal response) pairs
- Prompt Sources: Prompts written by labelers + prompts submitted by OpenAI API users
- Training: 16 epochs, Cosine Learning Rate Decay
The SFT model provides basic instruction-following capability, but it is not yet complete. The next stage learns human preferences.
Step 2: Reward Model (RM) Training
In the second stage, a Reward Model that quantifies human preferences is trained.
Data Collection Process:
- Generate $K$ different responses for one prompt using the SFT model ($K$ ranges from 4 to 9)
- Human labelers rank the responses by preference
- Generate $\binom{K}{2}$ comparison pairs
Reward Model Loss Function:

$$\text{loss}(\theta) = -\frac{1}{\binom{K}{2}} \, \mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\right]$$

Here, $r_\theta(x, y)$ is the scalar output of the Reward Model for prompt $x$ and response $y$, $y_w$ is the preferred response, $y_l$ is the non-preferred response, and $\sigma$ is the Sigmoid function.
This loss function is based on the Bradley-Terry model, training so that the reward of the preferred response is higher than that of the non-preferred response. Efficiency was improved by creating all $\binom{K}{2}$ comparison pairs from a single prompt and computing them in a single forward pass.
- Data Scale: Comparison data collected from approximately 33,000 prompts
- Model Size: 6B parameters (removing the final unembedding layer from the SFT model and adding a scalar output head)
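The pairwise Bradley-Terry loss over one prompt's $K$ responses can be sketched directly. This toy version uses precomputed scalar rewards in place of a real reward model; the function name and the specific numbers are illustrative:

```python
import numpy as np
from itertools import combinations

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rm_pairwise_loss(rewards, ranking):
    """Bradley-Terry loss over all K-choose-2 pairs from one prompt.

    rewards: scalar r_theta(x, y_k) for each of the K responses
    ranking: response indices ordered best -> worst by the labeler
    """
    pairs = list(combinations(ranking, 2))  # (preferred, dispreferred) pairs
    losses = [-np.log(sigmoid(rewards[w] - rewards[l])) for w, l in pairs]
    return float(np.mean(losses))

# K = 4 responses; the labeler ranks response 2 best, then 0, 3, 1
r = np.array([1.2, -0.5, 2.0, 0.1])
loss = rm_pairwise_loss(r, ranking=[2, 0, 3, 1])
```

Minimizing this loss pushes $r_\theta$ to assign higher scalar rewards to responses the labelers ranked higher.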
Step 3: Reinforcement Learning with PPO
In the third stage, the SFT model is optimized using the PPO (Proximal Policy Optimization) algorithm with the trained Reward Model as the reward signal.
PPO Optimization Objective:

$$\text{objective}(\phi) = \mathbb{E}_{(x, y) \sim D_{\pi_\phi^{RL}}}\left[r_\theta(x, y) - \beta \log\frac{\pi_\phi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)}\right]$$

Where:
- $\pi_\phi^{RL}$: The RL policy being trained (language model)
- $\pi^{SFT}$: Reference policy from the SFT stage
- $r_\theta(x, y)$: Reward Model output
- $\beta$: KL Penalty coefficient
- $\log\frac{\pi_\phi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)}$: per-sample KL Divergence term
Role of KL Divergence Penalty:
The KL Divergence term prevents the model from straying too far from the SFT model during RL training. Without this constraint, the model can exploit loopholes in the Reward Model to obtain high rewards while actually generating meaningless text -- a phenomenon known as Reward Hacking.
The exact form of the KL Divergence is:

$$\text{KL}\left(\pi_\phi^{RL} \,\|\, \pi^{SFT}\right) = \mathbb{E}_{y \sim \pi_\phi^{RL}(\cdot \mid x)}\left[\log\frac{\pi_\phi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)}\right]$$

In practice, this KL term is applied by directly subtracting it from the reward. That is, the modified reward is:

$$R(x, y) = r_\theta(x, y) - \beta \log\frac{\pi_\phi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)}$$
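The KL-penalized reward is a one-line computation per sampled response. A sketch with hypothetical log-probabilities and a hypothetical $\beta$ value (the paper tunes $\beta$ per run):

```python
def penalized_reward(r, logp_rl, logp_sft, beta=0.02):
    """Modified reward R = r_theta - beta * log(pi_RL / pi_SFT) for one sample.

    logp_rl / logp_sft: log-probability of the sampled response under the
    current RL policy and the frozen SFT reference. beta is illustrative.
    """
    log_ratio = logp_rl - logp_sft        # log(pi_RL(y|x) / pi_SFT(y|x))
    return r - beta * log_ratio

# If the policy has not drifted, the reward passes through unchanged;
# if it assigns the sample much more probability than the SFT reference,
# the reward is discounted -- discouraging reward hacking.
same  = penalized_reward(1.0, logp_rl=-10.0, logp_sft=-10.0)
drift = penalized_reward(1.0, logp_rl=-5.0,  logp_sft=-10.0)
```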
PPO-ptx: Pre-training Mix
InstructGPT additionally proposed the PPO-ptx variant, which mixes the language modeling objective on the original pre-training data into the RL training as an auxiliary loss:

$$\text{objective}(\phi) = \mathbb{E}_{(x, y) \sim D_{\pi_\phi^{RL}}}\left[r_\theta(x, y) - \beta \log\frac{\pi_\phi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)}\right] + \gamma \, \mathbb{E}_{x \sim D_{\text{pretrain}}}\left[\log \pi_\phi^{RL}(x)\right]$$

Here, $\gamma$ is the weight of the pre-training loss. This term prevents the degradation of the model's general language capabilities during RL training (the "Alignment Tax").
5.4 Remarkable Result: Small Model Beats Large Model
InstructGPT's most remarkable result is that 1.3B parameter InstructGPT was preferred over 175B parameter GPT-3 in human evaluations. A model with more than 100 times fewer parameters generated more useful, more truthful, and more harmless responses.
Key Experimental Results:
- InstructGPT outputs overwhelmingly preferred over GPT-3 outputs in human evaluation
- Similar or slightly lower performance compared to GPT-3 on public NLP benchmarks (Alignment Tax)
- Significant improvement of PPO model over GPT-3 on TruthfulQA
- Approximately 25% reduction in toxicity generation compared to GPT-3
This result showed that training methodology matters more than model size. "Making it bigger" is not the only answer -- "aligning it with human intent" is the key lesson.
5.5 From InstructGPT to ChatGPT
InstructGPT's technology became the core foundation of ChatGPT, released in November 2022. ChatGPT is a model that applied conversational RLHF to GPT-3.5 (an improved version of GPT-3).
ChatGPT's release was a turning point in AI history. Reaching 1 million users in 5 days and 100 million users in 2 months, it ushered in an era where AI directly reached the general public. Without InstructGPT's technical contributions, this revolution would have been impossible.
6. GPT-4 (2023): Multimodal and Predictable Scaling
6.1 Paper Overview
Paper: "GPT-4 Technical Report"
Authors: OpenAI
Released: March 2023 (arXiv: 2303.08774)
The GPT-4 Technical Report is fundamentally different from previous GPT papers. Most key information including architecture, model size, training data, and training costs is undisclosed. OpenAI cited "competitive landscape and safety considerations" as reasons for not disclosing this information. This was widely criticized for the disconnect with the "Open" in OpenAI.
Nevertheless, the paper contains several important technical contributions.
6.2 Multimodal Input
The most notable new capability of GPT-4 is that it can accept both images and text as input simultaneously. Output is still limited to text only.
Examples of Multimodal Capabilities:
- Recognition and interpretation of text within images
- Data analysis of charts and graphs
- Description of humor images and interpretation of their humor
- Interpretation of scientific diagrams and solving related problems
This multimodal capability later evolved into GPT-4V (Vision) and was applied to actual services.
6.3 Predictable Scaling
The most important technical contribution of the GPT-4 paper is the Predictable Scaling methodology.
The core idea is that the performance of a large model can be accurately predicted from the performance of small models. OpenAI measured the performance of smaller models trained with the same methodology as GPT-4, predicted GPT-4's final performance from this, and compared it with actual training results.
Loss Prediction: From the training of models using 1,000x to 10,000x less compute, GPT-4's final loss was predicted using a Power Law. The actual training result was very close to the prediction.
HumanEval Coding Performance Prediction: The pass rate on a coding benchmark could also be predicted from smaller model results. This suggests that not only loss but specific task performance is predictable.
The practical value of this Predictable Scaling methodology is immense. Before committing to large-scale model training costing tens of millions to hundreds of millions of dollars, small-scale experiments can predict the final performance to evaluate return on investment in advance.
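The extrapolation behind Predictable Scaling can be sketched concretely: fit a power law (a straight line in log-log space) to the final losses of small runs, then evaluate it at the target compute. All numbers below are synthetic stand-ins, not OpenAI's data; the irreducible-loss constant is assumed known for simplicity:

```python
import numpy as np

# Hypothetical (compute, final loss) pairs from four small training runs,
# generated from a synthetic power law with an irreducible loss floor
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss = 2.57 * compute ** -0.048 + 1.69

# Fit log(L - L_inf) vs log(C) as a straight line, then extrapolate
L_inf = 1.69                                   # assumed irreducible loss
slope, intercept = np.polyfit(np.log(compute), np.log(loss - L_inf), 1)

big_compute = 1e25                             # target run, 10,000x larger
predicted = np.exp(intercept) * big_compute ** slope + L_inf
```

Because the synthetic data lies exactly on a power law, the fit recovers it; the GPT-4 report's claim is that real training runs are regular enough for the same extrapolation to land very close.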
However, the paper acknowledged that phenomena such as inverse scaling and sudden emergent abilities are hard to predict. In particular, emergent abilities -- where specific capabilities suddenly appear at a certain scale -- are a major exception to Predictable Scaling.
6.4 Professional Exam Performance
GPT-4 demonstrated impressive performance on various professional exams designed for humans. The model received no specific training for these exams.
| Exam | GPT-4 Score/Percentile | GPT-3.5 Score/Percentile | Note |
|---|---|---|---|
| Uniform Bar Exam (MBE+MEE+MPT) | ~298/400 (top 10%) | ~213/400 (bottom 10%) | US Bar Exam |
| LSAT | 163 (top 12%) | 149 (bottom 40%) | Law School Admission |
| SAT Evidence-Based R&W | 710/800 (93rd) | 670/800 (87th) | US College Admission |
| SAT Math | 700/800 (89th) | 590/800 (70th) | US College Admission |
| GRE Quantitative | 163/170 (80th) | 157/170 (62nd) | Graduate Admission |
| GRE Verbal | 169/170 (99th) | 154/170 (63rd) | Graduate Admission |
| AP Biology | 5 (85~100th) | 4 (62~85th) | AP Biology |
| AP Chemistry | 4 (71~88th) | 2 (22~46th) | AP Chemistry |
| AP Calculus BC | 4 (43~59th) | 1 (0~7th) | AP Calculus |
| AP English Literature | 2 (8~22nd) | 2 (8~22nd) | AP English Literature |
Notable patterns:
- Dramatic performance improvement over GPT-3.5 in law, science, and mathematics (Bar Exam: bottom 10% to top 10%)
- Relatively weak performance in language/literature (AP English Literature: bottom 22%)
- Mathematical reasoning improved but still not top-tier (AP Calculus BC: 43~59th percentile)
6.5 Safety and Alignment Improvements
GPT-4 was significantly improved in safety compared to GPT-3.5.
RLHF-based Safety Training:
- Introduced additional safety reward signals in the training process
- Used GPT-4 Zero-shot Classifier to judge safety boundaries and response styles
- Applied safety rewards to both allowed/disallowed categories to prevent over-refusal of valid requests
Quantitative Improvements:
- 82% reduction in response rate to disallowed content requests compared to GPT-3.5
- 29% improvement in policy compliance for sensitive requests (medical advice, self-harm, etc.)
- 40% higher score on internal adversarial factuality evaluation compared to GPT-3.5
- Improvement from approximately 60% to 80% on TruthfulQA after RLHF
Expert Red-teaming:
- Over 50 domain experts (AI safety, cybersecurity, biological risks, international security, etc.) participated in adversarial testing
- Evaluation of high-risk scenarios (autonomous replication, chemical/biological weapons information, etc.)
6.6 GPT-4's Limitations
The limitations explicitly acknowledged in the paper are:
- Hallucination: Can still "confidently" generate factually incorrect information. Greatly improved by RLHF but not fully resolved.
- Context Window Limitation: Limited to 8K/32K tokens at training time, limiting very long document processing.
- Training Data Cutoff: Does not know information after the training data cutoff (trained on data up to September 2021).
- Incomplete Reasoning: Can make mistakes in complex multi-step reasoning, especially in mathematical proofs and subtle code bugs.
- Bias and Calibration: Social biases have not been fully removed, and the model's confidence does not necessarily match actual accuracy.
7. In-depth Analysis of Scaling Laws
7.1 Kaplan Scaling Laws (2020)
"Scaling Laws for Neural Language Models" published by Jared Kaplan and others at OpenAI contemporaneously with GPT-3 provided the theoretical foundation for large language model research.
Key Finding -- Power Law Relationships:
The cross-entropy loss $L$ of a language model has a Power Law relationship with the number of model parameters $N$, dataset size $D$, and compute $C$ used for training:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$

with fitted exponents of roughly $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, and $\alpha_C \approx 0.050$. These relationships hold over more than 7 orders of magnitude and show very stable trend lines.
Compute-optimal Allocation (Kaplan Version):
To minimize loss with a fixed compute budget , the conclusion was that it is optimal to increase model size while using relatively less data. Specifically, when compute increases 10x, it is most efficient to increase model size by 5.5x and data by only 1.8x.
This result led to the interpretation that "increasing model size is more efficient than increasing data," and served as justification for GPT-3's 175B parameter scale.
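The 5.5x/1.8x split can be checked with two lines of arithmetic. The exponents below are the ones implied by those growth factors (Kaplan et al. report $N_{opt} \propto C^{0.73}$, which is close):

```python
import math

# Exponents implied by the "5.5x model / 1.8x data per 10x compute" rule;
# Kaplan et al. report N_opt ∝ C^0.73, D_opt ∝ C^0.27
a_N, a_D = 0.74, 0.26

growth = 10.0                 # compute budget grows 10x
n_scale = growth ** a_N       # ~5.5x more parameters
d_scale = growth ** a_D       # ~1.8x more data

# Since C scales roughly as N * D, the two factors multiply back to 10x
assert math.isclose(n_scale * d_scale, growth)
```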
7.2 Chinchilla Scaling Laws (2022)
An important correction to Kaplan's Scaling Laws was presented in DeepMind's 2022 "Training Compute-Optimal Large Language Models" (known as the Chinchilla paper).
Key Finding: Existing models are under-trained.
Unlike Kaplan's conclusion, the Chinchilla paper argued that model size and training data should be increased at nearly equal rates. Specifically, approximately 20 training tokens per parameter is compute-optimal.
By this criterion, GPT-3 (175B parameters, 300B tokens) was data-starved. Compute-optimal training would have required approximately 3.5T (3.5 trillion) tokens.
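The 3.5T figure follows directly from the ~20 tokens-per-parameter rule of thumb:

```python
def chinchilla_optimal_tokens(n_params):
    """Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return 20 * n_params

gpt3_params = 175e9
optimal = chinchilla_optimal_tokens(gpt3_params)   # 3.5e12 tokens, i.e. 3.5T
```

Against this target, GPT-3's actual 300B training tokens cover less than a tenth of the compute-optimal amount.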
Chinchilla vs. GPT-3:
| Item | GPT-3 | Chinchilla |
|---|---|---|
| Parameters | 175B | 70B |
| Training Tokens | 300B | 1.4T |
| Token/Parameter Ratio | 1.7 | 20 |
| MMLU Performance (5-shot) | 43.9% | 67.5% |
| Compute | ~3,640 PF-days | ~5,200 PF-days |
Chinchilla is a 2.5x smaller model than GPT-3 but achieved higher performance by training on 4.7x more data. This result fundamentally influenced the direction of subsequent large-scale model training.
7.3 Impact of Scaling Laws on GPT-4
GPT-4's Predictable Scaling is a direct application of this Scaling Laws research. If the loss of small models follows a Power Law, then the trend line can be extrapolated to predict the loss of large models.
What the GPT-4 paper showed is that this prediction is surprisingly accurate. This suggests that Scaling Laws are not merely empirical observations but reflect deep structural properties of the language model training process.
However, there are important limitations to this predictability:
- Loss is not equal to Capability: Reduction in overall loss may not directly translate to improvement in specific abilities
- Emergent Abilities: Abilities that suddenly appear at a certain scale are difficult to predict with Power Laws
- Inverse Scaling: In some tasks, performance decreases as the model grows larger
- Task-specific Variability: Scaling efficiency varies significantly across tasks
8. Overall Architecture Comparison
8.1 Generation-by-Generation Architecture Comparison Table
| Item | GPT-1 | GPT-2 (XL) | GPT-3 (175B) | InstructGPT | GPT-4 |
|---|---|---|---|---|---|
| Release Date | 2018.06 | 2019.02 | 2020.05 | 2022.03 | 2023.03 |
| Parameters | 117M | 1,542M | 175,000M | 1,300M~175,000M | Undisclosed |
| Layers | 12 | 48 | 96 | 96 (175B basis) | Undisclosed |
| Hidden Dim | 768 | 1,600 | 12,288 | 12,288 (175B basis) | Undisclosed |
| Attention Heads | 12 | 25 | 96 | 96 (175B basis) | Undisclosed |
| Head Dimension | 64 | 64 | 128 | 128 (175B basis) | Undisclosed |
| Context Window | 512 | 1,024 | 2,048 | 2,048 | 8,192 / 32,768 |
| Vocabulary Size | 40,000 | 50,257 | 50,257 | 50,257 | ~100,000 (est.) |
| Training Data | BooksCorpus (5GB) | WebText (40GB) | Mixed (570GB) | GPT-3 + Human Feedback | Undisclosed |
| Training Tokens | ~1B (est.) | ~10B (est.) | 300B | 300B + RLHF | Undisclosed |
| Tokenization | BPE (40K merges) | Byte-level BPE | Byte-level BPE | Byte-level BPE | Undisclosed |
| Positional Enc. | Learned | Learned | Learned | Learned | Undisclosed |
| Activation | GELU | GELU | GELU | GELU | Undisclosed |
| LayerNorm | Post-norm | Pre-norm | Pre-norm | Pre-norm | Undisclosed |
| Training Method | LM + Fine-tuning | LM only | LM only | LM + SFT + RLHF | LM + SFT + RLHF |
| Multimodal | No | No | No | No | Yes (Image Input) |
| Sparse Attention | No | No | Yes (partial) | Yes (partial) | Undisclosed |
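The parameter counts in the table above can be roughly reproduced from the architectural hyperparameters. A back-of-the-envelope sketch (ignoring biases and LayerNorm parameters, which contribute well under 1%; small discrepancies from the official counts come from these omissions and from exact vocabulary sizes):

```python
def approx_gpt_params(n_layers: int, d_model: int, vocab_size: int, n_ctx: int) -> int:
    """Rough dense-Transformer count: 12*d^2 weights per block
    (4*d^2 attention + 8*d^2 MLP) plus token and learned positional embeddings."""
    per_block = 12 * d_model ** 2
    embeddings = (vocab_size + n_ctx) * d_model
    return n_layers * per_block + embeddings

# Yields ~116M for GPT-1 (official: 117M), ~1.56B for GPT-2 XL, ~174.6B for GPT-3
for name, cfg in [("GPT-1",    (12, 768,   40000, 512)),
                  ("GPT-2 XL", (48, 1600,  50257, 1024)),
                  ("GPT-3",    (96, 12288, 50257, 2048))]:
    print(f"{name}: {approx_gpt_params(*cfg) / 1e6:,.0f}M")
```

Note how the 12·L·d² block term dominates at scale: for GPT-3, the embeddings account for well under 1% of the total.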
8.2 Evolution of Paradigms
More important than the architecture itself is the evolution of paradigms.
GPT-1: Pre-train -> Fine-tune (fine-tuning required for each task)
|
GPT-2: Pre-train -> Zero-shot (direct use without fine-tuning)
|
GPT-3: Pre-train -> In-context Learning (task performance with examples only)
|
InstructGPT: Pre-train -> SFT -> RLHF (alignment with human feedback)
|
GPT-4: Pre-train -> SFT -> RLHF + Multimodal (multimodal + enhanced safety)
The consistent direction of this evolution is reducing user intervention. GPT-1 required training data and fine-tuning for each task, but by GPT-4, nearly all tasks can be performed with natural language instructions alone.
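The paradigm shift is easiest to see at the prompt level. A minimal sketch of the same sentiment task under the GPT-2-style zero-shot and GPT-3-style few-shot (in-context learning) usage patterns (the example reviews are invented for illustration):

```python
# Zero-shot (GPT-2 style): the task is stated directly, with no examples
# and no fine-tuning; the model must infer the task from the instruction alone.
zero_shot = "Classify the sentiment of: 'The movie was fantastic.' Sentiment:"

# Few-shot (GPT-3 style in-context learning): demonstrations are placed in
# the prompt; the model's weights are never updated.
few_shot = (
    "Review: 'Terrible plot.' Sentiment: negative\n"
    "Review: 'Loved every minute.' Sentiment: positive\n"
    "Review: 'The movie was fantastic.' Sentiment:"
)
print(few_shot)
```

InstructGPT and GPT-4 then make even the few-shot scaffolding optional: a bare natural-language instruction suffices, because instruction-following was trained in via SFT and RLHF.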
9. GPT's Impact: Transformation of the AI Ecosystem
9.1 ChatGPT and AI Democratization
The most direct impact of the GPT series is the democratization of AI through ChatGPT.
ChatGPT Growth Metrics:
- Released November 30, 2022
- 1 million users in 5 days
- 100 million users in 2 months (fastest record ever, surpassing TikTok's 9 months)
- Approximately 300 million weekly active users by the end of 2024
ChatGPT transformed the concept of "AI" from an exclusive domain of researchers and developers to an everyday tool for the general public. This transformation would have been impossible without InstructGPT's RLHF technology.
9.2 API Economy and AI-native Services
GPT-3's API release (June 2020) marked the beginning of the AI API Economy.
New Business Models:
- Wrapper Services: Building specialized UX on top of the GPT API (Jasper, Copy.ai, etc.)
- Vertical AI: AI solutions optimized for specific domains (Harvey for Law, Hippocratic AI for Healthcare)
- AI-augmented SaaS: Integrating AI features into existing SaaS (Notion AI, GitHub Copilot, etc.)
- Agent Frameworks: Autonomous agents using GPT as the core reasoning engine (AutoGPT, LangChain, etc.)
9.3 Academic Impact
The GPT series also had a fundamental impact on the direction of academic research.
Birth of New Research Fields:
- Prompt Engineering: Research on prompt design to maximize the effectiveness of in-context learning
- Alignment Research: Various alignment techniques beyond RLHF (DPO, ORPO, Constitutional AI, etc.)
- Mechanistic Interpretability: Research to understand the internal workings of large models
- Scaling Laws: Quantitative analysis of the relationship between model performance and resources
- Evaluation: Recognizing the limitations of existing benchmarks and developing new evaluation methodologies
Changes in Research Methodology:
- Shift in research focus from "model architecture innovation" to "data, training methods, alignment"
- Growing gap between academic and industrial research due to increasing compute requirements
- Partial recovery of academic accessibility through open-source models (LLaMA, Mistral, etc.)
9.4 Impact on Industry and Society
- Education: AI tutors, automated grading, personalized learning content generation
- Healthcare: Medical document writing assistance, diagnostic support, drug interaction analysis
- Law: Case search, contract analysis, legal advice drafting
- Software Development: Code generation, debugging, documentation (GitHub Copilot)
- Content Creation: Writing assistance, translation, summarization, idea generation
10. Limitations and Criticisms
10.1 Hallucination
The most serious limitation of the GPT series is the Hallucination problem -- confidently generating information that is factually incorrect.
Types of Hallucination:
- Factual Errors: Non-existent citations, incorrect statistics, fabricated historical facts
- Logical Leaps: Jumping from premises to conclusions without valid reasoning
- Self-contradiction: Making contradictory claims within the same conversation
Root Causes:
- Auto-regressive models simply generate "plausible next tokens" without verifying factual accuracy
- Training data contains errors, and the model cannot distinguish them
- RLHF may encourage confident errors by rewarding "speaking confidently"
GPT-4 reduced hallucination by approximately 40% compared to GPT-3.5 through RLHF, but complete resolution remains elusive. This is one of the most active research areas in current LLM research.
10.2 Bias
Large language models reflect and sometimes amplify social biases inherent in their training data.
Types of Bias:
- Gender Bias: Reflection of stereotypes in occupations, personality traits, etc.
- Racial/Ethnic Bias: Negative associations with specific races
- Cultural Bias: English-speaking, particularly US-centric worldview
- Socioeconomic Bias: Overrepresentation of certain class perspectives
The GPT-3 paper explicitly acknowledged this and included bias analysis related to Gender, Race, and Religion. InstructGPT and GPT-4 attempted to reduce bias through RLHF, but completely eliminating bias inherent in training data remains a fundamentally challenging problem.
10.3 Environmental Cost
The environmental cost of large-scale model training is becoming an increasingly significant concern.
Estimated Training Carbon Emissions:
- GPT-3: Approximately 552 tons CO2e (equivalent to the annual emissions of about 120 average US cars)
- GPT-4: Estimated at approximately 15,000 tons CO2e (unofficial estimate, about 27x GPT-3)
Water Consumption:
- Microsoft reportedly used approximately 700,000 liters of freshwater for data center cooling during GPT-3 training
Criticism and Counterarguments:
- While the cost of a single training run is large, the trained model is used by hundreds of millions, so the per-person cost is negligible
- Model efficiency improvements (Distillation, Quantization, Pruning) and hardware advances are reducing costs
- However, concerns about Jevons Paradox (where efficiency improvements actually increase total consumption) also exist
10.4 Transparency and Reproducibility
One of the most persistent criticisms of the GPT series is lack of transparency.
- GPT-1: Paper, code, and model released (relatively open)
- GPT-2: Paper released, model released in stages ("too dangerous" controversy)
- GPT-3: Paper released, model accessible only via API
- GPT-4: Architecture, data, training cost, and other key information undisclosed
This trend sits increasingly at odds with the organization's name, "Open" AI, and has seriously undermined academic reproducibility. In response, the importance of open models such as Meta's LLaMA and Mistral AI's Mistral/Mixtral has become more prominent.
10.5 Economic Inequality and Compute Divide
The concentration of resources needed for large-scale model training exacerbates economic inequality in AI research.
- GPT-3 training cost: Approximately $4.6 million (estimated)
- GPT-4 training cost: Over approximately $100 million (estimated)
- Investments of this scale are possible only for a few large corporations, structurally excluding universities and small research labs
11. Summary: The Legacy of GPT
The key insights running through the five papers of the GPT series can be summarized as follows:
1. Scale is (almost) all you need
The scaling from GPT-1 (117M) to GPT-2 (1.5B) to GPT-3 (175B) was not simply "the same thing but bigger": zero-shot transfer, in-context learning, and complex reasoning are emergent abilities that appear only at sufficient scale.
2. Alignment changes everything
InstructGPT showed that training methodology can matter more than model size. The 1.3B InstructGPT beating 175B GPT-3 demonstrated that there is a large gap between raw capability and usefulness, and RLHF can bridge that gap.
3. The bitter lesson revisited
Rich Sutton's "The Bitter Lesson" -- general methods + more compute beat specialized methods -- was repeatedly confirmed in the GPT series. General-purpose Transformer + large-scale pre-training was overwhelmingly more effective than task-specific architectures.
4. Data is the new bottleneck
After Chinchilla's lesson, the quantity and quality of training data emerged as a key bottleneck alongside model size. High-quality text on the internet is finite, and Synthetic Data generation is emerging as a new research direction.
5. Safety is not optional
From GPT-2's "too dangerous to release" controversy to GPT-4's red-teaming, safety has become mandatory, not optional. As AI models become more powerful, the importance of safe and responsible development grows proportionally.
The GPT series is not yet over. What capabilities GPT-5 and beyond will show remains unknown, but one thing is certain: the paradigm of "large-scale pre-training + human feedback alignment" established by the GPT series has become the foundation of modern AI, and understanding it is essential for understanding the future of AI.
12. References
GPT-1: Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). "Improving Language Understanding by Generative Pre-Training." OpenAI Paper
GPT-2: Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). "Language Models are Unsupervised Multitask Learners." OpenAI Paper
GPT-3: Brown, T. B., Mann, B., Ryder, N., et al. (2020). "Language Models are Few-Shot Learners." NeurIPS 2020. arXiv:2005.14165
InstructGPT: Ouyang, L., Wu, J., Jiang, X., et al. (2022). "Training Language Models to Follow Instructions with Human Feedback." NeurIPS 2022. arXiv:2203.02155
GPT-4: OpenAI. (2023). "GPT-4 Technical Report." arXiv:2303.08774
Scaling Laws: Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361
Chinchilla: Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). "Training Compute-Optimal Large Language Models." arXiv:2203.15556
Transformer: Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need." NeurIPS 2017. arXiv:1706.03762
PPO: Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). "Proximal Policy Optimization Algorithms." arXiv:1707.06347
RLHF: Christiano, P. F., Leike, J., Brown, T., et al. (2017). "Deep Reinforcement Learning from Human Preferences." NeurIPS 2017. arXiv:1706.03741
Sparse Transformer: Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). "Generating Long Sequences with Sparse Transformers." arXiv:1904.10509
BPE: Sennrich, R., Haddow, B., & Birch, A. (2016). "Neural Machine Translation of Rare Words with Subword Units." ACL 2016. arXiv:1508.07909
Carbon Footprint: Patterson, D., Gonzalez, J., Le, Q., et al. (2021). "Carbon Emissions and Large Neural Network Training." arXiv:2104.10350
Related Series and Recommended Posts
- Build Your Own GPT -- Training from Scratch with nanoGPT -- Code a GPT yourself
- Complete Math Guide for AI -- Math needed to understand Transformers
- Attention Is All You Need Analysis -- The original Transformer paper
- BERT Analysis -- The Encoder model rivaling GPT
- RWKV: Reinventing RNNs -- Alternative architecture to Transformers
- vLLM Inference Optimization -- Serving GPT models
- LLM Quantization GPTQ/AWQ/GGUF -- Making large models lightweight
GitHub
- nanoGPT -- Andrej Karpathy
- ai-model-analysis -- Code-level analysis collection from this blog