Complete Analysis of the GPT Series Papers: The Journey from GPT-1 to GPT-4, How Language Models Changed the World


1. GPT Series Overview and Timeline

GPT (Generative Pre-trained Transformer) is a series of Large Language Models (LLMs) published by OpenAI since 2018. True to its name "Generative Pre-trained Transformer," it established the paradigm of performing unsupervised pre-training on large-scale text data based on the Transformer Decoder architecture, then applying it to various downstream tasks.

The GPT series did not simply grow in model size -- each generation redefined how language models are utilized. The journey in chronological order is as follows:

| Generation | Release | Paper Title | Key Keywords | Parameters |
| --- | --- | --- | --- | --- |
| GPT-1 | 2018.06 | Improving Language Understanding by Generative Pre-Training | Unsupervised Pre-training + Supervised Fine-tuning | 117M |
| GPT-2 | 2019.02 | Language Models are Unsupervised Multitask Learners | Zero-shot Transfer, WebText | 1.5B |
| GPT-3 | 2020.05 | Language Models are Few-Shot Learners | In-context Learning, Scaling Laws | 175B |
| InstructGPT | 2022.03 | Training Language Models to Follow Instructions with Human Feedback | RLHF, Human Alignment | 1.3B~175B |
| GPT-4 | 2023.03 | GPT-4 Technical Report | Multimodal, Predictable Scaling | Undisclosed |

It is noteworthy that each generation's paper title carries its core message. GPT-1 declared "improving language understanding through generative pre-training," GPT-2 claimed "language models are unsupervised multitask learners," and GPT-3 went a step further with "language models are few-shot learners." InstructGPT presented a practical direction of "training to follow instructions with human feedback," and GPT-4 was simply published as a "technical report," hinting at its commercial transition.

In this article, we analyze each paper's key contributions, architecture details, training methodology, and impact on subsequent research, together with equations.


2. GPT-1 (2018): The Beginning of Generative Pre-Training

2.1 Paper Overview

Paper: "Improving Language Understanding by Generative Pre-Training"
Authors: Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever (OpenAI)
Released: June 2018

The core idea of GPT-1 is surprisingly simple. Pre-train a language model on large-scale unlabeled text, then fine-tune it on a specific task with a small amount of labeled data. This two-stage approach (Semi-supervised Learning) transformed the NLP landscape at the time.

In 2018, NLP was dominated by task-specific architectures. The standard practice was to design a separate model for each task (sentiment analysis, question answering, textual entailment) and train it only on that task's labeled data. To this paradigm, GPT-1 proposed a new path: general-purpose pre-training.

2.2 Architecture Details

GPT-1 adopted an architecture using only the Decoder blocks of the Transformer. While the original Transformer (Vaswani et al., 2017) had an Encoder-Decoder structure, GPT-1 chose a Decoder-only structure suitable for auto-regressive language modeling.

Model Configuration:

  • Number of Layers: 12 Transformer Decoder blocks
  • Hidden Dimension: 768
  • Number of Attention Heads: 12 (64 dimensions each)
  • Feed-Forward Dimension: 3,072 ($768 \times 4$)
  • Context Window: 512 tokens
  • Total Parameters: Approximately 117M (117 million)
  • Activation Function: GELU (Gaussian Error Linear Unit)
  • Positional Encoding: Learned Positional Embedding

Instead of the fixed Sinusoidal Positional Encoding used in the original Transformer, GPT-1 adopted learned positional embeddings. This allowed the model to learn positional information directly from data, enabling more flexible adaptation to various tasks.

2.3 Stage 1: Unsupervised Pre-training

In the pre-training stage, the standard language modeling objective is optimized over a large-scale unlabeled text corpus $\mathcal{U} = \{u_1, u_2, ..., u_n\}$.

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, ..., u_{i-1}; \Theta)$$

Here, $k$ is the context window size and $\Theta$ represents the model parameters. This is a typical auto-regressive language modeling objective that maximizes the probability of the next token given the previous $k$ tokens.

Specifically, each token's representation is computed as follows:

$$h_0 = U W_e + W_p$$
$$h_l = \text{transformer\_block}(h_{l-1}), \quad l \in [1, n]$$
$$P(u) = \text{softmax}(h_n W_e^T)$$

Here, $U = (u_{-k}, ..., u_{-1})$ is the context token vector, $W_e$ is the token embedding matrix, and $W_p$ is the position embedding matrix. Output probabilities are computed by reusing the token embedding matrix $W_e$ (Weight Tying).
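The computation above can be sketched in a few lines of NumPy. This is a minimal illustration with made-up toy dimensions and a stubbed transformer block, not the paper's actual 12-layer, 117M-parameter implementation; it only shows the embedding lookup, the learned positional addition, and the weight-tied softmax output.

```python
import numpy as np

# Toy sketch of the GPT-1 forward pass (hypothetical sizes; the real model
# uses 12 masked self-attention + feed-forward blocks over 768 dimensions).
rng = np.random.default_rng(0)
vocab_size, d_model, n_ctx, n_layers = 100, 16, 8, 12

W_e = rng.normal(0.0, 0.02, (vocab_size, d_model))  # token embedding matrix
W_p = rng.normal(0.0, 0.02, (n_ctx, d_model))       # learned positional embeddings

def transformer_block(h):
    return h  # placeholder for masked self-attention + feed-forward

def next_token_probs(context_ids):
    # h_0 = U W_e + W_p: embedding lookup plus positional embeddings
    h = W_e[context_ids] + W_p[: len(context_ids)]
    for _ in range(n_layers):            # h_l = transformer_block(h_{l-1})
        h = transformer_block(h)
    logits = h[-1] @ W_e.T               # weight tying: reuse W_e for the output
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()
```

Calling `next_token_probs(np.arange(8))` returns a probability distribution over the toy vocabulary for the next token.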

Training Data: The BooksCorpus dataset was used, consisting of approximately 7,000 unpublished books containing about 5GB of text. The abundance of long-form text made it suitable for learning long-range dependencies.

Tokenization: BPE (Byte Pair Encoding) was used with 40,000 merges to construct the vocabulary.

Optimization: The Adam optimizer was used with a learning rate that increased linearly from 0 to $2.5 \times 10^{-4}$ over the first 2,000 updates (linear warmup), then decayed following a cosine schedule. The batch size was 64, and training ran for 100 epochs.
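The schedule can be sketched as follows. `total_steps` is a hypothetical value chosen for illustration; the paper specifies the warmup length and the schedule shape, not this exact step count.

```python
import math

def gpt1_lr(step, max_lr=2.5e-4, warmup_steps=2000, total_steps=100_000):
    """Linear warmup from 0 to max_lr, then cosine annealing down to 0."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps          # linear warmup phase
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
```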

2.4 Stage 2: Supervised Fine-tuning

To apply the pre-trained model to a specific task, it is fine-tuned with labeled data $\mathcal{C}$. Given an input token sequence $x_1, ..., x_m$ with corresponding label $y$, the following objective is optimized:

$$L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y \mid x_1, ..., x_m)$$

Here, $P(y \mid x_1, ..., x_m) = \text{softmax}(h_l^m W_y)$, where $h_l^m$ is the final Transformer block's output at the last token, and $W_y$ is the weight of the task-specific linear head.

Key Technique -- Auxiliary Language Modeling Objective: GPT-1 also used the original language modeling objective as an auxiliary loss during fine-tuning. This had the effect of improving generalization performance and accelerating convergence.

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$

Here, $\lambda$ is the weight of the auxiliary loss; the paper used $\lambda = 0.5$.
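The combined objective reduces to a weighted sum of two negative log-likelihoods. A minimal sketch, with a hypothetical helper name and inputs given as per-example log-probabilities:

```python
def fine_tune_loss(task_log_probs, lm_log_probs, lam=0.5):
    """Combined objective L3 = L2 + λ·L1, written as losses (lower is better).

    task_log_probs: log P(y | x_1..x_m) for each labeled example (L2 term)
    lm_log_probs:   next-token log-probs on the same text (auxiliary L1 term)
    """
    L2 = -sum(task_log_probs)   # supervised task loss
    L1 = -sum(lm_log_probs)     # auxiliary language modeling loss
    return L2 + lam * L1
```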

2.5 Task-specific Input Transformation

Another important contribution of GPT-1 was presenting input transformation techniques to handle various tasks with a single Transformer architecture. Without changing the architecture itself, it adapted to multiple tasks by only changing the input format.

  • Text Classification: Input as [Start] text [Extract] and apply a Linear Layer to the last token's output
  • Textual Entailment: Connect two sentences as [Start] premise [Delimiter] hypothesis [Extract]
  • Semantic Similarity: Reverse the order of two sentences to create two inputs, and element-wise add their outputs
  • Multiple Choice: Individually concatenate each choice with context to create multiple sequences, and normalize with Softmax

This approach was very practical in that it could be applied to various tasks with minimal changes to the model architecture. The only additional parameters were the delimiter token embeddings and the final linear layer weights $W_y$.
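These transformations can be sketched as plain string templates. The token spellings below are hypothetical placeholders; in the paper the start, delimiter, and extract tokens are dedicated learned embeddings, not literal strings.

```python
# Hypothetical literal spellings for the learned special tokens.
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"

def classification_input(text):
    return f"{START} {text} {EXTRACT}"

def entailment_input(premise, hypothesis):
    return f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"

def similarity_inputs(a, b):
    # Both orderings are encoded; their final representations are added element-wise.
    return [entailment_input(a, b), entailment_input(b, a)]

def multiple_choice_inputs(context, choices):
    # One sequence per choice; per-choice scores are then normalized with softmax.
    return [f"{START} {context} {DELIM} {c} {EXTRACT}" for c in choices]
```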

2.6 Experimental Results and Significance

GPT-1 achieved state-of-the-art results on 9 out of 12 NLP benchmarks. In particular, it significantly outperformed existing models in Commonsense Reasoning (86.5% accuracy on Stories Cloze Test), Semantic Similarity (70.3 F1 on QQP), and Question Answering (59.0% accuracy on RACE).

However, the true significance of GPT-1 lies not in individual benchmark performance but in establishing the paradigm of "large-scale unsupervised pre-training + small-scale supervised fine-tuning." This paradigm continued with BERT, RoBERTa, T5, and others, becoming the standard in NLP.


3. GPT-2 (2019): The Possibility of Zero-shot Learning

3.1 Paper Overview

Paper: "Language Models are Unsupervised Multitask Learners"
Authors: Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever (OpenAI)
Released: February 2019

The paper title of GPT-2 carries a bold claim: "Language models are unsupervised multitask learners." That is, despite being trained with a single objective of language modeling, the model can perform multiple tasks without separate fine-tuning.

While GPT-1 required two stages of "pre-training then fine-tuning," GPT-2 demonstrated that tasks can be performed zero-shot without fine-tuning. This was a fundamental paradigm shift.

3.2 Core Idea: Task as Language Modeling

The core insight of GPT-2 is that all NLP tasks can be reformulated as conditional language modeling.

Traditional supervised learning learns the conditional probability $P(\text{output} \mid \text{input})$. GPT-2 extends this to the form $P(\text{output} \mid \text{input}, \text{task})$, providing task information in natural language.

For example:

  • Translation: Expressing sequences of the form (translate to french, english text, french text) as natural text
  • Summarization: Appending TL;DR: after text to elicit a summary
  • Question Answering: Providing context and questions in natural language to generate answers

The key to this idea is that if a sufficiently large language model learns sufficiently diverse text, task performance capabilities naturally emerge.

3.3 Architecture Details

GPT-2 is based on the GPT-1 architecture with several important modifications.

Key Changes:

  • Layer Normalization Position Change: Moved to the input side of each sub-block (Pre-norm)
  • Additional Layer Normalization: Added after the final Self-attention block
  • Residual Weight Initialization: Residual path weights scaled by $1/\sqrt{N}$ ($N$ is the number of residual layers)
  • Context Window Expansion: 512 to 1,024 tokens
  • Vocabulary Size Expansion: 40,000 to 50,257 (Byte-level BPE)
  • Batch Size Expansion: 64 to 512

GPT-2 trained four model sizes:

| Model | Parameters | Layers | Hidden Dim | Heads | Head Dim |
| --- | --- | --- | --- | --- | --- |
| Small | 117M | 12 | 768 | 12 | 64 |
| Medium | 345M | 24 | 1,024 | 16 | 64 |
| Large | 762M | 36 | 1,280 | 20 | 64 |
| XL | 1,542M | 48 | 1,600 | 25 | 64 |

The head dimension is fixed at 64 across all models, and the feed-forward dimension is always 4 times the hidden dimension ($d_{ff} = 4 \times d_{model}$).

3.4 WebText Dataset

Another key contribution of GPT-2 is the WebText training dataset.

Data Construction Method:

  1. Collected external links with 3+ Karma on Reddit (effectively human-vetted quality)
  2. Collected approximately 45 million links
  3. Extracted text from HTML using Dragnet and Newspaper libraries
  4. Deduplication and heuristic-based cleaning

Dataset Characteristics:

  • Approximately 8 million documents
  • Approximately 40GB of text
  • Wikipedia was intentionally excluded (to prevent data leakage with evaluation datasets)

The design philosophy of WebText was "leverage human curation while avoiding explicit labeling costs." The idea of using Reddit's Karma system as a quality filter inspired many subsequent dataset constructions.

3.5 Byte-level BPE

GPT-2 also introduced an important innovation in tokenization. While existing BPE operates at the Unicode character level, GPT-2 applied BPE at the byte level.

Advantages of this approach:

  • Complete Coverage: Since any byte sequence can be encoded, the OOV (Out-of-Vocabulary) problem is fundamentally eliminated
  • Multilingual Support: Various languages and special characters can be processed without separate preprocessing
  • Base Vocabulary Size: 256 (number of bytes) + special tokens

However, since naive byte-level BPE generates many inefficient merges, GPT-2 added rules to prevent merging characters of different categories. The final vocabulary size is 50,257.
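The coverage guarantee is easy to see in code. The sketch below shows only the byte-level base vocabulary; the learned BPE merges that build up the full 50,257-entry vocabulary are omitted.

```python
def to_byte_tokens(text):
    """Base byte-level vocabulary: any string decomposes into 0-255 byte IDs,
    so no input is ever out-of-vocabulary. BPE merges (omitted here) then
    combine frequent byte sequences into larger vocabulary entries."""
    return list(text.encode("utf-8"))
```

`to_byte_tokens("café")` yields five IDs because 'é' occupies two UTF-8 bytes; even emoji and CJK text stay within the 256-ID base range.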

3.6 Zero-shot Performance and Scaling

GPT-2's zero-shot performance consistently improved with model size. This was a precursor to the later Scaling Laws research.

Key Zero-shot Results:

  • Language Modeling: State-of-the-art on 7 out of 8 Language Modeling benchmarks (including domains not in WebText training)
  • Children's Book Test (Named Entity): 93.3% accuracy (+7% over previous SOTA)
  • LAMBADA: Perplexity 8.6 (drastically improved from previous SOTA of 99.8)
  • Reading Comprehension (CoQA): 55.0 F1 (surpassing 3 out of 4 existing models trained with 127,000 examples)
  • Translation (WMT14 Fr-En): 11.5 BLEU zero-shot (outperforming several unsupervised translation baselines)
  • Summarization (CNN/Daily Mail): Elicited with TL;DR prompt, qualitatively meaningful results

3.7 "Too Dangerous to Release" Controversy

GPT-2 received as much attention for its release policy as for its technical achievements. OpenAI initially decided not to release the 1.5B parameter model, releasing only the smallest 117M model. The reason was "the risk of malicious use (fake news, spam, etc.) is significant."

This decision sparked intense debate in the AI community.

Supporting Arguments:

  • Unrestricted release of powerful text generation models could be exploited for mass production of disinformation
  • A precedent for Responsible Disclosure considering societal impact was needed

Critical Arguments:

  • The danger of the 1.5B parameter model was exaggerated
  • It hinders reproducibility in the academic community
  • Suspicions of marketing-driven exaggeration

Eventually, OpenAI released the full model in November 2019, and the feared large-scale misuse did not materialize. However, this debate became an important catalyst for subsequent AI Safety and Responsible AI discussions.


4. GPT-3 (2020): The Power of In-context Learning and Scaling

4.1 Paper Overview

Paper: "Language Models are Few-Shot Learners"
Authors: Tom B. Brown, Benjamin Mann, Nick Ryder, and many others (OpenAI)
Released: May 2020 (NeurIPS 2020)

GPT-3 is a language model of unprecedented scale with 175 billion (175B) parameters. However, the true innovation of GPT-3 is not its size but establishing the new paradigm of In-context Learning. It proved that various tasks can be performed without updating the model weights at all, simply by including a few examples in the prompt.

4.2 In-context Learning Paradigm

The GPT-3 paper systematically compared three evaluation conditions.

Zero-shot: Only a task description provided in natural language

Translate English to French:
cheese =>

One-shot: Task description + 1 example provided

Translate English to French:
sea otter => loutre de mer
cheese =>

Few-shot: Task description + 10-100 examples provided (within the context window limit)

Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe => girafe en peluche
cheese =>

All three conditions involve absolutely no gradient updates. The model performs tasks purely through forward passes. This is the decisive difference from fine-tuning.

The paper's interpretation of why in-context learning works is that during pre-training, the model naturally learns various task patterns, and the examples in the prompt serve to "locate and activate" relevant abilities that already exist within the model.
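The three conditions differ only in how many examples are packed into the context. A small helper (a hypothetical name, mirroring the translation prompts shown above) makes that concrete; the examples are pure context and no gradient update occurs:

```python
def make_prompt(task_description, examples, query):
    """Assemble a zero-/one-/few-shot prompt: 0, 1, or many (src, tgt) pairs."""
    lines = [task_description]
    lines += [f"{src} => {tgt}" for src, tgt in examples]
    lines.append(f"{query} =>")        # the model completes after "=>"
    return "\n".join(lines)

few_shot = make_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
```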

4.3 Architecture Details

GPT-3 uses essentially the same architecture as GPT-2, but inspired by Sparse Transformer (Child et al., 2019), alternates between Dense and Locally Banded Sparse Attention patterns.

GPT-3 trained 8 model sizes to systematically analyze scaling effects.

| Model Name | Parameters | Layers | $d_{model}$ | Heads | $d_{head}$ | Batch Size | Learning Rate |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3 Small | 125M | 12 | 768 | 12 | 64 | 0.5M | $6.0 \times 10^{-4}$ |
| GPT-3 Medium | 350M | 24 | 1,024 | 16 | 64 | 0.5M | $3.0 \times 10^{-4}$ |
| GPT-3 Large | 760M | 24 | 1,536 | 16 | 96 | 0.5M | $2.5 \times 10^{-4}$ |
| GPT-3 XL | 1.3B | 24 | 2,048 | 24 | 128 | 1M | $2.0 \times 10^{-4}$ |
| GPT-3 2.7B | 2.7B | 32 | 2,560 | 32 | 80 | 1M | $1.6 \times 10^{-4}$ |
| GPT-3 6.7B | 6.7B | 32 | 4,096 | 32 | 128 | 2M | $1.2 \times 10^{-4}$ |
| GPT-3 13B | 13.0B | 40 | 5,140 | 40 | 128 | 2M | $1.0 \times 10^{-4}$ |
| GPT-3 175B | 175.0B | 96 | 12,288 | 96 | 128 | 3.2M | $0.6 \times 10^{-4}$ |

All models use a 2,048 token context window and were trained on a total of 300B (300 billion) tokens. A consistent pattern of decreasing learning rate and increasing batch size with larger models was applied.

4.4 Training Data Composition

GPT-3's training data is a mixture of multiple sources, with the notable characteristic of applying differential training weights based on each source's quality.

| Dataset | Tokens (B) | Training Weight | Epochs |
| --- | --- | --- | --- |
| Common Crawl (filtered) | 410 | 60% | 0.44 |
| WebText2 | 19 | 22% | 2.9 |
| Books1 | 12 | 8% | 1.9 |
| Books2 | 55 | 8% | 0.43 |
| Wikipedia | 3 | 3% | 3.4 |

A notable point is that while Common Crawl accounts for most of the tokens, its training weight is limited to 60%. In contrast, the high-quality WebText2 with only 19B tokens is given a high weight of 22%. This reflects the judgment that data quality is more important than quantity.

Common Crawl Filtering Process:

  1. Document filtering based on similarity with high-quality reference corpora (WebText, Books, Wikipedia)
  2. Fuzzy deduplication between documents
  3. Adding reference corpora to the training data for the final composition

4.5 Benchmark Performance

GPT-3 175B's few-shot performance was impressive across various benchmarks.

Language Modeling:

  • PTB (Penn Treebank): 20.50 Perplexity (Zero-shot SOTA)

Question Answering:

  • TriviaQA: 71.2% accuracy (Few-shot, competitive with Fine-tuned SOTA)
  • NaturalQuestions: 29.9% accuracy (Few-shot)
  • WebQuestions: 41.5% accuracy (Few-shot)

Translation:

  • WMT14 En to Fr: 25.2 BLEU (Few-shot)
  • WMT14 Fr to En: 33.9 BLEU (Few-shot)
  • WMT16 En to De: 24.3 BLEU (Few-shot)

SuperGLUE:

  • Achieved 71.8 points with Few-shot (surpassing Fine-tuned BERT-Large at 69.0)
  • However, did not reach Fine-tuned SOTA (90.0 points)

Arithmetic Reasoning:

  • 2-digit addition: 100% accuracy
  • 3-digit addition: 80.4% accuracy
  • 4-5 digit addition: rapid decline

These results demonstrated a clear scaling effect where performance improves with larger model size and more provided examples.

4.6 GPT-3's Recognized Limitations

The paper also candidly described GPT-3's limitations.

  • Text Generation Quality: Issues with repetition, loss of coherence, and illogical statements during long document generation
  • Limitations of Few-shot: Underperforms fine-tuning-based models on natural language inference (NLI) and some reading comprehension tasks
  • Absence of Bidirectional Context: An inherent limitation of auto-regressive models; bidirectional models like BERT retain an advantage on some tasks
  • Sample Efficiency: While humans learn new tasks from one or two examples, GPT-3 requires tens to hundreds of examples
  • Lack of Interpretability: The model's decision-making process is difficult to understand, and the exact mechanism of in-context learning remains unclear


5. InstructGPT / ChatGPT (2022): Aligning with Human Intent

5.1 Paper Overview

Paper: "Training Language Models to Follow Instructions with Human Feedback"
Authors: Long Ouyang, Jeff Wu, Xu Jiang, and many others (OpenAI)
Released: March 2022 (NeurIPS 2022)

Language models up to GPT-3 had a fundamental problem: the training objective of "next token prediction" did not align with the actual use purpose of "following user instructions usefully and safely." No matter how capable a large language model was, it frequently gave irrelevant answers to questions, generated harmful content, or confidently stated inaccurate information.

InstructGPT is a groundbreaking study that solved this Alignment Problem with RLHF (Reinforcement Learning from Human Feedback). And this technology became the foundation of ChatGPT.

5.2 Definition of the Alignment Problem

The paper classified the problems of existing language models into three categories:

  1. Lack of Helpfulness: Not following user instructions and generating irrelevant text
  2. Lack of Truthfulness: Generating factually incorrect information (Hallucination)
  3. Lack of Harmlessness: Generating harmful or biased content

These three combined form the HHH (Helpful, Honest, Harmless) criteria, and InstructGPT aimed to align the model to these criteria using human feedback.

5.3 RLHF 3-Stage Pipeline

InstructGPT's RLHF pipeline consists of three stages.

Step 1: Supervised Fine-Tuning (SFT)

The first stage is traditional supervised learning. Human labelers directly write ideal responses to prompts, and GPT-3 is fine-tuned with this data.

  • Data: Approximately 13,000 (prompt, ideal response) pairs
  • Prompt Sources: Prompts written by labelers + prompts submitted by OpenAI API users
  • Training: 16 epochs, Cosine Learning Rate Decay

The SFT model provides basic instruction-following capability, but it is not yet complete. The next stage learns human preferences.

Step 2: Reward Model (RM) Training

In the second stage, a Reward Model that quantifies human preferences is trained.

Data Collection Process:

  1. Generate $K$ different responses for one prompt using the SFT model ($K$ ranges from 4 to 9)
  2. Human labelers rank the $K$ responses by preference
  3. Generate $\binom{K}{2}$ comparison pairs

Reward Model Loss Function:

$$\text{loss}(\theta) = -\frac{1}{\binom{K}{2}} E_{(x, y_w, y_l) \sim D} \left[ \log \sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right) \right]$$

Here, $r_\theta(x, y)$ is the scalar output of the Reward Model for prompt $x$ and response $y$, $y_w$ is the preferred response, $y_l$ is the non-preferred response, and $\sigma$ is the sigmoid function.

This loss function is based on the Bradley-Terry model, training the reward of the preferred response to be higher than that of the non-preferred response. Efficiency was improved by creating all $\binom{K}{2}$ comparison pairs from a single prompt and computing them in a single forward pass.
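A minimal sketch of this pairwise loss for a single prompt, assuming the $K$ scalar rewards are already computed and supplied in preference order (most preferred first), so every earlier element plays $y_w$ against every later $y_l$:

```python
import numpy as np
from itertools import combinations

def reward_model_loss(rewards_ranked):
    """Mean pairwise Bradley-Terry loss over one prompt's ranked rewards."""
    pairs = list(combinations(rewards_ranked, 2))  # all C(K,2) (r_w, r_l) pairs
    # -log σ(r_w − r_l) = log(1 + exp(−(r_w − r_l))), via the stable log1p form
    losses = [np.log1p(np.exp(-(rw - rl))) for rw, rl in pairs]
    return float(np.mean(losses))
```

A well-separated ranking (rewards decreasing with preference rank) yields a low loss; a reversed ranking yields a high one.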

  • Data Scale: Comparison data collected from approximately 33,000 prompts
  • Model Size: 6B parameters (removing the final unembedding layer from the SFT model and adding a scalar output head)

Step 3: Reinforcement Learning with PPO

In the third stage, the SFT model is optimized using the PPO (Proximal Policy Optimization) algorithm with the trained Reward Model as the reward signal.

PPO Optimization Objective:

$$\text{objective}(\phi) = E_{(x, y) \sim D_{\pi_\phi^{RL}}} \left[ r_\theta(x, y) - \beta \cdot D_{KL}\left(\pi_\phi^{RL}(y \mid x) \,\|\, \pi^{SFT}(y \mid x)\right) \right]$$

Where:

  • $\pi_\phi^{RL}$: The RL policy being trained (language model)
  • $\pi^{SFT}$: Reference policy from the SFT stage
  • $r_\theta(x, y)$: Reward Model output
  • $\beta$: KL penalty coefficient
  • $D_{KL}$: KL divergence

Role of KL Divergence Penalty:

The KL Divergence term prevents the model from straying too far from the SFT model during RL training. Without this constraint, the model can exploit loopholes in the Reward Model to obtain high rewards while actually generating meaningless text -- a phenomenon known as Reward Hacking.

The exact form of the KL Divergence is:

$$D_{KL}\left(\pi_\phi^{RL}(\cdot \mid x) \,\|\, \pi^{SFT}(\cdot \mid x)\right) = \sum_y \pi_\phi^{RL}(y \mid x) \log \frac{\pi_\phi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)}$$

In practice, this KL Divergence is applied by directly subtracting it from the reward. That is, the modified reward is:

$$R(x, y) = r_\theta(x, y) - \beta \cdot \log \frac{\pi_\phi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)}$$
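This shaped reward is a one-line computation once the response's log-probabilities under both models are known. A minimal sketch; the default `beta` is a hypothetical value, since the paper tunes the KL coefficient:

```python
def kl_shaped_reward(r_theta, logp_rl, logp_sft, beta=0.02):
    """Shaped reward R(x, y) = r_θ(x, y) − β·log(π_RL(y|x) / π_SFT(y|x)).

    logp_rl / logp_sft: summed log-probabilities of response y under the
    current policy and the frozen SFT reference, respectively.
    """
    return r_theta - beta * (logp_rl - logp_sft)
```

If the policy assigns y a higher log-probability than the SFT reference (i.e. it has drifted toward y), the reward is docked; if it matches the reference, the shaped reward equals the raw RM score.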

PPO-ptx: Pre-training Mix

InstructGPT additionally proposed the PPO-ptx variant, which mixes the language modeling objective on the original pre-training data as an auxiliary loss during RL training.

$$\text{objective}(\phi) = E_{(x, y) \sim D_{\pi_\phi^{RL}}} \left[ r_\theta(x, y) - \beta \cdot D_{KL}\left(\pi_\phi^{RL} \,\|\, \pi^{SFT}\right) \right] + \gamma \cdot E_{x \sim D_{\text{pretrain}}} \left[ \log \pi_\phi^{RL}(x) \right]$$

Here, $\gamma$ is the weight of the pre-training loss. This term prevents the degradation of the model's general language capabilities during RL training (the "Alignment Tax").

5.4 Remarkable Result: Small Model Beats Large Model

InstructGPT's most remarkable result is that 1.3B parameter InstructGPT was preferred over 175B parameter GPT-3 in human evaluations. A model with more than 100 times fewer parameters generated more useful, more truthful, and more harmless responses.

Key Experimental Results:

  • InstructGPT outputs overwhelmingly preferred over GPT-3 outputs in human evaluation
  • Similar or slightly lower performance compared to GPT-3 on public NLP benchmarks (Alignment Tax)
  • Significant improvement of PPO model over GPT-3 on TruthfulQA
  • Approximately 25% reduction in toxicity generation compared to GPT-3

This result showed that training methodology matters more than model size. "Making it bigger" is not the only answer -- "aligning it with human intent" is the key lesson.

5.5 From InstructGPT to ChatGPT

InstructGPT's technology became the core foundation of ChatGPT, released in November 2022. ChatGPT is a model that applied conversational RLHF to GPT-3.5 (an improved version of GPT-3).

ChatGPT's release was a turning point in AI history. Reaching 1 million users in 5 days and 100 million users in 2 months, it ushered in an era where AI directly reached the general public. Without InstructGPT's technical contributions, this revolution would have been impossible.


6. GPT-4 (2023): Multimodal and Predictable Scaling

6.1 Paper Overview

Paper: "GPT-4 Technical Report"
Authors: OpenAI
Released: March 2023 (arXiv: 2303.08774)

The GPT-4 Technical Report is fundamentally different from previous GPT papers. Most key information including architecture, model size, training data, and training costs is undisclosed. OpenAI cited "competitive landscape and safety considerations" as reasons for not disclosing this information. This was widely criticized for the disconnect with the "Open" in OpenAI.

Nevertheless, the paper contains several important technical contributions.

6.2 Multimodal Input

The most notable new capability of GPT-4 is that it can accept both images and text as input simultaneously. Output is still limited to text only.

Examples of Multimodal Capabilities:

  • Recognition and interpretation of text within images
  • Data analysis of charts and graphs
  • Description of humor images and interpretation of their humor
  • Interpretation of scientific diagrams and solving related problems

This multimodal capability later evolved into GPT-4V (Vision) and was applied to actual services.

6.3 Predictable Scaling

The most important technical contribution of the GPT-4 paper is the Predictable Scaling methodology.

The core idea is that the performance of a large model can be accurately predicted from the performance of small models. OpenAI measured the performance of smaller models trained with the same methodology as GPT-4, predicted GPT-4's final performance from this, and compared it with actual training results.

Loss Prediction: From the training of models using 1,000x to 10,000x less compute, GPT-4's final loss was predicted using a Power Law. The actual training result was very close to the prediction.

HumanEval Coding Performance Prediction: The pass rate on a coding benchmark could also be predicted from smaller model results. This suggests that not only loss but specific task performance is predictable.

The practical value of this Predictable Scaling methodology is immense. Before committing to large-scale model training costing tens of millions to hundreds of millions of dollars, small-scale experiments can predict the final performance to evaluate return on investment in advance.
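The idea can be illustrated with a synthetic power-law fit: fit $L(C) = a \cdot C^{-\alpha}$ on cheap small-compute runs, then extrapolate to a large run. The constants below are made up for illustration and are not OpenAI's actual values.

```python
import numpy as np

# Synthetic, noise-free "small run" losses following a power law.
alpha_true, a_true = 0.05, 4.0
compute_small = np.array([1e3, 1e4, 1e5, 1e6])          # small training runs
loss_small = a_true * compute_small ** (-alpha_true)

# A power law is a straight line in log-log space: log L = log a − α·log C.
slope, intercept = np.polyfit(np.log(compute_small), np.log(loss_small), 1)

def predict_loss(C):
    return np.exp(intercept) * C ** slope

big_pred = predict_loss(1e10)   # 10,000x the compute of the largest fit point
```

On real training runs the fit points are noisy, but the GPT-4 report found the extrapolation nonetheless landed very close to the final measured loss.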

However, the paper acknowledged that phenomena such as inverse scaling and sudden emergent abilities are hard to predict. In particular, emergent abilities -- where specific capabilities suddenly appear at a certain scale -- are a major exception to Predictable Scaling.

6.4 Professional Exam Performance

GPT-4 demonstrated impressive performance on various professional exams designed for humans. The model received no specific training for these exams.

| Exam | GPT-4 Score (Percentile) | GPT-3.5 Score (Percentile) | Note |
| --- | --- | --- | --- |
| Uniform Bar Exam (MBE+MEE+MPT) | ~298/400 (top 10%) | ~213/400 (bottom 10%) | US Bar Exam |
| LSAT | 163 (top 12%) | 149 (bottom 40%) | Law School Admission |
| SAT Evidence-Based Reading & Writing | 710/800 (93rd) | 670/800 (87th) | US College Admission |
| SAT Math | 700/800 (89th) | 590/800 (70th) | US College Admission |
| GRE Quantitative | 163/170 (80th) | 157/170 (62nd) | Graduate Admission |
| GRE Verbal | 169/170 (99th) | 154/170 (63rd) | Graduate Admission |
| AP Biology | 5 (85~100th) | 4 (62~85th) | AP Biology |
| AP Chemistry | 4 (71~88th) | 2 (22~46th) | AP Chemistry |
| AP Calculus BC | 4 (43~59th) | 1 (0~7th) | AP Calculus |
| AP English Literature | 2 (8~22nd) | 2 (8~22nd) | AP English Literature |

Notable patterns:

  • Dramatic performance improvement over GPT-3.5 in law, science, and mathematics (Bar Exam: bottom 10% to top 10%)
  • Relatively weak performance in language/literature (AP English Literature: bottom 22%)
  • Mathematical reasoning improved but still not top-tier (AP Calculus BC: 43~59th percentile)

6.5 Safety and Alignment Improvements

GPT-4 was significantly improved in safety compared to GPT-3.5.

RLHF-based Safety Training:

  • Introduced additional safety reward signals in the training process
  • Used GPT-4 Zero-shot Classifier to judge safety boundaries and response styles
  • Applied safety rewards to both allowed/disallowed categories to prevent over-refusal of valid requests

Quantitative Improvements:

  • 82% reduction in response rate to disallowed content requests compared to GPT-3.5
  • 29% improvement in policy compliance for sensitive requests (medical advice, self-harm, etc.)
  • 40% higher score on internal adversarial factuality evaluation compared to GPT-3.5
  • Improvement from approximately 60% to 80% on TruthfulQA after RLHF

Expert Red-teaming:

  • Over 50 domain experts (AI safety, cybersecurity, biological risks, international security, etc.) participated in adversarial testing
  • Evaluation of high-risk scenarios (autonomous replication, chemical/biological weapons information, etc.)

6.6 GPT-4's Limitations

The limitations explicitly acknowledged in the paper are:

  • Hallucination: Can still "confidently" generate factually incorrect information. Greatly improved by RLHF but not fully resolved.
  • Context Window Limitation: Limited to 8K/32K tokens at training time, limiting very long document processing.
  • Training Data Cutoff: Does not know information after the training data cutoff (trained on data up to September 2021).
  • Incomplete Reasoning: Can make mistakes in complex multi-step reasoning, especially in mathematical proofs and subtle code bugs.
  • Bias and Calibration: Social biases have not been fully removed, and the model's confidence does not necessarily match actual accuracy.

7. In-depth Analysis of Scaling Laws

7.1 Kaplan Scaling Laws (2020)

"Scaling Laws for Neural Language Models," published by Jared Kaplan and others at OpenAI in January 2020, a few months before GPT-3, provided the theoretical foundation for large language model research.

Key Finding -- Power Law Relationships:

The cross-entropy loss $L$ of a language model has a power law relationship with the number of model parameters $N$, dataset size $D$, and compute $C$ used for training.

$$L(N) \propto N^{-\alpha_N}, \quad \alpha_N \approx 0.076$$
$$L(D) \propto D^{-\alpha_D}, \quad \alpha_D \approx 0.095$$
$$L(C) \propto C^{-\alpha_C}, \quad \alpha_C \approx 0.050$$

These relationships hold over more than 7 orders of magnitude and show very stable trend lines.

Compute-optimal Allocation (Kaplan Version):

To minimize loss with a fixed compute budget $C$, the conclusion was that it is optimal to increase model size while using relatively less data. Specifically, when compute increases 10x, it is most efficient to increase model size by about 5.5x and data by only about 1.8x.

$$N_{\text{opt}} \propto C^{0.73}, \quad D_{\text{opt}} \propto C^{0.27}$$

This result led to the interpretation that "increasing model size is more efficient than increasing data," and served as justification for GPT-3's 175B parameter scale.
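The Kaplan allocation rule can be sketched in a few lines; the exponents come from the paper, while the proportionality constants are dropped so that only relative multipliers remain:

```python
# Kaplan-style compute-optimal allocation: N_opt ∝ C^0.73, D_opt ∝ C^0.27.
# Proportionality constants are omitted; we only track relative growth.
def kaplan_allocation(compute_multiplier):
    """Return (params multiplier, data multiplier) when compute grows
    by `compute_multiplier` under the Kaplan exponents."""
    return compute_multiplier ** 0.73, compute_multiplier ** 0.27

n_mult, d_mult = kaplan_allocation(10.0)
print(f"10x compute -> {n_mult:.1f}x params, {d_mult:.1f}x data")
```

Note that the two multipliers always combine back to the compute multiplier, since the exponents sum to 1.0.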

7.2 Chinchilla Scaling Laws (2022)

An important correction to Kaplan's Scaling Laws was presented in DeepMind's 2022 "Training Compute-Optimal Large Language Models" (known as the Chinchilla paper).

Key Finding: Existing models are under-trained.

Unlike Kaplan's conclusion, the Chinchilla paper argued that model size and training data should be increased at nearly equal rates. Specifically, approximately 20 training tokens per parameter is compute-optimal.

$$N_{\text{opt}} \propto C^{0.50}, \quad D_{\text{opt}} \propto C^{0.50}$$

By this criterion, GPT-3 (175B parameters, 300B tokens) was data-starved. Compute-optimal training would have required approximately 3.5T (3.5 trillion) tokens.
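The 20-tokens-per-parameter rule of thumb makes the "data-starved" claim easy to verify in code:

```python
# Chinchilla rule of thumb: ~20 training tokens per parameter is
# compute-optimal. Compare against GPT-3's actual data budget.
TOKENS_PER_PARAM = 20

def chinchilla_optimal_tokens(n_params):
    return TOKENS_PER_PARAM * n_params

gpt3_params = 175e9
gpt3_tokens = 300e9
optimal = chinchilla_optimal_tokens(gpt3_params)   # 3.5e12 = 3.5T tokens
print(f"optimal: {optimal / 1e12:.1f}T tokens, actual: {gpt3_tokens / 1e12:.1f}T")
print(f"GPT-3 saw {gpt3_tokens / optimal:.0%} of its compute-optimal data budget")
```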

Chinchilla vs. GPT-3:

| Item | GPT-3 | Chinchilla |
| --- | --- | --- |
| Parameters | 175B | 70B |
| Training Tokens | 300B | 1.4T |
| Token/Parameter Ratio | ~1.7 | ~20 |
| MMLU (5-shot) | 43.9% | 67.5% |
| Compute | ~3,640 PF-days | ~5,200 PF-days |

Chinchilla is a 2.5x smaller model than GPT-3 but achieved higher performance by training on 4.7x more data. This result fundamentally influenced the direction of subsequent large-scale model training.

7.3 Impact of Scaling Laws on GPT-4

GPT-4's Predictable Scaling is a direct application of this Scaling Laws research. If the loss of small models follows a Power Law, then the trend line can be extrapolated to predict the loss of large models.

What the GPT-4 paper showed is that this prediction is surprisingly accurate. This suggests that Scaling Laws are not merely empirical observations but reflect deep structural properties of the language model training process.
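Predictable scaling can be demonstrated in miniature: fit a power law to the losses of "small" runs by linear regression in log-log space, then extrapolate. The data below is synthetic, generated from a known power law, so the fit recovers it exactly; real training curves would add noise around the trend line:

```python
# Fit L = a * C^(-alpha) to small-scale (compute, loss) pairs via
# least squares in log-log space, then extrapolate to large compute.
import math

# Synthetic data from a known power law (stand-in for small-model runs)
alpha_true, a_true = 0.05, 10.0
compute = [1e3, 1e4, 1e5, 1e6]
losses = [a_true * c ** (-alpha_true) for c in compute]

# log L = log a - alpha * log C  ->  ordinary least squares on logs
xs = [math.log(c) for c in compute]
ys = [math.log(v) for v in losses]
n = len(xs)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum((x - x_mean) ** 2 for x in xs)
alpha_fit = -slope
a_fit = math.exp(y_mean - slope * x_mean)

# Extrapolate four orders of magnitude beyond the largest "small" run
pred = a_fit * (1e10) ** (-alpha_fit)
print(f"fitted alpha = {alpha_fit:.3f}, predicted loss at C=1e10: {pred:.3f}")
```

This log-log regression is, in spirit, what GPT-4's predictable-scaling pipeline does with far more runs and far greater care.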

However, there are important limitations to this predictability:

  • Loss is not equal to Capability: Reduction in overall loss may not directly translate to improvement in specific abilities
  • Emergent Abilities: Abilities that suddenly appear at a certain scale are difficult to predict with Power Laws
  • Inverse Scaling: In some tasks, performance decreases as the model grows larger
  • Task-specific Variability: Scaling efficiency varies significantly across tasks

8. Overall Architecture Comparison

8.1 Generation-by-Generation Architecture Comparison Table

| Item | GPT-1 | GPT-2 (XL) | GPT-3 (175B) | InstructGPT | GPT-4 |
| --- | --- | --- | --- | --- | --- |
| Release Date | 2018.06 | 2019.02 | 2020.05 | 2022.03 | 2023.03 |
| Parameters | 117M | 1,542M | 175,000M | 1,300M~175,000M | Undisclosed |
| Layers | 12 | 48 | 96 | 96 (175B basis) | Undisclosed |
| Hidden Dim | 768 | 1,600 | 12,288 | 12,288 (175B basis) | Undisclosed |
| Attention Heads | 12 | 25 | 96 | 96 (175B basis) | Undisclosed |
| Head Dimension | 64 | 64 | 128 | 128 (175B basis) | Undisclosed |
| Context Window | 512 | 1,024 | 2,048 | 2,048 | 8,192 / 32,768 |
| Vocabulary Size | 40,000 | 50,257 | 50,257 | 50,257 | ~100,000 (est.) |
| Training Data | BooksCorpus (5GB) | WebText (40GB) | Mixed (570GB) | GPT-3 + Human Feedback | Undisclosed |
| Training Tokens | ~1B (est.) | ~10B (est.) | 300B | 300B + RLHF | Undisclosed |
| Tokenization | BPE (40K merges) | Byte-level BPE | Byte-level BPE | Byte-level BPE | Undisclosed |
| Positional Enc. | Learned | Learned | Learned | Learned | Undisclosed |
| Activation | GELU | GELU | GELU | GELU | Undisclosed |
| LayerNorm | Post-norm | Pre-norm | Pre-norm | Pre-norm | Undisclosed |
| Training Method | LM + Fine-tuning | LM only | LM only | LM + SFT + RLHF | LM + SFT + RLHF |
| Multimodal | No | No | No | No | Yes (Image Input) |
| Sparse Attention | No | No | Yes (partial) | Yes (partial) | Undisclosed |

8.2 Evolution of Paradigms

More important than the architecture itself is the evolution of paradigms.

GPT-1: Pre-train -> Fine-tune (fine-tuning required for each task)
         |
GPT-2: Pre-train -> Zero-shot (direct use without fine-tuning)
         |
GPT-3: Pre-train -> In-context Learning (task performance with examples only)
         |
InstructGPT: Pre-train -> SFT -> RLHF (alignment with human feedback)
         |
GPT-4: Pre-train -> SFT -> RLHF + Multimodal (multimodal + enhanced safety)

The consistent direction of this evolution is reducing user intervention. GPT-1 required training data and fine-tuning for each task, but by GPT-4, nearly all tasks can be performed with natural language instructions alone.
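This shift is visible purely in how a task is presented to the model. The prompts below, for a toy sentiment-classification task, are illustrative constructions, not examples drawn from the papers:

```python
# One task, three eras of prompting. All prompts are hypothetical.
task_input = "The movie was a complete waste of time."

# GPT-2 era: zero-shot -- rely on the raw LM to continue a bare pattern
zero_shot = f"Review: {task_input}\nSentiment:"

# GPT-3 era: few-shot in-context learning -- prepend worked examples
few_shot = (
    "Review: An absolute masterpiece.\nSentiment: positive\n\n"
    "Review: I fell asleep halfway through.\nSentiment: negative\n\n"
    f"Review: {task_input}\nSentiment:"
)

# InstructGPT/GPT-4 era: plain natural-language instruction
instruction = (
    "Classify the sentiment of the following review as positive or "
    f"negative.\n\nReview: {task_input}"
)

for name, prompt in [("zero-shot", zero_shot), ("few-shot", few_shot),
                     ("instruction", instruction)]:
    print(f"--- {name} ---\n{prompt}\n")
```

The instruction form requires no pattern engineering at all, which is exactly the "reduced user intervention" the diagram above traces.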


9. GPT's Impact: Transformation of the AI Ecosystem

9.1 ChatGPT and AI Democratization

The most direct impact of the GPT series is the democratization of AI through ChatGPT.

ChatGPT Growth Metrics:

  • Released November 30, 2022
  • 1 million users in 5 days
  • 100 million users in 2 months (fastest record ever, surpassing TikTok's 9 months)
  • Roughly 300 million weekly active users reported by the end of 2024

ChatGPT transformed the concept of "AI" from an exclusive domain of researchers and developers to an everyday tool for the general public. This transformation would have been impossible without InstructGPT's RLHF technology.

9.2 API Economy and AI-native Services

GPT-3's API release (June 2020) marked the beginning of the AI API Economy.

New Business Models:

  • Wrapper Services: Building specialized UX on top of the GPT API (Jasper, Copy.ai, etc.)
  • Vertical AI: AI solutions optimized for specific domains (Harvey for Law, Hippocratic AI for Healthcare)
  • AI-augmented SaaS: Integrating AI features into existing SaaS (Notion AI, GitHub Copilot, etc.)
  • Agent Frameworks: Autonomous agents using GPT as the core reasoning engine (AutoGPT, LangChain, etc.)

9.3 Academic Impact

The GPT series also had a fundamental impact on the direction of academic research.

Birth of New Research Fields:

  • Prompt Engineering: Research on prompt design to maximize the effectiveness of in-context learning
  • Alignment Research: Various alignment techniques beyond RLHF (DPO, ORPO, Constitutional AI, etc.)
  • Mechanistic Interpretability: Research to understand the internal workings of large models
  • Scaling Laws: Quantitative analysis of the relationship between model performance and resources
  • Evaluation: Recognizing the limitations of existing benchmarks and developing new evaluation methodologies

Changes in Research Methodology:

  • Shift in research focus from "model architecture innovation" to "data, training methods, alignment"
  • Growing gap between academic and industrial research due to increasing compute requirements
  • Partial recovery of academic accessibility through open-source models (LLaMA, Mistral, etc.)

9.4 Impact on Industry and Society

  • Education: AI tutors, automated grading, personalized learning content generation
  • Healthcare: Medical document writing assistance, diagnostic support, drug interaction analysis
  • Law: Case search, contract analysis, legal advice drafting
  • Software Development: Code generation, debugging, documentation (GitHub Copilot)
  • Content Creation: Writing assistance, translation, summarization, idea generation

10. Limitations and Criticisms

10.1 Hallucination

The most serious limitation of the GPT series is the Hallucination problem -- confidently generating information that is factually incorrect.

Types of Hallucination:

  • Factual Errors: Non-existent citations, incorrect statistics, fabricated historical facts
  • Logical Leaps: Jumping from premises to conclusions without valid reasoning
  • Self-contradiction: Making contradictory claims within the same conversation

Root Causes:

  • Auto-regressive models simply generate "plausible next tokens" without verifying factual accuracy
  • Training data contains errors, and the model cannot distinguish them
  • RLHF may encourage confident errors by rewarding "speaking confidently"
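The first root cause can be made concrete: an autoregressive decoding step only samples from a probability distribution over next tokens, with no fact-checking anywhere in the loop. The tokens and logits below are invented for illustration:

```python
# "Plausible next token" in miniature: sampling from a softmax over
# hypothetical logits. Nothing in this step verifies factual accuracy.
import math
import random

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for continuations of "The capital of Australia is"
tokens = ["Canberra", "Sydney", "Melbourne"]
logits = [2.0, 1.6, 0.5]   # the wrong answer is nearly as "plausible"

probs = softmax(logits)
for t, p in zip(tokens, probs):
    print(f"{t}: {p:.2f}")

# Sampling can fluently emit "Sydney" -- a confident factual error
random.seed(0)
choice = random.choices(tokens, weights=probs)[0]
print("sampled:", choice)
```

Whenever the training distribution makes a falsehood nearly as likely as the truth, sampling will sometimes produce it with perfect fluency.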

Through RLHF and related mitigations, GPT-4 is reported to score roughly 40% higher than GPT-3.5 on OpenAI's internal factuality evaluations, but complete resolution remains elusive. This is one of the most active areas of current LLM research.

10.2 Bias

Large language models reflect and sometimes amplify social biases inherent in their training data.

Types of Bias:

  • Gender Bias: Reflection of stereotypes in occupations, personality traits, etc.
  • Racial/Ethnic Bias: Negative associations with specific races
  • Cultural Bias: English-speaking, particularly US-centric worldview
  • Socioeconomic Bias: Overrepresentation of certain class perspectives

The GPT-3 paper explicitly acknowledged this and included bias analysis related to Gender, Race, and Religion. InstructGPT and GPT-4 attempted to reduce bias through RLHF, but completely eliminating bias inherent in training data remains a fundamentally challenging problem.

10.3 Environmental Cost

The environmental cost of large-scale model training is becoming an increasingly significant concern.

Estimated Training Carbon Emissions:

  • GPT-3: Approximately 552 tons CO2e (equivalent to the annual emissions of about 120 average US cars)
  • GPT-4: Estimated at approximately 15,000 tons CO2e (unofficial estimate, about 27x GPT-3)

Water Consumption:

  • Microsoft reportedly used approximately 700,000 liters of freshwater for data center cooling during GPT-3 training

Criticism and Counterarguments:

  • While the cost of a single training run is large, the trained model is used by hundreds of millions, so the per-person cost is negligible
  • Model efficiency improvements (Distillation, Quantization, Pruning) and hardware advances are reducing costs
  • However, concerns about Jevons Paradox (where efficiency improvements actually increase total consumption) also exist

10.4 Transparency and Reproducibility

One of the most persistent criticisms of the GPT series is lack of transparency.

  • GPT-1: Paper, code, and model released (relatively open)
  • GPT-2: Paper released, model released in stages ("too dangerous" controversy)
  • GPT-3: Paper released, model accessible only via API
  • GPT-4: Architecture, data, training cost, and other key information undisclosed

This trend sits increasingly at odds with the "Open" in the organization's name and has seriously undermined academic reproducibility. In response, the importance of open models such as Meta's LLaMA and Mistral AI's Mistral/Mixtral has become more prominent.

10.5 Economic Inequality and Compute Divide

The concentration of resources needed for large-scale model training exacerbates economic inequality in AI research.

  • GPT-3 training cost: Approximately $4.6 million (estimated)
  • GPT-4 training cost: Over approximately $100 million (estimated)
  • Investments of this scale are possible only for a few large corporations, structurally excluding universities and small research labs
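These cost figures can be roughly reproduced with the common $C \approx 6ND$ approximation (6 FLOPs per parameter per token). The hardware throughput and price assumptions below are illustrative, idealized (100% utilization) values, not OpenAI's actual numbers:

```python
# Back-of-the-envelope training compute and cost via C ~= 6 * N * D.
def training_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

gpt3_flops = training_flops(175e9, 300e9)   # ~3.15e23 FLOPs
pf_days = gpt3_flops / (1e15 * 86400)       # 1 PF-day = 1e15 FLOP/s for a day
print(f"GPT-3: {gpt3_flops:.2e} FLOPs ~= {pf_days:,.0f} PF-days")

# Assumed hardware: V100-class GPU at ~28 TFLOPS peak, $1.50/GPU-hour
gpu_hours = gpt3_flops / (28e12 * 3600)
cost = gpu_hours * 1.50
print(f"~{gpu_hours / 1e6:.1f}M GPU-hours -> ~${cost / 1e6:.1f}M")
```

Under these assumptions the arithmetic lands near the ~3,640 PF-days and ~$4.6M estimates cited for GPT-3; real-world utilization, engineering, and failed runs push actual costs higher.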

11. Summary: The Legacy of GPT

The key insights running through the five papers of the GPT series can be summarized as follows:

1. Scale is (almost) all you need

The scaling from GPT-1 (117M) to GPT-2 (1.5B) to GPT-3 (175B) was not simply "the same thing but bigger" -- it led to qualitatively new emergent abilities. Zero-shot, in-context learning, and complex reasoning were Emergent Abilities that appear only at sufficient scale.

2. Alignment changes everything

InstructGPT showed that training methodology can matter more than model size. The 1.3B InstructGPT beating 175B GPT-3 demonstrated that there is a large gap between raw capability and usefulness, and RLHF can bridge that gap.

3. The bitter lesson revisited

Rich Sutton's "The Bitter Lesson" -- general methods + more compute beat specialized methods -- was repeatedly confirmed in the GPT series. General-purpose Transformer + large-scale pre-training was overwhelmingly more effective than task-specific architectures.

4. Data is the new bottleneck

After Chinchilla's lesson, the quantity and quality of training data emerged as a key bottleneck alongside model size. High-quality text on the internet is finite, and Synthetic Data generation is emerging as a new research direction.

5. Safety is not optional

From GPT-2's "too dangerous to release" controversy to GPT-4's red-teaming, safety has become mandatory, not optional. As AI models become more powerful, the importance of safe and responsible development grows proportionally.

The GPT series is not yet over. What capabilities GPT-5 and beyond will show remains unknown, but one thing is certain: the paradigm of "large-scale pre-training + human feedback alignment" established by the GPT series has become the foundation of modern AI, and understanding it is essential for understanding the future of AI.


12. References

  1. GPT-1: Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). "Improving Language Understanding by Generative Pre-Training." OpenAI Paper

  2. GPT-2: Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). "Language Models are Unsupervised Multitask Learners." OpenAI Paper

  3. GPT-3: Brown, T. B., Mann, B., Ryder, N., et al. (2020). "Language Models are Few-Shot Learners." NeurIPS 2020. arXiv:2005.14165

  4. InstructGPT: Ouyang, L., Wu, J., Jiang, X., et al. (2022). "Training Language Models to Follow Instructions with Human Feedback." NeurIPS 2022. arXiv:2203.02155

  5. GPT-4: OpenAI. (2023). "GPT-4 Technical Report." arXiv:2303.08774

  6. Scaling Laws: Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361

  7. Chinchilla: Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). "Training Compute-Optimal Large Language Models." arXiv:2203.15556

  8. Transformer: Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need." NeurIPS 2017. arXiv:1706.03762

  9. PPO: Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). "Proximal Policy Optimization Algorithms." arXiv:1707.06347

  10. RLHF: Christiano, P. F., Leike, J., Brown, T., et al. (2017). "Deep Reinforcement Learning from Human Preferences." NeurIPS 2017. arXiv:1706.03741

  11. Sparse Transformer: Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). "Generating Long Sequences with Sparse Transformers." arXiv:1904.10509

  12. BPE: Sennrich, R., Haddow, B., & Birch, A. (2016). "Neural Machine Translation of Rare Words with Subword Units." ACL 2016. arXiv:1508.07909

  13. Carbon Footprint: Patterson, D., Gonzalez, J., Le, Q., et al. (2021). "Carbon Emissions and Large Neural Network Training." arXiv:2104.10350
