- 1. Paper Overview
- 2. Background: Why Bidirectionality Was Needed
- 3. BERT Architecture
- 4. Input Representation
- 5. Pre-training Methodology
- 6. Fine-tuning Strategy
- 7. Summary of Key Formulas
- 8. Experimental Results
- 9. Code Examples: HuggingFace Transformers
- 10. BERT's Impact and Subsequent Research
- 11. Limitations and Lessons
- 12. Conclusion
- References
1. Paper Overview
"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" was published in October 2018 by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova from Google AI Language. It was subsequently selected as Best Long Paper at NAACL 2019.
BERT stands for Bidirectional Encoder Representations from Transformers, and as the name implies, it is a language representation model that pre-trains the Transformer's Encoder bidirectionally. The core idea is simple: learn bidirectional context from a large-scale unsupervised text corpus, then perform minimal task-specific Fine-tuning.
This simple approach simultaneously achieved State-of-the-Art (SOTA) on 11 NLP benchmarks. It pushed the GLUE score to 80.5% (a 7.7% absolute improvement), recorded SQuAD v1.1 F1 of 93.2 (1.5 point improvement), and SQuAD v2.0 F1 of 83.1 (5.1 point improvement). BERT is the paper that established the Pre-training + Fine-tuning paradigm in NLP and became the starting point for virtually every language model that followed.
2. Background: Why Bidirectionality Was Needed
2.1 Two Branches of Pre-trained Language Representations
Before BERT, approaches leveraging pre-trained language representations were broadly split into two branches.
Feature-based Approach (ELMo)
ELMo (Embeddings from Language Models) proposed by Peters et al. (2018) independently trained a Forward LSTM and a Backward LSTM, then concatenated the hidden states from both directions to generate context-dependent word representations. ELMo representations were used as input features for downstream tasks, requiring a separate architecture design for each task.
$$\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} \left[\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}\right]$$
Here, $\overrightarrow{h}_{k,j}^{LM}$ and $\overleftarrow{h}_{k,j}^{LM}$ are the hidden states of the Forward and Backward LSTM's $j$-th layer, respectively, and $s_j^{task}$, $\gamma^{task}$ are learnable weights. The key limitation is that the Forward and Backward directions are trained independently. The information from both directions does not interact in deeper layers.
Fine-tuning Approach (OpenAI GPT)
GPT (Generative Pre-Training) by Radford et al. (2018) used a Transformer Decoder for Left-to-Right Language Modeling during pre-training, then fine-tuned the entire model for downstream tasks. While it had the advantage of minimizing task-specific architecture changes, it had the fundamental limitation of encoding context only unidirectionally (Left-to-Right).
2.2 The Limitations of Unidirectionality
To understand the meaning of "bank" in "The bank of the river," one must consider not only "The" to the left of "bank" but also "river" to the right. A Left-to-Right model cannot reference "river" when encoding "bank." This was the fundamental limitation of GPT-1.
ELMo considers both directions, but it independently trains Forward and Backward and then simply concatenates them — a "shallow bidirectionality." True bidirectional representation requires that left and right context simultaneously interact and learn together at every layer.
2.3 Comparison of Three Approaches
| Property | ELMo | GPT-1 | BERT |
|---|---|---|---|
| Architecture | Bi-LSTM | Transformer Decoder | Transformer Encoder |
| Directionality | Shallow bidirectional (independent) | Unidirectional (L-to-R) | Deep bidirectional |
| Pre-training Objective | Forward + Backward LM | Left-to-Right LM | MLM + NSP |
| Downstream Application | Feature-based | Fine-tuning | Fine-tuning |
| Task-specific Architecture | Required for each task | Minimal changes | Minimal changes |
| Parameters | 94M | 117M | 110M (Base) / 340M (Large) |
BERT's core contribution was combining the strengths of both approaches. Like GPT, it is Fine-tuning-based, and like ELMo, it leverages bidirectional context — but instead of independent training, it learns bidirectional context simultaneously across all layers (deeply).
3. BERT Architecture
3.1 Transformer Encoder Stack
BERT uses only the Encoder portion of the Transformer architecture from Vaswani et al. (2017). It does not use the Decoder. Each Encoder layer consists of two Sub-layers.
- Multi-Head Self-Attention: All tokens in the input sequence attend to all other tokens bidirectionally
- Position-wise Feed-Forward Network: Performs nonlinear transformations independently on each token
Residual Connection and Layer Normalization are applied to each Sub-layer.
Because it does not use a Decoder, there is no Auto-regressive Masking, and tokens at all positions can freely attend to tokens at all other positions. This is the structural basis that enables BERT to be a truly bidirectional (deeply bidirectional) model.
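The difference between BERT's unmasked attention and a GPT-style causal mask can be illustrated with a toy attention-weight computation. This is a minimal stdlib-only sketch (the score matrix is made up; a real model computes it as $QK^\top/\sqrt{d_k}$):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_weights(scores, causal=False):
    """Row i holds token i's attention distribution over all tokens.
    With causal=True, future positions are masked out (GPT-style);
    BERT's Encoder applies no such mask, so every row spans the full sequence."""
    n = len(scores)
    weights = []
    for i in range(n):
        row = list(scores[i])
        if causal:
            for j in range(i + 1, n):
                row[j] = float("-inf")  # exp(-inf) = 0: future tokens get zero weight
        weights.append(softmax(row))
    return weights

# Toy 3-token score matrix (pretend QK^T / sqrt(d_k) was already applied)
scores = [[0.1, 0.9, 0.3],
          [0.2, 0.1, 0.8],
          [0.5, 0.4, 0.2]]

bi = attention_weights(scores)                # bidirectional: BERT-style
uni = attention_weights(scores, causal=True)  # unidirectional: GPT-style

print(bi[0])   # token 0 also attends to tokens 1 and 2
print(uni[0])  # token 0 can only attend to itself
```

In the causal case the first token's distribution collapses onto itself, while in the bidirectional case it spreads over the whole sequence, which is exactly the property section 2.2 argues is needed for words like "bank."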
3.2 BERT-Base vs BERT-Large
The paper presents two model sizes.
| Setting | BERT-Base | BERT-Large |
|---|---|---|
| Layers ($L$) | 12 | 24 |
| Hidden Size ($H$) | 768 | 1,024 |
| Attention Heads ($A$) | 12 | 16 |
| Feed-Forward Size | 3,072 ($4H$) | 4,096 ($4H$) |
| Total Parameters | 110M | 340M |
BERT-Base was intentionally designed to have the same model size (number of layers, Hidden Size, Attention Heads) as GPT-1. This was to ensure a fair comparison showing that the difference in pre-training methodology, rather than architecture size, was the cause of performance improvement.
The dimension of each Attention Head is $d_k = H / A$. For BERT-Base this is $768 / 12 = 64$, and for BERT-Large it is $1024 / 16 = 64$, so both models have a per-head dimension of 64.
3.3 Activation Function: GELU
BERT uses GELU (Gaussian Error Linear Unit) as its activation function instead of the ReLU from the original Transformer paper.
$$\mathrm{GELU}(x) = x \cdot \Phi(x) \approx 0.5x\left(1 + \tanh\!\left[\sqrt{2/\pi}\,\left(x + 0.044715x^3\right)\right]\right)$$
Unlike ReLU, GELU activates smoothly depending on the input value and maintains a slight gradient even in the negative region. It has since become the standard in most Transformer-based models including GPT-2 and RoBERTa.
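The smooth behavior in the negative region can be seen numerically. A small stdlib-only comparison of ReLU, exact GELU ($x\Phi(x)$ via the error function), and the tanh approximation used in the original BERT code:

```python
import math

def gelu_exact(x):
    # GELU(x) = x * Phi(x), where Phi is the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh approximation from the original BERT implementation
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def relu(x):
    return max(0.0, x)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu_exact(x):.4f}  approx={gelu_tanh(x):.4f}")
```

Note that at $x = -0.5$ ReLU is exactly zero while GELU is small but nonzero, which is the "slight gradient in the negative region" mentioned above.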
4. Input Representation
BERT's input representation is composed of the sum of three Embeddings. This design allows a single model to handle both single-sentence and sentence-pair tasks.
4.1 WordPiece Tokenization
BERT uses a vocabulary of 30,000 WordPiece tokens. WordPiece is a variant of Byte-Pair Encoding (BPE) that merges subwords based on likelihood rather than frequency.
For example, "playing" is split into "play" + "##ing." The "##" prefix indicates that the token is a continuation of the previous token rather than the start of a word. The advantages of this approach are:
- Solving the OOV (Out-of-Vocabulary) problem: Any word can be represented as a combination of subwords
- Preserving morphological information: "play" and "playing" share the common token "play"
- Vocabulary size control: All text can be covered with a limited vocabulary of 30,000
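The splitting behavior at inference time can be sketched as greedy longest-match-first lookup. This is a simplified illustration with a tiny hand-picked vocabulary, not the real WordPiece training algorithm (which selects merges by likelihood):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword splitting, a simplified sketch
    of WordPiece inference."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # '##' marks a continuation piece
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]           # nothing matched: OOV fallback
        tokens.append(cur)
        start = end
    return tokens

# Tiny illustrative vocabulary (a real BERT vocab has 30,000 entries)
vocab = {"play", "##ing", "##ed", "##er", "un", "##able"}
print(wordpiece_tokenize("playing", vocab))  # ['play', '##ing']
print(wordpiece_tokenize("played", vocab))   # ['play', '##ed']
```

Both inflections share the "play" piece, illustrating the morphology-preservation point above.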
4.2 Special Tokens: [CLS] and [SEP]
BERT uses two special tokens.
[CLS] (Classification Token): Added at the very beginning of every input sequence. The final hidden state of this token is used as the aggregate representation of the entire sequence for classification tasks. Through Self-Attention, information from all tokens in the sequence converges at [CLS].
[SEP] (Separator Token): Used to separate two sentences. For single-sentence tasks, it is placed at the end of the sentence; for sentence-pair tasks, it is placed between the two sentences and at the end of the second sentence.
Input format examples:
Single sentence: [CLS] I love NLP [SEP]
Sentence pair: [CLS] How old are you ? [SEP] I am 25 years old . [SEP]
4.3 Sum of Three Embeddings
The input representation of each token is generated by element-wise summation of the following three Embeddings.
Token Embedding: A learnable Embedding for WordPiece tokens. It has an Embedding matrix of size $30{,}000 \times H$ for a vocabulary size of 30,000.
Segment Embedding: A learnable Embedding that distinguishes whether the input belongs to Sentence A or Sentence B. Tokens belonging to Sentence A are represented by $E_A$, and those belonging to Sentence B by $E_B$. For single-sentence tasks, all tokens use $E_A$.
Position Embedding: A learnable Embedding that encodes position information within the sequence. Unlike the Sinusoidal Positional Encoding of the original Transformer paper, BERT uses learned Position Embeddings. The maximum sequence length is 512.
The sum of these three Embeddings is fed into the first layer of the Transformer Encoder.
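The three-way sum can be made concrete with a toy stdlib-only sketch (tiny made-up sizes; BERT-Base uses $H = 768$, a 30,000-token vocabulary, and 512 positions):

```python
import random

random.seed(0)
H = 8                      # toy hidden size (768 in BERT-Base)
VOCAB, SEGMENTS, MAX_POS = 12, 2, 16

def embedding_table(rows, dim):
    return [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]

token_emb = embedding_table(VOCAB, H)
segment_emb = embedding_table(SEGMENTS, H)
position_emb = embedding_table(MAX_POS, H)   # learned, not sinusoidal

def bert_input(token_ids, segment_ids):
    """Element-wise sum of token + segment + position embeddings per position."""
    return [
        [t + s + p for t, s, p in zip(token_emb[tok], segment_emb[seg], position_emb[pos])]
        for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids))
    ]

# [CLS] sent-A tokens [SEP] sent-B tokens [SEP] -> segment ids 0,0,0,1,1
x = bert_input([0, 5, 1, 7, 1], [0, 0, 0, 1, 1])
print(len(x), len(x[0]))  # 5 positions, each an H-dim vector
```

Each position's vector depends on which token it is, which segment it sits in, and where it sits, exactly the three signals described above.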
5. Pre-training Methodology
BERT's pre-training consists of two unsupervised tasks. The design of these two tasks is the core contribution of the BERT paper.
5.1 Masked Language Model (MLM)
Motivation: The Dilemma of Bidirectional Training
A standard Language Model predicts $P(w_t \mid w_1, \dots, w_{t-1})$, the probability of the next token given the previous tokens. This is inherently Left-to-Right. If a Bidirectional Language Model were trained directly, each token could indirectly "see itself." In a multi-layer network, bidirectional context circulates, leaking information about the prediction target word.
BERT resolved this dilemma with the Masked Language Model, inspired by the Cloze Task (fill-in-the-blank).
15% Masking Strategy
In each training sequence, 15% of WordPiece tokens are randomly selected for masking. The training objective is to predict the original words of the selected tokens.
However, simply replacing with the [MASK] token creates a problem. Since the [MASK] token never appears in the input during Fine-tuning, a mismatch occurs between pre-training and Fine-tuning.
80/10/10 Rule
To mitigate this mismatch, the selected 15% of tokens are processed in the following proportions.
| Proportion | Processing Method | Example (when "hairy" is selected from "my dog is hairy") |
|---|---|---|
| 80% | Replace with [MASK] | my dog is [MASK] |
| 10% | Replace with random token | my dog is apple |
| 10% | Keep original token | my dog is hairy |
The specific effects of this strategy are as follows.
- 80% [MASK]: The model learns to reconstruct the original word from context
- 10% Random: Forces the model to be uncertain about whether any input token is the real word, maintaining correct representations at all positions
- 10% Unchanged: Reduces the gap with actual inputs observed during Fine-tuning
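The selection-then-corruption procedure can be sketched in a few lines. A minimal stdlib-only illustration (the token-level vocabulary and 15% per-token sampling are simplifications of the real WordPiece pipeline):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Sketch of BERT's MLM corruption: select ~15% of positions, then apply
    the 80/10/10 rule ([MASK] / random token / unchanged)."""
    rng = random.Random(seed)
    out = list(tokens)
    labels = {}                        # position -> original token to predict
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            out[i] = "[MASK]"          # 80%: replace with [MASK]
        elif r < 0.9:
            out[i] = rng.choice(vocab) # 10%: replace with a random token
        # else: 10%: keep the original token unchanged
    return out, labels

tokens = "my dog is hairy and it likes to play in the park".split()
vocab = ["apple", "run", "blue", "cat"]
corrupted, labels = mask_tokens(tokens, vocab, seed=3)
print(corrupted)
print(labels)
```

Note that even when a token is left unchanged (the final 10%), it still appears in `labels`: the model must predict it, which is what keeps representations honest at every selected position.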
The paper's Ablation Study showed that the 80/10/10 ratio was optimal, and using 100% [MASK] significantly degraded performance in the feature-based approach.
MLM Loss Function
The MLM loss function computes Cross-Entropy Loss only at masked positions.
$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in M} \log P(x_i \mid \hat{x})$$
Here, $M$ is the set of masked token indices, $\hat{x}$ is the masked input sequence, and $P(x_i \mid \hat{x})$ is the probability of the model predicting the original token $x_i$ at masked position $i$.
Specifically, the final hidden state $h_i$ at each masked position is projected to the vocabulary size and Softmax is applied.
$$P(x_i \mid \hat{x}) = \mathrm{softmax}(W h_i + b)_{x_i}$$
Here, $W \in \mathbb{R}^{|V| \times H}$ is the output weight matrix and $|V|$ is the vocabulary size (30,000).
A downside of MLM is that only 15% of tokens are predicted per batch, so convergence is slower than standard Left-to-Right LM. However, the paper empirically demonstrated that the performance gains more than compensated for this cost.
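The "loss only at masked positions" computation is easy to state in code. A toy stdlib-only sketch with made-up logits (a real model produces a 30,000-way logit vector per position):

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def mlm_loss(logits_per_position, labels):
    """Cross-entropy averaged only over masked positions.
    `logits_per_position[i]` is the vocabulary-sized logit vector at position i;
    `labels` maps each masked position to the original token's index."""
    total = 0.0
    for pos, target in labels.items():
        total += -log_softmax(logits_per_position[pos])[target]
    return total / len(labels)

# Toy example: 3 positions, vocabulary of 4; positions 0 and 2 are masked
logits = [[2.0, 0.1, 0.1, 0.1],
          [0.0, 0.0, 0.0, 0.0],   # unmasked position: contributes nothing
          [0.1, 0.1, 3.0, 0.1]]
labels = {0: 0, 2: 2}
print(f"{mlm_loss(logits, labels):.4f}")
```

Position 1 never enters the sum, which is exactly why only 15% of tokens provide a learning signal per batch.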
5.2 Next Sentence Prediction (NSP)
Motivation
Tasks like Question Answering (QA) and Natural Language Inference (NLI) require understanding the relationship between two sentences. This relationship is difficult to learn through simple Language Modeling alone.
Training Method
NSP is a binary classification task. Given two sentences A and B, it predicts whether B actually follows A (IsNext) or was randomly selected from the corpus (NotNext).
- 50%: B is the actual next sentence after A (IsNext)
- 50%: B is randomly selected from the corpus (NotNext)
[CLS] The man went to [MASK] store [SEP] He bought a gallon [MASK] milk [SEP]
Label: IsNext
[CLS] The man [MASK] to the store [SEP] Penguin [MASK] are flight ##less birds [SEP]
Label: NotNext
The final hidden state of the [CLS] token is fed into a binary classifier for prediction.
Effect and Controversy of NSP
In the paper's Ablation Study, removing NSP resulted in significant performance drops: QNLI (-3.5%), MNLI (-0.5%), SQuAD (-0.6%). The decline was particularly notable in tasks where sentence-pair relationship reasoning is important.
However, subsequent research by RoBERTa (Liu et al., 2019) questioned the effectiveness of NSP. RoBERTa achieved higher performance than BERT while removing NSP, arguing that NSP's effect was more heavily influenced by how the training data was constructed.
5.3 Total Pre-training Loss
The final pre-training loss is the sum of the MLM and NSP losses.
$$\mathcal{L} = \mathcal{L}_{\mathrm{MLM}} + \mathcal{L}_{\mathrm{NSP}}$$
5.4 Pre-training Data and Settings
| Setting | Value |
|---|---|
| Training Data | BooksCorpus (800M words) + English Wikipedia (2,500M words) |
| Total Data Size | Approximately 16GB of text |
| Vocabulary Size | 30,000 WordPiece tokens |
| Max Sequence Length | 512 tokens |
| Batch Size | 256 sequences (128,000 tokens/batch) |
| Training Steps | 1,000,000 steps (approximately 40 epochs) |
| Optimizer | Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, L2 weight decay 0.01) |
| Learning Rate | 1e-4 (linear decay after 10,000 steps warmup) |
| Dropout | 0.1 (all layers) |
| Activation | GELU |
| Hardware | 4 Cloud TPUs in Pod configuration (16 TPU chips total, BERT-Base) / 16 Cloud TPUs (64 TPU chips total, BERT-Large) |
| Training Time | BERT-Base: 4 days, BERT-Large: 4 days |
6. Fine-tuning Strategy
BERT's Fine-tuning is remarkably simple. A single task-specific output layer is added on top of the pre-trained BERT, and the entire model is Fine-tuned End-to-End. Compared to pre-training, Fine-tuning is very fast. For most tasks, it completes within 1 hour on a single Cloud TPU or within a few hours on GPUs.
6.1 Fine-tuning Hyperparameters
The following hyperparameter ranges work well for most tasks.
| Hyperparameter | Recommended Range |
|---|---|
| Batch Size | 16, 32 |
| Learning Rate (Adam) | 5e-5, 4e-5, 3e-5, 2e-5 |
| Epochs | 2, 3, 4 |
| Dropout | 0.1 (same as pre-training) |
6.2 Sentence/Sentence Pair Classification
For tasks such as sentiment analysis (SST-2), natural language inference (MNLI, RTE), and sentence similarity (STS-B, MRPC, QQP), the final hidden state of the [CLS] token is fed into a classifier.
$$P = \mathrm{softmax}(C W^\top)$$
Here, $C \in \mathbb{R}^H$ is the final hidden state of [CLS], $W \in \mathbb{R}^{K \times H}$ is the classification layer weight, and $K$ is the number of labels.
For sentence pair tasks, the input format is:
[CLS] Sentence A [SEP] Sentence B [SEP]
Segment Embedding distinguishes Sentence A from B, and the [CLS] representation encodes the relationship between the two sentences.
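The added classification head is just one projection plus a softmax. A stdlib-only sketch with toy sizes (the [CLS] vector and weights are made up; BERT-Base uses $H = 768$):

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def classify(cls_hidden, W, b):
    """P = softmax(C W^T + b): the [CLS] final hidden state C (dim H) is
    projected to K label logits by the single task-specific output layer."""
    logits = [sum(w_i * c_i for w_i, c_i in zip(row, cls_hidden)) + b_k
              for row, b_k in zip(W, b)]
    return softmax(logits)

H, K = 4, 2                          # toy sizes
cls_hidden = [0.5, -0.2, 0.1, 0.3]   # stand-in for the [CLS] vector
W = [[0.2, 0.1, -0.3, 0.4],          # K x H classifier weights
     [-0.1, 0.5, 0.2, 0.0]]
b = [0.0, 0.0]
probs = classify(cls_hidden, W, b)
print(probs)  # probabilities over K labels, summing to 1
```

During Fine-tuning both $W$ and all of BERT's parameters are updated End-to-End; only this tiny head is new.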
6.3 Question Answering (QA)
In SQuAD (Stanford Question Answering Dataset), given a question and passage, the model predicts the start and end positions of the answer span in the passage.
[CLS] Question [SEP] Passage [SEP]
Start position vector $S \in \mathbb{R}^H$ and end position vector $E \in \mathbb{R}^H$ are learned, and the start/end probabilities are computed for each token in the passage.
$$P_i^{\mathrm{start}} = \frac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}}, \qquad P_i^{\mathrm{end}} = \frac{e^{E \cdot T_i}}{\sum_j e^{E \cdot T_j}}$$
Here, $T_i$ is the final hidden state of token $i$. The score of candidate answer span $(i, j)$ is computed as $S \cdot T_i + E \cdot T_j$ ($j \ge i$).
For SQuAD v2.0, which also requires handling cases with no answer, both start and end positions are set to [CLS] when there is no answer, and a "no answer" probability is computed.
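The span selection subject to $j \ge i$ can be written directly. A stdlib-only sketch over made-up per-token scores (a real model would supply $S \cdot T_i$ and $E \cdot T_j$ for every token):

```python
def best_span(start_scores, end_scores, max_len=None):
    """Pick (i, j) maximizing start_scores[i] + end_scores[j] subject to
    j >= i (and optionally a maximum span length)."""
    best, best_score = (0, 0), float("-inf")
    for i in range(len(start_scores)):
        j_stop = len(end_scores) if max_len is None else min(len(end_scores), i + max_len)
        for j in range(i, j_stop):
            score = start_scores[i] + end_scores[j]
            if score > best_score:
                best_score, best = score, (i, j)
    return best, best_score

start = [0.1, 2.0, 0.3, 0.2]
end = [1.5, 0.2, 1.8, 0.1]
span, score = best_span(start, end)
print(span, score)
```

Note that taking independent argmaxes here would give start = 1 and end = 0, an invalid backwards span; the joint search with the $j \ge i$ constraint avoids that.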
6.4 Named Entity Recognition (NER)
In the CoNLL-2003 NER task, each token is classified as Person, Organization, Location, Miscellaneous, or Other. Here, the final hidden state of each token is fed into the classifier, not the [CLS] representation.
For subword tokens split by WordPiece, the prediction of the first subword is typically used as the label for the corresponding word.
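First-subword labeling is a small alignment step. A stdlib-only sketch using the "##" continuation convention, with -100 as the ignore index commonly used by cross-entropy implementations (token strings and labels here are made up):

```python
def align_labels(word_labels, subword_tokens):
    """Assign each word's NER label to its first subword; continuation
    pieces (prefixed '##') get the ignore index -100."""
    labels, word_idx = [], -1
    for tok in subword_tokens:
        if tok.startswith("##"):
            labels.append(-100)        # ignored by the loss
        else:
            word_idx += 1
            labels.append(word_labels[word_idx])
    return labels

# "Washington visited Springfield" -> B-PER O B-LOC, with WordPiece splits
tokens = ["Washington", "visit", "##ed", "Spring", "##field"]
word_labels = ["B-PER", "O", "B-LOC"]
print(align_labels(word_labels, tokens))
# ['B-PER', 'O', -100, 'B-LOC', -100]
```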
6.5 Sequence Labeling Generalization
The same approach can be applied to all tasks that assign a label to each token in a sequence, including NER, POS Tagging, and Chunking. Because each of BERT's token representations sufficiently encodes bidirectional context, high performance is achieved without additional sequence modeling layers like CRF (Conditional Random Field).
6.6 Feature-based Approach
BERT can also be used as a Feature-based approach beyond Fine-tuning. The Transformer Encoder is frozen, and hidden states from specific layers are extracted as features to be fed into a separate model.
In the paper's experiment applying the Feature-based approach to CoNLL-2003 NER, concatenating the hidden states of the last 4 layers achieved 96.1% F1 (dev), close to the Fine-tuning approach's 96.4% F1 (dev). This demonstrates that BERT's pre-trained representations themselves contain very rich linguistic information.
7. Summary of Key Formulas
7.1 Self-Attention (Scaled Dot-Product Attention)
The Self-Attention formula performed in each Encoder layer of BERT.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Here, $Q = XW^Q$, $K = XW^K$, $V = XW^V$, and $d_k$ is the dimension of each Attention Head.
7.2 Multi-Head Attention
$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_A)W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(XW_i^Q, XW_i^K, XW_i^V)$$
BERT-Base uses $A = 12$ and BERT-Large uses $A = 16$ heads.
7.3 Feed-Forward Network
The Position-wise FFN after each Attention Sub-layer.
$$\mathrm{FFN}(x) = \mathrm{GELU}(xW_1 + b_1)W_2 + b_2$$
Here, $W_1 \in \mathbb{R}^{H \times 4H}$ and $W_2 \in \mathbb{R}^{4H \times H}$. The inner dimension expands to 4 times the Hidden Size before projecting back down to $H$.
7.4 MLM Loss (Summary)
Cross-Entropy Loss over the set of masked tokens $M$.
$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in M} \log \frac{\exp(h_i^\top e_{x_i})}{\sum_{v=1}^{|V|} \exp(h_i^\top e_v)}$$
Here, $h_i$ is the final hidden state at masked position $i$, $e_{x_i}$ is the $x_i$-th row of the output Embedding matrix, and $|V| = 30{,}000$.
7.5 NSP Loss
$$\mathcal{L}_{\mathrm{NSP}} = -\left[\, y \log P_{\mathrm{IsNext}} + (1 - y) \log\left(1 - P_{\mathrm{IsNext}}\right) \right]$$
Here, $y \in \{0, 1\}$ is the actual label.
7.6 Total Pre-training Objective
$$\mathcal{L} = \mathcal{L}_{\mathrm{MLM}} + \mathcal{L}_{\mathrm{NSP}}$$
8. Experimental Results
8.1 GLUE Benchmark
GLUE (General Language Understanding Evaluation) is a benchmark consisting of 9 NLU tasks. BERT-Large achieved an overall GLUE score of 80.5%, recording a 7.7% absolute improvement over the previous SOTA.
| Task | Metric | BERT-Base | BERT-Large | Previous SOTA |
|---|---|---|---|---|
| MNLI-m / MNLI-mm | Accuracy | 84.6 / 83.4 | 86.7 / 85.9 | 80.6 / 80.1 |
| QQP | F1 | 71.2 | 72.1 | 66.1 |
| QNLI | Accuracy | 90.5 | 92.7 | 87.4 |
| SST-2 | Accuracy | 93.5 | 94.9 | 93.2 |
| CoLA | Matthews Corr | 52.1 | 60.5 | 35.0 |
| STS-B | Spearman Corr | 85.8 | 86.5 | 81.0 |
| MRPC | F1 | 88.9 | 89.3 | 86.0 |
| RTE | Accuracy | 66.4 | 70.1 | 61.7 |
| WNLI | Accuracy | - | 65.1 | 65.1 |
The 25.5-point absolute improvement over the previous SOTA on CoLA is particularly impressive. CoLA (Corpus of Linguistic Acceptability) is a task that judges the grammatical acceptability of sentences, requiring deep language understanding.
8.2 SQuAD v1.1
SQuAD v1.1 (Stanford Question Answering Dataset) is an Extractive QA task that extracts answer spans from passages.
| Model | EM (Exact Match) | F1 |
|---|---|---|
| Previous SOTA (single model) | 84.4 | 91.0 |
| BERT-Large (single model) | 84.1 | 90.9 |
| Previous SOTA (ensemble) | 86.7 | 91.7 |
| BERT-Large + TriviaQA (ensemble) | 87.4 | 93.2 |
The ensemble model leveraging TriviaQA data achieved F1 of 93.2, surpassing the human performance benchmark (91.2 F1).
8.3 SQuAD v2.0
SQuAD v2.0 adds unanswerable questions to v1.1.
| Model | EM | F1 |
|---|---|---|
| Previous SOTA | 73.7 | 77.0 |
| BERT-Large (single model) | 80.0 | 83.1 |
It achieved a 6.1 point improvement in F1 over the previous SOTA. BERT's bidirectional context understanding showed particular strength in the ability to determine whether a question is answerable.
8.4 SWAG
SWAG (Situations With Adversarial Generations) is a commonsense reasoning task that selects the most appropriate continuation sentence from 4 candidates given a context sentence.
| Model | Dev Accuracy | Test Accuracy |
|---|---|---|
| Human | - | 88.0 |
| ESIM + ELMo | 51.9 | 52.7 |
| OpenAI GPT | - | 78.0 |
| BERT-Base | 81.6 | - |
| BERT-Large | 86.6 | 86.3 |
BERT-Large surpassed GPT by 8.3% and approached human performance (88.0%).
8.5 Ablation Study: Impact of Pre-training Tasks
The paper analyzed the importance of pre-training tasks using BERT-Base as the baseline.
| Setting | MNLI-m | QNLI | MRPC | SST-2 | SQuAD F1 |
|---|---|---|---|---|---|
| BERT-Base (MLM + NSP) | 84.4 | 88.4 | 86.7 | 92.7 | 88.5 |
| No NSP | 83.9 | 84.9 | 86.5 | 92.6 | 87.9 |
| LTR (Left-to-Right) & No NSP | 82.1 | 84.3 | 77.5 | 92.1 | 77.8 |
| LTR + BiLSTM & No NSP | 82.1 | 84.1 | 75.7 | 91.6 | 84.9 |
Key findings:
- NSP Removal: 3.5% drop on QNLI; significant impact on tasks where sentence-pair relationships matter
- Left-to-Right: 10.7 F1 drop on SQuAD compared to bidirectional; bidirectionality is critical for token-level tasks
- LTR + BiLSTM: Even adding a BiLSTM cannot substitute for bidirectional pre-training
8.6 Ablation Study: Impact of Model Size
| Setting | L | H | A | Params | MNLI-m | MRPC | SST-2 | SQuAD F1 |
|---|---|---|---|---|---|---|---|---|
| 3-layer | 3 | 768 | 12 | 45M | 77.9 | 79.8 | 88.4 | 75.6 |
| 6-layer | 6 | 768 | 12 | 67M | 80.6 | 84.3 | 91.1 | 83.7 |
| BERT-Base | 12 | 768 | 12 | 110M | 84.4 | 86.7 | 92.7 | 88.5 |
| BERT-Large | 24 | 1024 | 16 | 340M | 86.6 | 87.8 | 93.7 | 91.3 |
Increasing model size resulted in consistent performance improvements across all tasks. It is particularly noteworthy that the larger model performed better even on small datasets (MRPC: 3,600 training examples). This suggests that pre-training provides sufficient knowledge to effectively train large models even with limited Fine-tuning data.
9. Code Examples: HuggingFace Transformers
9.1 Masked Language Model Prediction with BERT
```python
from transformers import BertTokenizer, BertForMaskedLM
import torch

# Load model and tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Prepare masked sentence
text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

# Find [MASK] position
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Extract Top-5 predicted tokens
mask_logits = logits[0, mask_token_index, :]
top5_tokens = torch.topk(mask_logits, 5, dim=1)

print(f"Input: {text}")
print("Top-5 Predictions:")
for i, (token_id, score) in enumerate(
    zip(top5_tokens.indices[0], top5_tokens.values[0])
):
    token = tokenizer.decode([token_id])
    print(f"  {i+1}. {token} (score: {score:.4f})")
```
Example output:
```
Input: The capital of France is [MASK].
Top-5 Predictions:
  1. paris (score: 18.2341)
  2. lyon (score: 12.1456)
  3. lille (score: 10.8923)
  4. toulouse (score: 10.5678)
  5. marseille (score: 10.3210)
```
9.2 Sentence Classification (Sentiment Analysis) Fine-tuning
```python
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
from datasets import load_dataset
import numpy as np

# Load dataset (SST-2)
dataset = load_dataset("glue", "sst2")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize
def tokenize_function(examples):
    return tokenizer(
        examples["sentence"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Initialize model (2-class classification)
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
)

# Accuracy metric (needed because metric_for_best_model="accuracy" below)
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

# Training settings (paper's recommended hyperparameters)
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,              # Paper recommendation: 2-4
    per_device_train_batch_size=32,  # Paper recommendation: 16 or 32
    learning_rate=2e-5,              # Paper recommendation: 2e-5 ~ 5e-5
    weight_decay=0.01,
    warmup_ratio=0.1,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

# Train with Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
)
trainer.train()
```
9.3 Question Answering
```python
from transformers import BertTokenizer, BertForQuestionAnswering
import torch

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = BertForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model.eval()

# Question and passage
question = "What is BERT?"
context = """BERT is a language representation model developed by Google.
It stands for Bidirectional Encoder Representations from Transformers.
BERT is designed to pre-train deep bidirectional representations from
unlabeled text by jointly conditioning on both left and right context
in all layers."""

# Tokenize
inputs = tokenizer(question, context, return_tensors="pt")

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits

# Extract answer
start_idx = torch.argmax(start_logits)
end_idx = torch.argmax(end_logits)
answer_tokens = inputs["input_ids"][0][start_idx : end_idx + 1]
answer = tokenizer.decode(answer_tokens)

print(f"Question: {question}")
print(f"Answer: {answer}")
# Output: Answer: a language representation model developed by google
```
9.4 Named Entity Recognition (NER)
```python
from transformers import pipeline

# Create NER pipeline. A checkpoint fine-tuned on CoNLL-2003 is required;
# plain bert-base-cased has no trained token-classification head.
ner_pipeline = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",  # cased model: case matters for NER
    aggregation_strategy="simple",
)

text = "Google released BERT in 2018 at their Mountain View headquarters."
results = ner_pipeline(text)

for entity in results:
    print(f"Entity: {entity['word']:20s} | "
          f"Label: {entity['entity_group']:5s} | "
          f"Score: {entity['score']:.4f}")
```
9.5 BERT Embedding Extraction (Feature-based Approach)
```python
from transformers import BertModel, BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

text = "BERT produces contextualized word embeddings."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# All layers' hidden states (13 tensors: embedding output + 12 layers)
all_hidden_states = outputs.hidden_states  # tuple of (batch, seq_len, 768)

# Sentence representation using [CLS] token (last layer)
sentence_embedding = outputs.last_hidden_state[:, 0, :]
print(f"Sentence embedding dimension: {sentence_embedding.shape}")  # (1, 768)

# Paper's approach: concatenate last 4 layers (Feature-based)
last_4_layers = torch.cat(
    [all_hidden_states[i] for i in [-4, -3, -2, -1]], dim=-1
)
print(f"Last 4 layers concatenation dimension: {last_4_layers.shape}")  # (1, seq_len, 3072)
```
10. BERT's Impact and Subsequent Research
BERT brought a paradigm shift to the field of NLP. The "pre-training + Fine-tuning" methodology has become the standard in NLP, and numerous follow-up studies aimed at overcoming BERT's limitations have emerged.
10.1 RoBERTa (Liu et al., 2019, Meta AI)
Robustly Optimized BERT Pretraining Approach. Without changing BERT's architecture, it significantly improved performance by optimizing only the training methodology.
Key changes:
- NSP Removal: Experimentally demonstrated that NSP actually hurts performance
- Dynamic Masking: While BERT fixes masking during data preprocessing (static), RoBERTa changes the masking pattern every epoch
- More Data: Increased 10x from 16GB to 160GB (added CC-News, OpenWebText, Stories)
- Larger Batches: Increased from 256 to 8,000
- Longer Training and Larger Vocabulary: more training steps, plus a byte-level BPE vocabulary of 50,000
Result: Surpassed BERT-Large on all tasks across GLUE, SQuAD, and RACE. This suggested that BERT's pre-training had been severely undertrained.
10.2 ALBERT (Lan et al., 2019, Google)
A Lite BERT. A model focused on parameter efficiency.
Key techniques:
- Factorized Embedding Parameterization: Separates the vocabulary Embedding dimension ($E$) from the Hidden dimension ($H$). Uses $V \times E + E \times H$ parameters instead of $V \times H$ for parameter reduction. Example: with $V = 30{,}000$, $H = 768$, $E = 128$, the embedding parameters shrink from about 23.0M to about 3.9M
- Cross-layer Parameter Sharing: All Transformer layers share the same parameters. Greatly reduces parameter count with minimal performance degradation
- Sentence Order Prediction (SOP): Replaces NSP with a task that determines whether two consecutive sentences are in the correct order. A harder task than NSP, more effective for learning inter-sentence relationships
Result: Achieved comparable or better performance with 18x fewer parameters than BERT-Large.
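The factorized-embedding arithmetic can be verified in a few lines. The sizes $V = 30{,}000$, $H = 768$, $E = 128$ follow the standard ALBERT-style example:

```python
V, H, E = 30_000, 768, 128

full = V * H                    # tied embedding: V x H parameters
factorized = V * E + E * H      # V x E lookup plus an E x H projection

print(f"V*H       = {full:,}")        # 23,040,000
print(f"V*E + E*H = {factorized:,}")  # 3,938,304
print(f"reduction = {full / factorized:.1f}x")
```

The saving grows with $H$, which is why factorization pays off most for very wide models.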
10.3 DistilBERT (Sanh et al., 2019, Hugging Face)
A model that compressed BERT using Knowledge Distillation.
- 40% smaller than BERT-Base (66M parameters, 6 layers)
- 60% faster than BERT-Base
- Retains 97% of BERT-Base performance
- Uses Distillation Loss that mimics the Teacher's (BERT-Base) Soft Labels during training
- 86.9 F1 on SQuAD v1.1 (BERT-Base: 88.5 F1)
It became the starting point for lightweight models suitable for mobile and edge device deployment.
10.4 ELECTRA (Clark et al., 2020, Google/Stanford)
Efficiently Learning an Encoder that Classifies Token Replacements Accurately. A model that fundamentally improved the training inefficiency of BERT's MLM.
Core idea:
- Generator-Discriminator Architecture: A small Generator (MLM) fills in masked tokens, and the Discriminator determines whether each token is original or generated by the Generator
- Learning from All Tokens: While BERT learns only from the 15% of masked tokens, ELECTRA learns "original vs. replaced" discrimination from all tokens, greatly improving training efficiency
- Replaced Token Detection (RTD): A more efficient pre-training objective than MLM
Result: Surpassed BERT, RoBERTa, and ALBERT under the same compute budget. The efficiency difference was especially pronounced for small models (ELECTRA-Small).
10.5 DeBERTa (He et al., 2020, Microsoft)
Decoding-enhanced BERT with Disentangled Attention. A model that improved the Attention mechanism itself.
Key innovations:
- Disentangled Attention: Existing BERT sums Token Embedding and Position Embedding before performing Attention, but DeBERTa performs separate Attention for Content and Position. It combines three types of Attention: Content-to-Content, Content-to-Position, and Position-to-Content
- Enhanced Mask Decoder: Injects absolute position information at the decoding layer, utilizing both relative and absolute position information
Result: Higher performance at the same model size compared to BERT and RoBERTa. It surpassed existing models with only half the training data and exceeded human performance on SuperGLUE.
10.6 Follow-up Research Lineage
```
BERT (2018)
├── RoBERTa (2019) ─── Training methodology optimization
├── ALBERT (2019) ─── Parameter efficiency
├── DistilBERT (2019) ─── Knowledge Distillation
├── ELECTRA (2020) ─── Training efficiency (RTD)
├── DeBERTa (2020) ─── Attention mechanism improvement
├── SpanBERT (2020) ─── Span-level masking
├── ERNIE (2019) ─── Knowledge-enhanced pre-training
└── ModernBERT (2024) ─── Modern architecture application
```
11. Limitations and Lessons
11.1 Limitations of BERT
1. Pre-train/Fine-tune Discrepancy
The [MASK] token appears only during pre-training and never during Fine-tuning. The 80/10/10 strategy mitigates but does not fundamentally resolve this. ELECTRA's RTD addressed this problem more elegantly.
2. Independence Assumption Between Masked Tokens
MLM predicts masked tokens independently. For example, if "New York" is fully masked, "New" and "York" are predicted independently of each other. In reality, there is a strong dependency between these two tokens. XLNet (Yang et al., 2019) attempted to solve this with Permutation Language Modeling.
3. Training Inefficiency
Since only 15% of tokens are predicted per batch, more training steps are needed for convergence. ELECTRA significantly improved this inefficiency by learning from all tokens.
4. Maximum Sequence Length Limitation
The 512-token maximum length limitation is unsuitable for processing long documents. Due to Self-Attention's $O(n^2)$ complexity in the sequence length $n$, it is difficult to significantly increase the sequence length. Longformer and BigBird mitigated this problem with Sparse Attention.
5. Unsuitable for Generation Tasks
As an Encoder-only model, BERT cannot be directly used for text generation tasks (summarization, translation, dialogue, etc.). Generation tasks are better suited for GPT-family (Decoder-only) or T5/BART-family (Encoder-Decoder) models.
6. Questionable NSP Effectiveness
RoBERTa's experiments revealed that NSP can actually hurt performance. The problem is that NSP is too easy a task — randomly selected sentences can be easily distinguished by topic differences alone.
11.2 Lessons from the Paper
1. The Power of Simple Ideas
BERT's core idea (learning bidirectional context) is remarkably simple. It achieved dramatic performance improvements not through complex architectural innovation but through a change in the pre-training objective (MLM).
2. The Effect of Scaling
The Ablation Study demonstrated that increasing model and data size consistently improves performance. This became the cornerstone of subsequent Scaling Law research and the development of large-scale models like GPT-3.
3. Establishing the Transfer Learning Paradigm
Just as ImageNet pre-training transformed Computer Vision, BERT proved that leveraging pre-trained language models should be the starting point for every task in NLP. Designing task-specific architectures from scratch was no longer necessary.
4. The Importance of Fair Comparison
By designing BERT-Base to match GPT-1 in size, the paper clearly showed that the difference in methodology, not architecture size, was the cause of performance improvement. This is an important lesson in research paper experimental design.
12. Conclusion
BERT completely changed the landscape of NLP since its publication in 2018. It implemented the simple yet powerful idea of bidirectional context learning through the elegant method of Masked Language Model, and proved its effectiveness by sweeping 11 benchmarks simultaneously.
BERT's greatest legacy is not the numbers on specific benchmarks. It is the establishment of the paradigm: "Pre-train on a large-scale unsupervised corpus, then Fine-tune with a small amount of labeled data." This paradigm has become the foundation of modern language models from GPT-3, T5, and PaLM to LLaMA.
Of course, BERT had its limitations. Issues such as masking mismatch, training inefficiency, and sequence length limitations have been addressed one by one by subsequent research including RoBERTa, ELECTRA, and Longformer. However, the fact that all these follow-up studies were built upon the foundation of BERT speaks to the historical significance of the BERT paper.
References
- Devlin, J., Chang, M.W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019. https://arxiv.org/abs/1810.04805
- Full paper (HTML version): https://ar5iv.labs.arxiv.org/html/1810.04805
- ACL Anthology: https://aclanthology.org/N19-1423/
- Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 2017. https://arxiv.org/abs/1706.03762
- Peters, M. et al. (2018). Deep contextualized word representations (ELMo). NAACL 2018. https://arxiv.org/abs/1802.05365
- Radford, A. et al. (2018). Improving Language Understanding by Generative Pre-Training (GPT). https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
- Liu, Y. et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. https://arxiv.org/abs/1907.11692
- Lan, Z. et al. (2019). ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. https://arxiv.org/abs/1909.11942
- Sanh, V. et al. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. https://arxiv.org/abs/1910.01108
- Clark, K. et al. (2020). ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. https://arxiv.org/abs/2003.10555
- He, P. et al. (2020). DeBERTa: Decoding-enhanced BERT with Disentangled Attention. https://arxiv.org/abs/2006.03654
- Jay Alammar, The Illustrated BERT, ELMo, and co.: https://jalammar.github.io/illustrated-bert/
- HuggingFace Transformers Documentation: https://huggingface.co/docs/transformers