Complete Analysis of the BERT Paper: How Bidirectional Transformers Changed the Landscape of NLP


1. Paper Overview

"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" was published in October 2018 by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova from Google AI Language. It was subsequently selected as Best Long Paper at NAACL 2019.

BERT stands for Bidirectional Encoder Representations from Transformers, and as the name implies, it is a language representation model that pre-trains the Transformer's Encoder bidirectionally. The core idea is simple: learn bidirectional context from a large-scale unsupervised text corpus, then perform minimal task-specific Fine-tuning.

This simple approach simultaneously achieved State-of-the-Art (SOTA) on 11 NLP benchmarks. It pushed the GLUE score to 80.5% (a 7.7% absolute improvement), recorded SQuAD v1.1 F1 of 93.2 (1.5 point improvement), and SQuAD v2.0 F1 of 83.1 (5.1 point improvement). BERT is the paper that established the Pre-training + Fine-tuning paradigm in NLP and became the starting point for virtually every language model that followed.


2. Background: Why Bidirectionality Was Needed

2.1 Two Branches of Pre-trained Language Representations

Before BERT, approaches leveraging pre-trained language representations were broadly split into two branches.

Feature-based Approach (ELMo)

ELMo (Embeddings from Language Models) proposed by Peters et al. (2018) independently trained a Forward LSTM and a Backward LSTM, then concatenated the hidden states from both directions to generate context-dependent word representations. ELMo representations were used as input features for downstream tasks, requiring a separate architecture design for each task.

$$\text{ELMo}_k = \gamma \sum_{j=0}^{L} s_j \cdot [\overrightarrow{h}_{k,j}; \overleftarrow{h}_{k,j}]$$

Here, $\overrightarrow{h}_{k,j}$ and $\overleftarrow{h}_{k,j}$ are the hidden states of the $j$-th layer of the Forward and Backward LSTMs, respectively, and $s_j$ are learnable weights. The key limitation is that the Forward and Backward directions are trained independently. The information from both directions does not interact in deeper layers.

Fine-tuning Approach (OpenAI GPT)

GPT (Generative Pre-Training) by Radford et al. (2018) used a Transformer Decoder for Left-to-Right Language Modeling during pre-training, then fine-tuned the entire model for downstream tasks. While it had the advantage of minimizing task-specific architecture changes, it had the fundamental limitation of encoding context only unidirectionally (Left-to-Right).

2.2 The Limitations of Unidirectionality

To understand the meaning of "bank" in "The bank of the river," one must consider not only "The" to the left of "bank" but also "river" to the right. A Left-to-Right model cannot reference "river" when encoding "bank." This was the fundamental limitation of GPT-1.

ELMo considers both directions, but it independently trains Forward and Backward and then simply concatenates them — a "shallow bidirectionality." True bidirectional representation requires that left and right context simultaneously interact and learn together at every layer.

2.3 Comparison of Three Approaches

| Property | ELMo | GPT-1 | BERT |
|---|---|---|---|
| Architecture | Bi-LSTM | Transformer Decoder | Transformer Encoder |
| Directionality | Shallow bidirectional (independent) | Unidirectional (L-to-R) | Deep bidirectional |
| Pre-training Objective | Forward + Backward LM | Left-to-Right LM | MLM + NSP |
| Downstream Application | Feature-based | Fine-tuning | Fine-tuning |
| Task-specific Architecture | Required for each task | Minimal changes | Minimal changes |
| Parameters | 94M | 117M | 110M (Base) / 340M (Large) |

BERT's core contribution was combining the strengths of both approaches. Like GPT, it is Fine-tuning-based, and like ELMo, it leverages bidirectional context — but instead of independent training, it learns bidirectional context simultaneously across all layers (deeply).


3. BERT Architecture

3.1 Transformer Encoder Stack

BERT uses only the Encoder portion of the Transformer architecture from Vaswani et al. (2017). It does not use the Decoder. Each Encoder layer consists of two Sub-layers.

  1. Multi-Head Self-Attention: All tokens in the input sequence attend to all other tokens bidirectionally
  2. Position-wise Feed-Forward Network: Performs nonlinear transformations independently on each token

Residual Connection and Layer Normalization are applied to each Sub-layer.

$$\text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))$$

Because it does not use a Decoder, there is no Auto-regressive Masking, and tokens at all positions can freely attend to tokens at all other positions. This is the structural basis that enables BERT to be a truly bidirectional (deeply bidirectional) model.

3.2 BERT-Base vs BERT-Large

The paper presents two model sizes.

| Setting | BERT-Base | BERT-Large |
|---|---|---|
| Layers ($L$) | 12 | 24 |
| Hidden Size ($H$) | 768 | 1,024 |
| Attention Heads ($A$) | 12 | 16 |
| Feed-Forward Size | 3,072 ($4H$) | 4,096 ($4H$) |
| Total Parameters | 110M | 340M |

BERT-Base was intentionally designed to have the same model size (number of layers, Hidden Size, Attention Heads) as GPT-1. This was to ensure a fair comparison showing that the difference in pre-training methodology, rather than architecture size, was the cause of performance improvement.

The dimension of each Attention Head is $d_k = H / A$. For BERT-Base this is $768 / 12 = 64$, and for BERT-Large it is $1024 / 16 = 64$, so both models have a per-head dimension of 64.

3.3 Activation Function: GELU

BERT uses GELU (Gaussian Error Linear Unit) as its activation function instead of the ReLU from the original Transformer paper.

$$\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$

Unlike ReLU, GELU activates smoothly depending on the input value and maintains a slight gradient even in the negative region. It has since become the standard in most Transformer-based models including GPT-2 and RoBERTa.
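As a quick sanity check, the exact erf-based form above can be implemented in a few lines (a minimal sketch of the exact formula, not the tanh approximation some implementations use):

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Smooth near zero and slightly negative for negative inputs, unlike ReLU:
for v in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(f"GELU({v:+.1f}) = {gelu(v):+.4f}")
```

Note how `gelu(-2.0)` is a small negative value rather than exactly zero: this is the "slight gradient in the negative region" mentioned above.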


4. Input Representation

BERT's input representation is composed of the sum of three Embeddings. This design allows a single model to handle both single-sentence and sentence-pair tasks.

4.1 WordPiece Tokenization

BERT uses a vocabulary of 30,000 WordPiece tokens. WordPiece is a variant of Byte-Pair Encoding (BPE) that merges subwords based on likelihood rather than frequency.

For example, "playing" is split into "play" + "##ing." The "##" prefix indicates that the token is a continuation of the previous token rather than the start of a word. The advantages of this approach are:

  • Solving the OOV (Out-of-Vocabulary) problem: Any word can be represented as a combination of subwords
  • Preserving morphological information: "play" and "playing" share the common token "play"
  • Vocabulary size control: All text can be covered with a limited vocabulary of 30,000
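At inference time, WordPiece splits each word greedily, longest-match-first, against the vocabulary. A minimal sketch with a tiny toy vocabulary (the real BERT vocabulary has 30,000 entries, and vocabulary *construction* is a separate likelihood-based training procedure):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split, as used at BERT inference time."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no subword matches: whole word is unknown
        tokens.append(match)
        start = end
    return tokens

# Toy vocabulary for illustration only
vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}
print(wordpiece_tokenize("playing", vocab))     # ['play', '##ing']
print(wordpiece_tokenize("unplayable", vocab))  # ['un', '##play', '##able']
```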

4.2 Special Tokens: [CLS] and [SEP]

BERT uses two special tokens.

[CLS] (Classification Token): Added at the very beginning of every input sequence. The final hidden state of this token is used as the aggregate representation of the entire sequence for classification tasks. Through Self-Attention, information from all tokens in the sequence converges at [CLS].

[SEP] (Separator Token): Used to separate two sentences. For single-sentence tasks, it is placed at the end of the sentence; for sentence-pair tasks, it is placed between the two sentences and at the end of the second sentence.

Input format examples:

Single sentence: [CLS] I love NLP [SEP]
Sentence pair:   [CLS] How old are you ? [SEP] I am 25 years old . [SEP]

4.3 Sum of Three Embeddings

The input representation of each token is generated by element-wise summation of the following three Embeddings.

Token Embedding: A learnable Embedding for WordPiece tokens. It has an Embedding matrix of size $\mathbb{R}^{30000 \times H}$ for a vocabulary size of 30,000.

Segment Embedding: A learnable Embedding that distinguishes whether the input belongs to Sentence A or Sentence B. Tokens belonging to Sentence A are represented by $E_A$, and those belonging to Sentence B by $E_B$. For single-sentence tasks, all tokens use $E_A$.

Position Embedding: A learnable Embedding that encodes position information within the sequence. Unlike the Sinusoidal Positional Encoding of the original Transformer paper, BERT uses learned Position Embeddings. The maximum sequence length is 512.

$$\text{Input}(x_i) = E_{\text{token}}(x_i) + E_{\text{segment}}(x_i) + E_{\text{position}}(i)$$

The sum of these three Embeddings is fed into the first layer of the Transformer Encoder.
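The three-way sum can be sketched directly in PyTorch (dimensions follow BERT-Base; the input ids below are random stand-ins, not a real tokenization):

```python
import torch
import torch.nn as nn

H, VOCAB_SIZE, MAX_LEN = 768, 30000, 512

token_emb    = nn.Embedding(VOCAB_SIZE, H)
segment_emb  = nn.Embedding(2, H)        # sentence A vs. sentence B
position_emb = nn.Embedding(MAX_LEN, H)  # learned, not sinusoidal

# Dummy 16-token sentence-pair input: sentence A (incl. [CLS] and the first
# [SEP]) spans 7 tokens, sentence B the remaining 9.
input_ids   = torch.randint(0, VOCAB_SIZE, (1, 16))
segment_ids = torch.tensor([[0] * 7 + [1] * 9])
positions   = torch.arange(16).unsqueeze(0)

x = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(positions)
print(x.shape)  # torch.Size([1, 16, 768])
```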


5. Pre-training Methodology

BERT's pre-training consists of two unsupervised tasks. The design of these two tasks is the core contribution of the BERT paper.

5.1 Masked Language Model (MLM)

Motivation: The Dilemma of Bidirectional Training

A standard Language Model predicts $P(w_t \mid w_1, ..., w_{t-1})$, the probability of the next token given the previous tokens. This is inherently Left-to-Right. If a Bidirectional Language Model were trained directly, each token could indirectly "see itself." In a multi-layer network, bidirectional context circulates, leaking information about the prediction target word.

BERT resolved this dilemma with the Masked Language Model, inspired by the Cloze Task (fill-in-the-blank).

15% Masking Strategy

In each training sequence, 15% of WordPiece tokens are randomly selected for masking. The training objective is to predict the original words of the selected tokens.

However, simply replacing with the [MASK] token creates a problem. Since the [MASK] token never appears in the input during Fine-tuning, a mismatch occurs between pre-training and Fine-tuning.

80/10/10 Rule

To mitigate this mismatch, the selected 15% of tokens are processed in the following proportions.

| Proportion | Processing Method | Example ("hairy" selected from "my dog is hairy") |
|---|---|---|
| 80% | Replace with [MASK] | my dog is [MASK] |
| 10% | Replace with random token | my dog is apple |
| 10% | Keep original token | my dog is hairy |

The specific effects of this strategy are as follows.

  • 80% [MASK]: The model learns to reconstruct the original word from context
  • 10% Random: Forces the model to be uncertain about whether any input token is the real word, maintaining correct representations at all positions
  • 10% Unchanged: Reduces the gap with actual inputs observed during Fine-tuning

The paper's Ablation Study showed that the 80/10/10 ratio was optimal, and using 100% [MASK] significantly degraded performance in the feature-based approach.
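The 15% selection with the 80/10/10 rule can be sketched as follows (a simplified illustration on whole words; a real implementation works on WordPiece ids and also avoids selecting special tokens like [CLS] and [SEP]):

```python
import random

MASK = "[MASK]"
RANDOM_VOCAB = ["dog", "cat", "apple", "store", "hairy"]  # toy stand-in for the 30K vocab

def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """Return (corrupted, labels). labels[i] holds the original token at
    selected positions and None elsewhere (no loss is computed there)."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:               # 85%: not selected
            continue
        labels[i] = tok                             # predict the original token here
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK                     # 80% of selected: [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(RANDOM_VOCAB) # 10%: random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels

random.seed(0)
corrupted, labels = mask_tokens("my dog is hairy and he likes the pet store".split())
print(corrupted)
print(labels)
```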

MLM Loss Function

The MLM loss function computes Cross-Entropy Loss only at masked positions.

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid \tilde{x})$$

Here, $\mathcal{M}$ is the set of masked token indices, $\tilde{x}$ is the masked input sequence, and $P(x_i \mid \tilde{x})$ is the probability of the model predicting the original token $x_i$ at masked position $i$.

Specifically, the final hidden state $h_i$ at each masked position is projected to the vocabulary size and Softmax is applied.

$$P(x_i \mid \tilde{x}) = \text{softmax}(W h_i + b)_{x_i}$$

Here, $W \in \mathbb{R}^{|V| \times H}$ is the output weight matrix and $|V|$ is the vocabulary size (30,000).

A downside of MLM is that only 15% of tokens are predicted per batch, so convergence is slower than standard Left-to-Right LM. However, the paper empirically demonstrated that the performance gains more than compensated for this cost.
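In practice, computing the loss only at masked positions is usually done by setting labels to an ignore index everywhere else. A sketch with random weights (the token ids 1037 and 2054 and the masked positions are arbitrary illustrations):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, H, seq_len = 30000, 768, 8

hidden = torch.randn(1, seq_len, H)   # final encoder hidden states (random stand-ins)
W = torch.randn(V, H)                 # output projection (tied to token embeddings in BERT)
logits = hidden @ W.T                 # (1, seq_len, V)

# Labels: original token ids at masked positions, -100 elsewhere.
# ignore_index=-100 restricts the loss to i in M, matching the MLM formula.
labels = torch.full((1, seq_len), -100)
labels[0, 2] = 1037
labels[0, 5] = 2054

loss = F.cross_entropy(logits.view(-1, V), labels.view(-1), ignore_index=-100)
print(loss.item())
```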

5.2 Next Sentence Prediction (NSP)

Motivation

Tasks like Question Answering (QA) and Natural Language Inference (NLI) require understanding the relationship between two sentences. This relationship is difficult to learn through simple Language Modeling alone.

Training Method

NSP is a binary classification task. Given two sentences A and B, it predicts whether B actually follows A (IsNext) or was randomly selected from the corpus (NotNext).

  • 50%: B is the actual next sentence after A (IsNext)
  • 50%: B is randomly selected from the corpus (NotNext)

[CLS] The man went to [MASK] store [SEP] He bought a gallon [MASK] milk [SEP]
Label: IsNext

[CLS] The man [MASK] to the store [SEP] Penguin [MASK] are flight ##less birds [SEP]
Label: NotNext

The final hidden state $C$ of the [CLS] token is fed into a binary classifier for prediction.

$$P(\text{IsNext} \mid C) = \text{softmax}(W_{\text{NSP}} \cdot C)$$

Effect and Controversy of NSP

In the paper's Ablation Study, removing NSP resulted in significant performance drops: QNLI (-3.5%), MNLI (-0.5%), SQuAD (-0.6%). The decline was particularly notable in tasks where sentence-pair relationship reasoning is important.

However, subsequent research by RoBERTa (Liu et al., 2019) questioned the effectiveness of NSP. RoBERTa achieved higher performance than BERT while removing NSP, arguing that NSP's effect was more heavily influenced by how the training data was constructed.

5.3 Total Pre-training Loss

The final pre-training loss is the sum of the MLM and NSP losses.

$$\mathcal{L} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}}$$

5.4 Pre-training Data and Settings

| Setting | Value |
|---|---|
| Training Data | BooksCorpus (800M words) + English Wikipedia (2,500M words) |
| Total Data Size | Approximately 16GB of text |
| Vocabulary Size | 30,000 WordPiece tokens |
| Max Sequence Length | 512 tokens |
| Batch Size | 256 sequences (128,000 tokens/batch) |
| Training Steps | 1,000,000 steps (approximately 40 epochs) |
| Optimizer | Adam ($\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-6}$) |
| Learning Rate | 1e-4 (10,000-step warmup, then linear decay) |
| Dropout | 0.1 (all layers) |
| Activation | GELU |
| Hardware | BERT-Base: 4 Cloud TPUs in Pod configuration (16 TPU chips); BERT-Large: 16 Cloud TPUs (64 TPU chips) |
| Training Time | 4 days each |

6. Fine-tuning Strategy

BERT's Fine-tuning is remarkably simple. A single task-specific output layer is added on top of the pre-trained BERT, and the entire model is Fine-tuned End-to-End. Compared to pre-training, Fine-tuning is very fast. For most tasks, it completes within 1 hour on a single Cloud TPU or within a few hours on GPUs.

6.1 Fine-tuning Hyperparameters

The following hyperparameter ranges work well for most tasks.

| Hyperparameter | Recommended Range |
|---|---|
| Batch Size | 16, 32 |
| Learning Rate (Adam) | 5e-5, 4e-5, 3e-5, 2e-5 |
| Epochs | 2, 3, 4 |
| Dropout | 0.1 (same as pre-training) |

6.2 Sentence/Sentence Pair Classification

For tasks such as sentiment analysis (SST-2), natural language inference (MNLI, RTE), and sentence similarity (STS-B, MRPC, QQP), the final hidden state of the [CLS] token, $C \in \mathbb{R}^H$, is fed into a classifier.

$$P(y \mid x) = \text{softmax}(W \cdot C + b)$$

Here, $W \in \mathbb{R}^{K \times H}$ and $K$ is the number of labels.

For sentence pair tasks, the input format is:

[CLS] Sentence A [SEP] Sentence B [SEP]

Segment Embedding distinguishes Sentence A from B, and the [CLS] representation encodes the relationship between the two sentences.

6.3 Question Answering (QA)

In SQuAD (Stanford Question Answering Dataset), given a question and passage, the model predicts the start and end positions of the answer span in the passage.

[CLS] Question [SEP] Passage [SEP]

A start vector $S \in \mathbb{R}^H$ and an end vector $E \in \mathbb{R}^H$ are learned, and the start/end probabilities are computed for each token $i$ in the passage.

$$P_{\text{start}}(i) = \frac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}}, \quad P_{\text{end}}(i) = \frac{e^{E \cdot T_i}}{\sum_j e^{E \cdot T_j}}$$

Here, $T_i$ is the final hidden state of token $i$. The score of a candidate answer span $(i, j)$ is computed as $S \cdot T_i + E \cdot T_j$, subject to $j \geq i$.
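Span selection under the $j \geq i$ constraint can be sketched as follows (the `max_answer_len` cap is a common practical addition rather than something from the paper, and the logits are dummy values):

```python
import torch

def best_span(start_logits, end_logits, max_answer_len=30):
    """Pick (i, j) maximizing start_logits[i] + end_logits[j], subject to j >= i."""
    L = start_logits.size(0)
    scores = start_logits.unsqueeze(1) + end_logits.unsqueeze(0)  # (L, L): row i, col j
    valid = torch.triu(torch.ones(L, L, dtype=torch.bool))        # keep j >= i
    valid &= ~torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=max_answer_len)
    scores = scores.masked_fill(~valid, float("-inf"))
    flat = scores.argmax().item()
    return flat // L, flat % L

start = torch.tensor([0.1, 2.0, 0.3, 0.2, 0.1])  # S . T_i per token (dummy values)
end   = torch.tensor([0.2, 0.1, 0.4, 3.0, 0.3])  # E . T_j per token
print(best_span(start, end))  # (1, 3)
```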

For SQuAD v2.0, which also requires handling cases with no answer, both start and end positions are set to [CLS] when there is no answer, and a "no answer" probability is computed.

6.4 Named Entity Recognition (NER)

In the CoNLL-2003 NER task, each token is classified as Person, Organization, Location, Miscellaneous, or Other. Here, the final hidden state of each token is fed into the classifier, not the [CLS] representation.

$$P(y_i \mid x) = \text{softmax}(W_{\text{NER}} \cdot T_i + b_{\text{NER}})$$

For subword tokens split by WordPiece, the prediction of the first subword is typically used as the label for the corresponding word.
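The first-subword labeling convention can be sketched as follows (the numeric label ids are hypothetical; -100 is the ignore-index convention used by PyTorch's cross-entropy so the extra subwords contribute no loss):

```python
def align_labels(word_labels, subword_counts):
    """Assign each word's label to its first subword; later subwords get -100."""
    token_labels = []
    for label, n in zip(word_labels, subword_counts):
        token_labels.append(label)
        token_labels.extend([-100] * (n - 1))
    return token_labels

# e.g. "headquarters" -> ["head", "##quarters"]: 2 subwords, label only on "head"
words          = ["Google", "headquarters", "in", "Mountain", "View"]
word_labels    = [3, 0, 0, 5, 6]   # hypothetical ids, e.g. B-ORG, O, O, B-LOC, I-LOC
subword_counts = [1, 2, 1, 1, 1]   # subwords per word after WordPiece
print(align_labels(word_labels, subword_counts))  # [3, 0, -100, 0, 5, 6]
```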

6.5 Sequence Labeling Generalization

The same approach can be applied to all tasks that assign a label to each token in a sequence, including NER, POS Tagging, and Chunking. Because each of BERT's token representations sufficiently encodes bidirectional context, high performance is achieved without additional sequence modeling layers like CRF (Conditional Random Field).

6.6 Feature-based Approach

BERT can also be used as a Feature-based approach beyond Fine-tuning. The Transformer Encoder is frozen, and hidden states from specific layers are extracted as features to be fed into a separate model.

In the paper's experiment applying the Feature-based approach to CoNLL-2003 NER, concatenating the hidden states of the last 4 layers achieved 96.1% F1 (dev), close to the Fine-tuning approach's 96.4% F1 (dev). This demonstrates that BERT's pre-trained representations themselves contain very rich linguistic information.


7. Summary of Key Formulas

7.1 Self-Attention (Scaled Dot-Product Attention)

The Self-Attention formula performed in each Encoder layer of BERT.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Here, $Q = XW^Q$, $K = XW^K$, $V = XW^V$, and $d_k = H / A$ is the dimension of each Attention Head.
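A minimal single-head sketch of this formula in PyTorch, with random weights and BERT-Base dimensions:

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V; no causal mask, so attention is fully bidirectional."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V, weights

torch.manual_seed(0)
H, d_k, seq_len = 768, 64, 6                       # d_k = H / A = 768 / 12
X = torch.randn(seq_len, H)
W_Q, W_K, W_V = (torch.randn(H, d_k) * H ** -0.5 for _ in range(3))
out, w = attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape, w.shape)  # torch.Size([6, 64]) torch.Size([6, 6])
```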

7.2 Multi-Head Attention

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_A) \cdot W^O$$

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

BERT-Base uses $A=12$ and BERT-Large uses $A=16$ heads.

7.3 Feed-Forward Network

The Position-wise FFN after each Attention Sub-layer.

$$\text{FFN}(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2$$

Here, $W_1 \in \mathbb{R}^{H \times 4H}$ and $W_2 \in \mathbb{R}^{4H \times H}$. It is an expand-and-project structure whose inner dimension is 4 times the Hidden Size.
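A sketch of the position-wise FFN with BERT-Base dimensions, applied independently at every token position:

```python
import torch
import torch.nn as nn

H = 768
ffn = nn.Sequential(
    nn.Linear(H, 4 * H),   # expand: 768 -> 3072
    nn.GELU(),
    nn.Linear(4 * H, H),   # project back: 3072 -> 768
)

x = torch.randn(2, 16, H)  # (batch, seq_len, hidden)
y = ffn(x)                 # nn.Linear acts on the last dim, i.e. position-wise
print(y.shape)  # torch.Size([2, 16, 768])
```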

7.4 MLM Loss (Summary)

Cross-Entropy Loss over the set of masked tokens $\mathcal{M}$.

$$\mathcal{L}_{\text{MLM}} = -\frac{1}{|\mathcal{M}|}\sum_{i \in \mathcal{M}} \log \frac{\exp(W_{x_i} \cdot h_i)}{\sum_{v=1}^{|V|} \exp(W_v \cdot h_i)}$$

Here, $h_i$ is the final hidden state at masked position $i$, $W_v$ is the $v$-th row of the output Embedding matrix, and $|V| = 30{,}000$.

7.5 NSP Loss

$$\mathcal{L}_{\text{NSP}} = -\left[y \log P(\text{IsNext}) + (1-y) \log P(\text{NotNext})\right]$$

Here, $y \in \{0, 1\}$ is the actual label.

7.6 Total Pre-training Objective

$$\mathcal{L}_{\text{pretrain}} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}}$$

8. Experimental Results

8.1 GLUE Benchmark

GLUE (General Language Understanding Evaluation) is a benchmark consisting of 9 NLU tasks. BERT-Large achieved an overall GLUE score of 80.5%, recording a 7.7% absolute improvement over the previous SOTA.

| Task | Metric | BERT-Base | BERT-Large | Previous SOTA |
|---|---|---|---|---|
| MNLI-m / MNLI-mm | Accuracy | 84.6 / 83.4 | 86.7 / 85.9 | 80.6 / 80.1 |
| QQP | F1 | 71.2 | 72.1 | 66.1 |
| QNLI | Accuracy | 90.5 | 92.7 | 87.4 |
| SST-2 | Accuracy | 93.5 | 94.9 | 93.2 |
| CoLA | Matthews Corr | 52.1 | 60.5 | 35.0 |
| STS-B | Spearman Corr | 85.8 | 86.5 | 81.0 |
| MRPC | F1 | 88.9 | 89.3 | 86.0 |
| RTE | Accuracy | 66.4 | 70.1 | 61.7 |
| WNLI | Accuracy | - | 65.1 | 65.1 |

The 25.5% absolute improvement over the previous SOTA on CoLA is particularly impressive. CoLA (Corpus of Linguistic Acceptability) is a task that judges grammatical acceptability of sentences, requiring deep language understanding.

8.2 SQuAD v1.1

SQuAD v1.1 (Stanford Question Answering Dataset) is an Extractive QA task that extracts answer spans from passages.

| Model | EM (Exact Match) | F1 |
|---|---|---|
| Previous SOTA (single model) | 84.4 | 91.0 |
| BERT-Large (single model) | 84.1 | 90.9 |
| Previous SOTA (ensemble) | 86.7 | 91.7 |
| BERT-Large + TriviaQA (ensemble) | 87.4 | 93.2 |

The ensemble model leveraging TriviaQA data achieved F1 of 93.2, surpassing human performance (91.2 F1) for the first time.

8.3 SQuAD v2.0

SQuAD v2.0 adds unanswerable questions to v1.1.

| Model | EM | F1 |
|---|---|---|
| Previous SOTA | 73.7 | 77.0 |
| BERT-Large (single model) | 80.0 | 83.1 |

It achieved a 6.1 point improvement in F1 over the previous SOTA. BERT's bidirectional context understanding showed particular strength in the ability to determine whether a question is answerable.

8.4 SWAG

SWAG (Situations With Adversarial Generations) is a commonsense reasoning task that selects the most appropriate continuation sentence from 4 candidates given a context sentence.

| Model | Dev Accuracy | Test Accuracy |
|---|---|---|
| Human | - | 88.0 |
| ESIM + ELMo | 51.9 | 52.7 |
| OpenAI GPT | - | 78.0 |
| BERT-Base | 81.6 | - |
| BERT-Large | 86.6 | 86.3 |

BERT-Large surpassed GPT by 8.3% and approached human performance (88.0%).

8.5 Ablation Study: Impact of Pre-training Tasks

The paper analyzed the importance of pre-training tasks using BERT-Base as the baseline.

| Setting | MNLI-m | QNLI | MRPC | SST-2 | SQuAD F1 |
|---|---|---|---|---|---|
| BERT-Base (MLM + NSP) | 84.4 | 88.4 | 86.7 | 92.7 | 88.5 |
| No NSP | 83.9 | 84.9 | 86.5 | 92.6 | 87.9 |
| LTR (Left-to-Right) & No NSP | 82.1 | 84.3 | 77.5 | 92.1 | 77.8 |
| LTR + BiLSTM & No NSP | 82.1 | 84.1 | 75.7 | 91.6 | 84.9 |

Key findings:

  1. NSP Removal: 3.5% drop on QNLI; significant impact on tasks where sentence-pair relationships matter
  2. Left-to-Right: 10.7 F1 drop on SQuAD compared to bidirectional; bidirectionality is critical for token-level tasks
  3. LTR + BiLSTM: Even adding a BiLSTM cannot substitute for bidirectional pre-training

8.6 Ablation Study: Impact of Model Size

| Setting | L | H | A | Params | MNLI-m | MRPC | SST-2 | SQuAD F1 |
|---|---|---|---|---|---|---|---|---|
| 3-layer | 3 | 768 | 12 | 45M | 77.9 | 79.8 | 88.4 | 75.6 |
| 6-layer | 6 | 768 | 12 | 67M | 80.6 | 84.3 | 91.1 | 83.7 |
| BERT-Base | 12 | 768 | 12 | 110M | 84.4 | 86.7 | 92.7 | 88.5 |
| BERT-Large | 24 | 1,024 | 16 | 340M | 86.6 | 87.8 | 93.7 | 91.3 |

Increasing model size resulted in consistent performance improvements across all tasks. It is particularly noteworthy that the larger model performed better even on small datasets (MRPC: 3,600 training examples). This suggests that pre-training provides sufficient knowledge to effectively train large models even with limited Fine-tuning data.


9. Code Examples: HuggingFace Transformers

9.1 Masked Language Model Prediction with BERT

from transformers import BertTokenizer, BertForMaskedLM
import torch

# Load model and tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Prepare masked sentence
text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

# Find [MASK] position
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Extract Top-5 predicted tokens
mask_logits = logits[0, mask_token_index, :]
top5_tokens = torch.topk(mask_logits, 5, dim=1)

print(f"Input: {text}")
print("Top-5 Predictions:")
for i, (token_id, score) in enumerate(
    zip(top5_tokens.indices[0], top5_tokens.values[0])
):
    token = tokenizer.decode([token_id])
    print(f"  {i+1}. {token} (score: {score:.4f})")

Example output:

Input: The capital of France is [MASK].
Top-5 Predictions:
  1. paris (score: 18.2341)
  2. lyon (score: 12.1456)
  3. lille (score: 10.8923)
  4. toulouse (score: 10.5678)
  5. marseille (score: 10.3210)

9.2 Sentence Classification (Sentiment Analysis) Fine-tuning

from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
from datasets import load_dataset

# Load dataset (SST-2)
dataset = load_dataset("glue", "sst2")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize
def tokenize_function(examples):
    return tokenizer(
        examples["sentence"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Initialize model (2-class classification)
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
)

# Training settings (paper's recommended hyperparameters)
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,              # Paper recommendation: 2-4
    per_device_train_batch_size=32,  # Paper recommendation: 16 or 32
    learning_rate=2e-5,              # Paper recommendation: 2e-5 ~ 5e-5
    weight_decay=0.01,
    warmup_ratio=0.1,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

# Accuracy metric (required because metric_for_best_model="accuracy" is set above;
# without compute_metrics, the Trainer would only report eval_loss)
import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

# Train with Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
)

trainer.train()

9.3 Question Answering

from transformers import BertTokenizer, BertForQuestionAnswering
import torch

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = BertForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model.eval()

# Question and passage
question = "What is BERT?"
context = """BERT is a language representation model developed by Google.
It stands for Bidirectional Encoder Representations from Transformers.
BERT is designed to pre-train deep bidirectional representations from
unlabeled text by jointly conditioning on both left and right context
in all layers."""

# Tokenize
inputs = tokenizer(question, context, return_tensors="pt")

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits

# Extract answer
start_idx = torch.argmax(start_logits)
end_idx = torch.argmax(end_logits)

answer_tokens = inputs["input_ids"][0][start_idx : end_idx + 1]
answer = tokenizer.decode(answer_tokens)
print(f"Question: {question}")
print(f"Answer: {answer}")
# Output: Answer: a language representation model developed by google

9.4 Named Entity Recognition (NER)

from transformers import pipeline

# Create NER pipeline. The base bert-base-cased checkpoint has no trained NER
# head, so a BERT checkpoint fine-tuned on CoNLL-2003 is used instead
# (a cased model, since case information is important for NER).
ner_pipeline = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",
)

text = "Google released BERT in 2018 at their Mountain View headquarters."
results = ner_pipeline(text)

for entity in results:
    print(f"Entity: {entity['word']:20s} | "
          f"Label: {entity['entity_group']:5s} | "
          f"Score: {entity['score']:.4f}")

9.5 BERT Embedding Extraction (Feature-based Approach)

from transformers import BertModel, BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

text = "BERT produces contextualized word embeddings."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# All layers' hidden states (13: embedding + 12 layers)
all_hidden_states = outputs.hidden_states  # tuple of (batch, seq_len, 768)

# Sentence representation using [CLS] token (last layer)
sentence_embedding = outputs.last_hidden_state[:, 0, :]
print(f"Sentence embedding dimension: {sentence_embedding.shape}")  # (1, 768)

# Paper's approach: concatenate last 4 layers (Feature-based)
last_4_layers = torch.cat(
    [all_hidden_states[i] for i in [-4, -3, -2, -1]], dim=-1
)
print(f"Last 4 layers concatenation dimension: {last_4_layers.shape}")  # (1, seq_len, 3072)

10. BERT's Impact and Subsequent Research

BERT brought a paradigm shift to the field of NLP. The "pre-training + Fine-tuning" methodology has become the standard in NLP, and numerous follow-up studies aimed at overcoming BERT's limitations have emerged.

10.1 RoBERTa (Liu et al., 2019, Meta AI)

Robustly Optimized BERT Pretraining Approach. Without changing BERT's architecture, it significantly improved performance by optimizing only the training methodology.

Key changes:

  • NSP Removal: Experimentally demonstrated that NSP actually hurts performance
  • Dynamic Masking: While BERT fixes masking during data preprocessing (static), RoBERTa changes the masking pattern every epoch
  • More Data: Increased 10x from 16GB to 160GB (added CC-News, OpenWebText, Stories)
  • Larger Batches: Increased from 256 to 8,000
  • Longer Training and a Larger Vocabulary: more training steps, plus a 50,000-entry byte-level BPE vocabulary

Result: Surpassed BERT-Large on all tasks across GLUE, SQuAD, and RACE. This suggested that BERT's pre-training had been severely undertrained.

10.2 ALBERT (Lan et al., 2019, Google)

A Lite BERT. A model focused on parameter efficiency.

Key techniques:

  • Factorized Embedding Parameterization: Separates the vocabulary Embedding dimension ($E$) from the Hidden dimension ($H$). Uses $V \times E + E \times H$ parameters instead of $V \times H$. Example: with $E=128$, the $30000 \times 768 \approx 23\text{M}$ parameters shrink to $30000 \times 128 + 128 \times 768 \approx 3.9\text{M}$
  • Cross-layer Parameter Sharing: All Transformer layers share the same parameters. Greatly reduces parameter count with minimal performance degradation
  • Sentence Order Prediction (SOP): Replaces NSP with a task that determines whether two consecutive sentences are in the correct order. A harder task than NSP, more effective for learning inter-sentence relationships

Result: Achieved comparable or better performance with 18x fewer parameters than BERT-Large.
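The parameter arithmetic in the factorized-embedding bullet above can be checked directly:

```python
V, H, E = 30000, 768, 128    # vocab size, hidden dim, factorized embedding dim

full       = V * H            # standard BERT: one V x H embedding matrix
factorized = V * E + E * H    # ALBERT: V x E lookup, then an E x H projection

print(f"full:       {full:,}")        # 23,040,000 (~23M)
print(f"factorized: {factorized:,}")  # 3,938,304  (~3.9M)
```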

10.3 DistilBERT (Sanh et al., 2019, Hugging Face)

A model that compressed BERT using Knowledge Distillation.

  • 40% the size of BERT-Base (66M parameters, 6 layers)
  • 60% faster than BERT-Base
  • Retains 97% of BERT-Base performance
  • Uses Distillation Loss that mimics the Teacher's (BERT-Base) Soft Labels during training
  • 86.9 F1 on SQuAD v1.1 (BERT-Base: 88.5 F1)

It became the starting point for lightweight models suitable for mobile and edge device deployment.

10.4 ELECTRA (Clark et al., 2020, Google/Stanford)

Efficiently Learning an Encoder that Classifies Token Replacements Accurately. A model that fundamentally improved the training inefficiency of BERT's MLM.

Core idea:

  • Generator-Discriminator Architecture: A small Generator (MLM) fills in masked tokens, and the Discriminator determines whether each token is original or generated by the Generator
  • Learning from All Tokens: While BERT learns only from the 15% of masked tokens, ELECTRA learns "original vs. replaced" discrimination from all tokens, greatly improving training efficiency
  • Replaced Token Detection (RTD): A more efficient pre-training objective than MLM

Result: Surpassed BERT, RoBERTa, and ALBERT under the same compute budget. The efficiency difference was especially pronounced for small models (ELECTRA-Small).

10.5 DeBERTa (He et al., 2020, Microsoft)

Decoding-enhanced BERT with Disentangled Attention. A model that improved the Attention mechanism itself.

Key innovations:

  • Disentangled Attention: Existing BERT sums Token Embedding and Position Embedding before performing Attention, but DeBERTa performs separate Attention for Content and Position. It combines three types of Attention: Content-to-Content, Content-to-Position, and Position-to-Content
  • Enhanced Mask Decoder: Injects absolute position information at the decoding layer, utilizing both relative and absolute position information

Result: Higher performance at the same model size compared to BERT and RoBERTa. It surpassed existing models with only half the training data and exceeded human performance on SuperGLUE.

10.6 Follow-up Research Lineage

BERT (2018)
├── RoBERTa (2019) ─── Training methodology optimization
├── ALBERT (2019) ─── Parameter efficiency
├── DistilBERT (2019) ─── Knowledge Distillation
├── ELECTRA (2020) ─── Training efficiency (RTD)
├── DeBERTa (2020) ─── Attention mechanism improvement
├── SpanBERT (2020) ─── Span-level masking
├── ERNIE (2019) ─── Knowledge-enhanced pre-training
└── ModernBERT (2024) ─── Modern architecture application

11. Limitations and Lessons

11.1 Limitations of BERT

1. Pre-train/Fine-tune Discrepancy

The [MASK] token appears only during pre-training and never during Fine-tuning. The 80/10/10 strategy mitigates but does not fundamentally resolve this. ELECTRA's RTD addressed this problem more elegantly.
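The 80/10/10 strategy is easy to state precisely in code. The sketch below is a minimal illustration of the masking rule described in the BERT paper, not the original implementation; `mask_tokens` is a name chosen here, and note that even the 10% of selected tokens left unchanged still serve as prediction targets.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    # BERT's masking rule: select ~15% of positions as prediction targets.
    # Of those, 80% become [MASK], 10% become a random token, and 10%
    # are left unchanged (so the model cannot rely on [MASK] being present).
    out, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok  # prediction target regardless of the branch below
            r = random.random()
            if r < 0.8:
                out[i] = "[MASK]"
            elif r < 0.9:
                out[i] = random.choice(vocab)
            # else: keep the original token unchanged
    return out, targets
```

The discrepancy discussed above is visible here: `[MASK]` enters the input distribution only through this function, which is never applied at Fine-tuning time.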

2. Independence Assumption Between Masked Tokens

MLM predicts masked tokens independently. For example, if "New York" is fully masked, "New" and "York" are predicted independently of each other. In reality, there is a strong dependency between these two tokens. XLNet (Yang et al., 2019) attempted to solve this with Permutation Language Modeling.
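Formally, when both tokens are masked, MLM factorizes the joint prediction as a product of independent conditionals given the unmasked context CC:

P(New,YorkC)P(NewC)P(YorkC)P(\text{New}, \text{York} \mid C) \approx P(\text{New} \mid C) \cdot P(\text{York} \mid C)

whereas the true joint distribution includes the strong dependency P(YorkNew,C)P(\text{York} \mid \text{New}, C).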

3. Training Inefficiency

Since only 15% of tokens are predicted per batch, more training steps are needed for convergence. ELECTRA largely eliminated this inefficiency by learning from all tokens.

4. Maximum Sequence Length Limitation

The 512-token maximum length limitation is unsuitable for processing long documents. Due to Self-Attention's O(n2)O(n^2) complexity, it is difficult to significantly increase the sequence length. Longformer and BigBird mitigated this problem with Sparse Attention.

5. Unsuitable for Generation Tasks

As an Encoder-only model, BERT cannot be directly used for text generation tasks (summarization, translation, dialogue, etc.). Generation tasks are better suited for GPT-family (Decoder-only) or T5/BART-family (Encoder-Decoder) models.

6. Questionable NSP Effectiveness

RoBERTa's experiments revealed that NSP can actually hurt performance. The problem is that NSP is too easy a task — randomly selected sentences can be easily distinguished by topic differences alone.

11.2 Lessons from the Paper

1. The Power of Simple Ideas

BERT's core idea (learning bidirectional context) is remarkably simple. It achieved dramatic performance improvements not through complex architectural innovation but through a change in the pre-training objective (MLM).

2. The Effect of Scaling

The Ablation Study demonstrated that increasing model and data size consistently improves performance. This became the cornerstone of subsequent Scaling Law research and the development of large-scale models like GPT-3.

3. Establishing the Transfer Learning Paradigm

Just as ImageNet pre-training transformed Computer Vision, BERT proved that a pre-trained language model can serve as the starting point for virtually every NLP task. Designing task-specific architectures from scratch was no longer necessary.

4. The Importance of Fair Comparison

By designing BERT-Base to match GPT-1 in size, the paper clearly showed that the difference in methodology, not architecture size, was the cause of performance improvement. This is an important lesson in research paper experimental design.


12. Conclusion

Since its publication in 2018, BERT has completely changed the landscape of NLP. It implemented the simple yet powerful idea of bidirectional context learning through the elegant method of the Masked Language Model, and proved its effectiveness by sweeping 11 benchmarks simultaneously.

BERT's greatest legacy is not the numbers on specific benchmarks. It is the establishment of the paradigm: "Pre-train on a large-scale unsupervised corpus, then Fine-tune with a small amount of labeled data." This paradigm has become the foundation of modern language models from GPT-3, T5, and PaLM to LLaMA.

Of course, BERT had its limitations. Issues such as masking mismatch, training inefficiency, and sequence length limitations have been addressed one by one by subsequent research including RoBERTa, ELECTRA, and Longformer. However, the fact that all these follow-up studies were built upon the foundation of BERT speaks to the historical significance of the BERT paper.


References