- Authors
- Name
- 1. Paper Overview
- 2. Background and Motivation: The Limitations of RNN/LSTM
- 3. Self-Attention Mechanism
- 4. Scaled Dot-Product Attention
- 5. Multi-Head Attention
- 6. Positional Encoding
- 7. Full Encoder-Decoder Architecture
- 8. Feed-Forward Network, Layer Normalization, Residual Connection
- 9. Training Strategy
- 10. Key Experimental Results
- 11. Impact on Subsequent Research
- 12. Core PyTorch Code Examples
- 13. Conclusion
- References
1. Paper Overview
"Attention Is All You Need" is a paper presented at NeurIPS 2017, co-authored by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin from Google Brain and Google Research. This paper demonstrated that a Sequence-to-Sequence model could be constructed using only the Attention mechanism, completely eliminating traditional Recurrence and Convolution — a true turning point in the history of deep learning.
The Transformer architecture proposed in the paper achieved 28.4 BLEU on the WMT 2014 English-to-German translation task and 41.8 BLEU on English-to-French, surpassing all existing models. More importantly, this architecture subsequently became the foundation for virtually all major modern AI models, including BERT, GPT, T5, and ViT.
2. Background and Motivation: The Limitations of RNN/LSTM
2.1 The Bottleneck of Sequential Processing
Before the Transformer, the standard for Sequence Modeling was the RNN (Recurrent Neural Network) and its variants, LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit). These architectures process sequences in order, updating the hidden state at each step.
This sequential nature gave rise to two fundamental problems.
First, parallelization was impossible. Since the computation at each time step depends on the result of the previous step, the parallel processing capabilities of GPUs could not be effectively utilized. Training time increased linearly as sequence length grew.
Second, there was the Long-range Dependency problem. Although LSTM was theoretically capable of learning long-term dependencies, in practice it became increasingly difficult to capture relationships between distant tokens as sequences grew longer, because all past information must be compressed into a fixed-size vector called the hidden state.
2.2 The Emergence and Limitations of Attention
The Attention mechanism proposed by Bahdanau et al. (2014) greatly alleviated the Long-range Dependency problem by allowing the Decoder to directly access all of the Encoder's hidden states. However, since Attention was still added on top of RNNs, the bottleneck of Sequential Processing remained.
The paper's core question was precisely this: "Is Attention alone sufficient, without Recurrence?"
The answer was Yes, and the result was the Transformer.
3. Self-Attention Mechanism
3.1 Core Concept: Query, Key, Value
The key idea behind Self-Attention is that each token in a sequence directly computes its relationship with every other token. To achieve this, each input vector is transformed into three roles.
- Query (Q): "What information am I looking for?"
- Key (K): "What is the identifier of the information I can provide?"
- Value (V): "What is the actual information I convey?"
Given an input sequence $X \in \mathbb{R}^{n \times d_{model}}$, Q, K, and V are generated through learnable weight matrices:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

where $W^Q, W^K \in \mathbb{R}^{d_{model} \times d_k}$ and $W^V \in \mathbb{R}^{d_{model} \times d_v}$.
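These projections can be sketched in a few lines of PyTorch (the sizes and random weights here are purely illustrative; in a real model the `W` matrices are learned parameters):

```python
import torch

torch.manual_seed(0)
n, d_model, d_k, d_v = 4, 512, 64, 64   # illustrative sizes

X = torch.randn(n, d_model)             # input sequence of n token vectors
W_q = torch.randn(d_model, d_k)         # learnable in a real model
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_v)

Q, K, V = X @ W_q, X @ W_k, X @ W_v
print(Q.shape, K.shape, V.shape)        # each token now has a Query, Key, and Value
```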
3.2 Intuitive Understanding
An information retrieval analogy makes this easier to understand. Imagine searching for a book in a library: the Query is the search term "deep learning introductory book," the Key is each book's title or tag, and the Value is the actual content of the book. The essence of Self-Attention is retrieving more Value from books whose Key has higher similarity to the Query.
The decisive difference between Self-Attention and RNNs is that the path length between any two tokens in the sequence is always $O(1)$. For RNNs it is $O(n)$, and for CNNs it is $O(\log_k n)$ (dilated) or $O(n/k)$ (standard, with kernel size $k$). This short path length is what enables effective learning of Long-range Dependencies.
4. Scaled Dot-Product Attention
4.1 Formula
The exact formula for the Attention function proposed in the paper is as follows.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Let us decompose this formula step by step.
Step 1: Similarity Computation ($QK^\top$)
The Dot Product of every Query with every Key is computed. The result is an $n \times n$ Attention Score matrix whose $(i, j)$ element represents the similarity between the Query of the $i$-th token and the Key of the $j$-th token.
Step 2: Scaling ($1/\sqrt{d_k}$)
As $d_k$ grows larger, the variance of the Dot Product values increases, pushing the Softmax into regions where its gradients become extremely small. Specifically, if each component of $q$ and $k$ is an independent random variable with mean 0 and variance 1, the variance of $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ is $d_k$. Dividing by $\sqrt{d_k}$ normalizes the variance back to 1, allowing the Softmax to operate stably.
The paper also confirmed the importance of this Scaling experimentally: when $d_k$ was small, Additive Attention and Dot-Product Attention performed similarly, but when $d_k$ was large, Dot-Product Attention without Scaling degraded significantly.
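The variance argument above is easy to check numerically (sample sizes here are arbitrary; components are drawn i.i.d. from $N(0, 1)$):

```python
import torch

torch.manual_seed(0)
for d_k in (16, 64, 256):
    q = torch.randn(100_000, d_k)       # components ~ N(0, 1)
    k = torch.randn(100_000, d_k)
    dots = (q * k).sum(dim=-1)          # raw dot products: variance grows with d_k
    scaled = dots / d_k ** 0.5          # after scaling: variance stays near 1
    print(f"d_k={d_k:4d}  var(q.k)={dots.var():8.1f}  var(scaled)={scaled.var():.3f}")
```

The raw dot-product variance tracks $d_k$ almost exactly, while the scaled variance stays near 1 regardless of $d_k$.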
Step 3: Softmax
Softmax is applied to the scaled scores to obtain Attention Weights. Since each row sums to 1, these serve as weights for a weighted average over the Values.
Step 4: Weighted Sum with Values
Finally, multiplying the Attention Weights by the Values yields, for each token, an output vector that sums all tokens' Values weighted by their relevance.
4.2 Masking
In the Decoder's Self-Attention, information from future tokens must be prevented from leaking to the current token. To achieve this, Masked Attention sets the scores at future positions to $-\infty$ before Softmax:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$

Here, $M$ is a mask matrix with $-\infty$ at disallowed (strictly upper-triangular) positions and $0$ at allowed positions.
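A minimal sketch of building and applying such a causal mask (using the same `mask == 0` convention as the code in Section 12; the dummy scores are all zeros, so each unmasked row becomes a uniform distribution):

```python
import torch

seq_len = 4
# mask[i, j] = 1 if position j is visible from position i (j <= i), else 0
mask = torch.tril(torch.ones(seq_len, seq_len))
scores = torch.zeros(seq_len, seq_len)            # dummy pre-softmax scores
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(scores, dim=-1)
print(weights)
# Row i attends uniformly over positions 0..i; future positions get exactly 0
```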
5. Multi-Head Attention
5.1 Limitations of Single Attention
Using only a single Attention function forces the model to capture token relationships from only one perspective. For example, in the sentence "The cat sat on the mat because it was tired," it becomes difficult to simultaneously capture the syntactic relationship that "it" refers to "cat" and the semantic relationship that "tired" describes the state of "cat."
5.2 Multi-Head Attention Structure
The paper solved this problem by running multiple Attention functions in parallel.
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O$$

where each head is defined as follows.

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\ KW_i^K,\ VW_i^V)$$

The weight matrices for each head are $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, and the final output projection is $W^O \in \mathbb{R}^{hd_v \times d_{model}}$.
5.3 Paper's Configuration
The paper used $h = 8$ heads with $d_k = d_v = d_{model}/h = 64$. Because the total dimension is divided among the heads, the total computational cost of Multi-Head Attention is nearly identical to that of Single-Head Attention with full dimensionality.
According to the paper's Ablation Study, BLEU dropped by 0.9 points when only 1 head was used, and when there were too many heads (e.g., $h = 32$), $d_k$ became too small, actually hurting performance.
5.4 Three Usage Patterns
Multi-Head Attention is used in three places within the Transformer.
- Encoder Self-Attention: Within the Encoder, each token of the input sequence attends to all other tokens. Q, K, and V are all generated from the output of the previous Encoder layer.
- Decoder Self-Attention (Masked): Masked Attention within the Decoder that can only reference tokens generated so far.
- Encoder-Decoder Attention (Cross-Attention): The Decoder's Query attends to the Encoder's Key and Value. This is the component most similar to the Attention in conventional Seq2Seq models.
6. Positional Encoding
6.1 Necessity
Self-Attention is inherently order-agnostic (permutation invariant). Even if the order of input tokens is shuffled, the Attention output values remain the same (only their order changes). Since word order carries crucial information in natural language, positional information must be explicitly injected.
6.2 Sinusoidal Positional Encoding
The paper proposed Positional Encoding using sine and cosine functions.

$$PE_{(pos,\ 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\ 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Here, $pos$ is the position within the sequence and $i$ is the dimension index. This Encoding is added element-wise to the input Embedding and passed to the model.
6.3 Why Sinusoidal?
There are clear reasons for choosing this function.
Relative Position Representation: For any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear transformation of $PE_{pos}$. This enables the model to easily learn relative positional relationships.
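Written out for a single frequency $\omega_i = 1/10000^{2i/d_{model}}$, this follows directly from the angle-addition identities:

$$\begin{pmatrix} \sin(\omega_i (pos + k)) \\ \cos(\omega_i (pos + k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \sin(\omega_i pos) \\ \cos(\omega_i pos) \end{pmatrix}$$

The rotation matrix depends only on the offset $k$, not on $pos$, which is what makes the relative relationship linear.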
Generalization without Learning: The encoding can naturally extend to longer sequences not seen during training. When compared with learnable Positional Embeddings, the paper reported that both approaches yielded "nearly identical results," and ultimately chose the Sinusoidal approach for its generalization capability.
Frequency Spectrum: Lower dimensions (smaller $i$) have shorter wavelengths for fine-grained position distinction, while higher dimensions have longer wavelengths for encoding broader positional relationships.
7. Full Encoder-Decoder Architecture
7.1 Encoder Structure
The Encoder consists of $N = 6$ identical layers. Each layer has two Sub-layers.
- Multi-Head Self-Attention
- Position-wise Feed-Forward Network
Residual Connection and Layer Normalization are applied to each Sub-layer.
7.2 Decoder Structure
The Decoder also consists of $N = 6$ identical layers, but unlike the Encoder, each layer has three Sub-layers.
- Masked Multi-Head Self-Attention: Masks future positions to maintain the auto-regressive property.
- Multi-Head Cross-Attention: Uses the Encoder's output as Key and Value.
- Position-wise Feed-Forward Network
7.3 Overall Flow
The input sequence passes through Embedding + Positional Encoding and enters the Encoder, and the Encoder output after 6 layers is passed to the Decoder's Cross-Attention. The Decoder takes previously generated tokens as input and outputs a probability distribution over the next token, repeating this process until the end-of-sequence token is produced. The output dimension of all Sub-layers is unified at $d_{model} = 512$.
8. Feed-Forward Network, Layer Normalization, Residual Connection
8.1 Position-wise Feed-Forward Network (FFN)
A Position-wise FFN follows each Attention Sub-layer. "Position-wise" means it is applied independently to each position (token) with shared weights.
$$\mathrm{FFN}(x) = \max(0,\ xW_1 + b_1)W_2 + b_2$$

This is a structure with a ReLU activation function sandwiched between two Linear Transformations. The input and output dimensions are $d_{model} = 512$, and the inner dimension is $d_{ff} = 2048$. In other words, it expands the representation by a factor of 4 and then contracts it back to the original size.
This FFN is equivalent to two 1x1 Convolutions and performs nonlinear transformations on each token, converting the relational information captured by Attention into richer representations.
8.2 Residual Connection
This is a Skip Connection that adds the input of each Sub-layer to its output.

$$\mathrm{output} = x + \mathrm{Sublayer}(x)$$

This design, borrowed from ResNet, stabilizes training by allowing gradients to flow smoothly through deep networks. For Residual Connections to work properly, the dimensions of the two tensors being added must be identical, which is why the output dimensions of all Sub-layers and Embeddings are unified at $d_{model} = 512$.
8.3 Layer Normalization
Layer Normalization is applied to the output of each Sub-layer. Unlike Batch Normalization, Layer Normalization normalizes across all Features within a single sample, making it independent of batch size.
$$\mathrm{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Here, $\mu$ and $\sigma^2$ are the mean and variance across all dimensions of the layer, and $\gamma$ and $\beta$ are learnable parameters. The paper used the Post-Norm approach, i.e., $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$.
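As a quick sanity check of the formula (a sketch; note that PyTorch's `nn.LayerNorm` uses the biased variance, and $\gamma$, $\beta$ start at 1 and 0):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 5, 512)   # (batch, seq_len, d_model)

ln = nn.LayerNorm(512)       # gamma initialized to 1, beta to 0
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)  # biased variance, as in LayerNorm
manual = (x - mu) / torch.sqrt(var + ln.eps)

print(torch.allclose(ln(x), manual, atol=1e-5))  # True
```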
9. Training Strategy
9.1 Optimizer and Learning Rate Schedule
The paper used the Adam Optimizer with a distinctive Learning Rate schedule. This schedule later became widely known as the "Noam Scheduler."
$$lrate = d_{model}^{-0.5} \cdot \min\!\left(step^{-0.5},\ step \cdot warmup\_steps^{-1.5}\right)$$

The key feature of this schedule is Warmup. During the first $warmup\_steps$ (4,000 steps in the paper), the Learning Rate increases linearly, and afterwards it decreases proportionally to the inverse square root of the step number.
Warmup is necessary because Adam's second moment estimates are unstable in the early stages of training. Keeping the Learning Rate low initially prevents parameters from changing drastically and allows training to begin in earnest once moment estimates have stabilized.
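The schedule fits in a few lines (the helper name `noam_lr` is ours; the constants are the paper's Base-model values):

```python
def noam_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Rises linearly during warmup, peaks at step = warmup_steps, then decays as 1/sqrt(step)
print(noam_lr(1_000), noam_lr(4_000), noam_lr(100_000))
```

The peak value at step 4,000 works out to roughly $7 \times 10^{-4}$ for $d_{model} = 512$.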
The Adam Optimizer hyperparameters were $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$. It is noteworthy that $\beta_2$ was set to 0.98, lower than the typical 0.999, which is interpreted as adapting to the rapid changes in Attention Score distributions.
9.2 Regularization
Residual Dropout: Dropout (rate = 0.1) is applied to the output of each Sub-layer before the Residual Connection. Dropout is also applied to the sum of Embedding + Positional Encoding in both the Encoder and Decoder.
Label Smoothing: Label Smoothing with $\epsilon_{ls} = 0.1$ was applied. This technique sets the target probability of the correct class to $1 - \epsilon_{ls}$ rather than 1, and distributes the remaining $\epsilon_{ls}$ over the other classes. The paper reported that Label Smoothing worsens Perplexity but improves Accuracy and BLEU Score. This is because it prevents the model from becoming overconfident, thereby improving generalization performance.
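A sketch of one common formulation (the paper specifies only $\epsilon_{ls} = 0.1$; exactly how the mass is spread over non-target classes varies by implementation, and the helper name is ours):

```python
def smoothed_targets(correct: int, num_classes: int, eps: float = 0.1) -> list[float]:
    # Correct class gets 1 - eps; the remaining eps is spread over the other classes
    off = eps / (num_classes - 1)
    return [1.0 - eps if c == correct else off for c in range(num_classes)]

targets = smoothed_targets(correct=2, num_classes=5)
print(targets)  # [0.025, 0.025, 0.9, 0.025, 0.025]
```

The smoothed distribution still sums to 1, but the model is no longer pushed toward assigning probability 1 to a single token.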
9.3 Training Data and Hardware
- WMT 2014 English-German: Approximately 4.5 million sentence pairs, using Byte-Pair Encoding (BPE) with a shared vocabulary of approximately 37,000 tokens
- WMT 2014 English-French: Approximately 36 million sentence pairs, using a 32,000 Word-piece vocabulary
- Batch: Containing approximately 25,000 Source tokens + 25,000 Target tokens
- Hardware: 8 NVIDIA P100 GPUs
- Training Time: Approximately 12 hours for the Base model (100K steps), approximately 3.5 days for the Big model (300K steps)
10. Key Experimental Results
10.1 Machine Translation Performance
| Model | EN-DE BLEU | EN-FR BLEU | Training Cost (FLOPs) |
|---|---|---|---|
| Transformer (Base) | 27.3 | 38.1 | $3.3 \times 10^{18}$ |
| Transformer (Big) | 28.4 | 41.8 | $2.3 \times 10^{19}$ |
| Previous SOTA (including Ensemble) | 26.36 | 41.29 | - |
The Transformer Big model surpassed the previous best performance on EN-DE by more than 2 BLEU and set a new SOTA on EN-FR as well. What is even more remarkable is that this performance was achieved at a fraction of the training cost of existing models.
10.2 Model Size Comparison
| Config | $N$ | $d_{model}$ | $d_{ff}$ | $h$ | $d_k$ | Parameters |
|---|---|---|---|---|---|---|
| Base | 6 | 512 | 2048 | 8 | 64 | 65M |
| Big | 6 | 1024 | 4096 | 16 | 64 | 213M |
10.3 Key Ablation Study Results
The paper's Ablation Study clearly demonstrates the importance of each design decision.
- Number of Attention Heads: $h = 1$ resulted in a 0.9 BLEU drop; too many heads (e.g., $h = 32$) caused performance degradation because $d_k$ became too small
- $d_k$ (Key Dimension): Reducing it led to quality degradation. It directly affects the representational capacity of Dot-Product Attention
- $d_{model}$ (Model Dimension): Performance consistently improved with larger values
- Dropout: Without it, overfitting occurred with significant performance drops
- Positional Encoding: Learnable and Sinusoidal approaches achieved nearly identical performance
10.4 English Constituency Parsing
To verify generalization ability beyond translation, the model was also applied to English Constituency Parsing. It achieved 91.3 F1 using only WSJ data and 92.7 F1 in a semi-supervised setting, showing competitive performance with task-specific models. This demonstrated that the Transformer is a general-purpose sequence model not limited to machine translation.
11. Impact on Subsequent Research
The Transformer architecture has become the foundation for virtually all major advances in modern AI.
11.1 BERT (2018, Google)
Using only the Encoder portion of the Transformer, BERT performed bidirectional pre-training. Through two pre-training tasks — Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) — it achieved SOTA on 11 NLP benchmarks. BERT established the Transfer Learning paradigm in NLP.
11.2 GPT Series (2018~, OpenAI)
Using only the Decoder portion of the Transformer, GPT performed Auto-regressive Language Modeling. Scaling up from GPT-1 (117M) to GPT-2 (1.5B) to GPT-3 (175B) demonstrated the power of Scaling Laws. GPT-3 showcased Few-shot Learning capabilities, opening new possibilities for AI and becoming the starting point for the Large Language Model (LLM) revolution that led to ChatGPT and GPT-4.
11.3 Beyond
- T5 (2019): Unified all NLP tasks into a Text-to-Text format, using the full Encoder-Decoder structure
- ViT (2020): Applied the Transformer to Computer Vision, dividing images into patches and processing them as sequences
- DALL-E, Stable Diffusion: Leveraged Transformers for image generation
- AlphaFold 2: Utilized the Attention mechanism for protein structure prediction
A single paper has transformed nearly every field of AI, from NLP to Computer Vision, biology, music, and robotics.
12. Core PyTorch Code Examples
12.1 Scaled Dot-Product Attention
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math


def scaled_dot_product_attention(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    mask: torch.Tensor | None = None,
    dropout: nn.Dropout | None = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Scaled Dot-Product Attention implementation.

    Args:
        query: (batch, h, seq_len, d_k)
        key: (batch, h, seq_len, d_k)
        value: (batch, h, seq_len, d_v)
        mask: Attention mask (optional); positions where mask == 0 are blocked

    Returns:
        output: (batch, h, seq_len, d_v)
        attention_weights: (batch, h, seq_len, seq_len)
    """
    d_k = query.size(-1)

    # Step 1 & 2: QK^T / sqrt(d_k)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

    # Masking (for Decoder Self-Attention, etc.)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # Step 3: Softmax
    attention_weights = F.softmax(scores, dim=-1)
    if dropout is not None:
        attention_weights = dropout(attention_weights)

    # Step 4: Weighted sum of values
    output = torch.matmul(attention_weights, value)
    return output, attention_weights
```
12.2 Multi-Head Attention
```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, h: int = 8, dropout: float = 0.1):
        super().__init__()
        assert d_model % h == 0, "d_model must be divisible by h"
        self.d_model = d_model
        self.h = h
        self.d_k = d_model // h

        # Linear projections for Q, K, V, and Output
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        mask: torch.Tensor | None = None,
    ) -> torch.Tensor:
        batch_size = query.size(0)

        # 1) Linear projection then reshape to (batch, h, seq_len, d_k)
        Q = self.W_q(query).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)

        # 2) Scaled Dot-Product Attention (all heads in parallel)
        attn_output, attn_weights = scaled_dot_product_attention(
            Q, K, V, mask=mask, dropout=self.dropout
        )

        # 3) Concatenate head results: (batch, seq_len, d_model)
        attn_output = (
            attn_output.transpose(1, 2)
            .contiguous()
            .view(batch_size, -1, self.d_model)
        )

        # 4) Final linear projection
        return self.W_o(attn_output)
```
12.3 Positional Encoding
```python
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int = 512, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

        # Create Positional Encoding matrix of size (max_len, d_model)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )  # (d_model/2,)
        pe[:, 0::2] = torch.sin(position * div_term)  # Even dimensions: sin
        pe[:, 1::2] = torch.cos(position * div_term)  # Odd dimensions: cos
        pe = pe.unsqueeze(0)  # (1, max_len, d_model) - add batch dimension
        self.register_buffer('pe', pe)  # Register as buffer, not a learnable parameter

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: (batch, seq_len, d_model) - Embedding output

        Returns:
            (batch, seq_len, d_model) - Result with Positional Encoding added
        """
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)
```
12.4 Transformer Encoder Layer
```python
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, h: int = 8, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        # Sub-layer 1: Multi-Head Self-Attention
        self.self_attn = MultiHeadAttention(d_model, h, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)

        # Sub-layer 2: Position-wise FFN
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: torch.Tensor | None = None) -> torch.Tensor:
        # Sub-layer 1: Self-Attention + Residual + LayerNorm (Post-Norm)
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))

        # Sub-layer 2: FFN + Residual + LayerNorm
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))
        return x
```
In the code above, `self.self_attn(x, x, x, mask)` passes the same input `x` for Q, K, and V, which is why it is called "Self"-Attention. For Cross-Attention, the Encoder output would be passed for K and V.
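For comparison, PyTorch ships a built-in equivalent of this encoder layer; a minimal sketch (note that `nn.TransformerEncoderLayer` defaults to Post-Norm, matching the paper, and `batch_first=True` puts the batch dimension first as in the code above):

```python
import torch
import torch.nn as nn

# PyTorch's built-in counterpart of the hand-written encoder layer
layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048, dropout=0.1, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=6)  # N = 6, as in the paper

x = torch.randn(2, 10, 512)  # (batch, seq_len, d_model)
out = encoder(x)
print(out.shape)             # shape is preserved: (batch, seq_len, d_model)
```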
13. Conclusion
"Attention Is All You Need" is not merely a paper proposing a single machine translation model. It is a paper that broke the long-standing inertia of Recurrence and empirically proved the bold claim that Attention alone is sufficient.
The core contributions of this paper can be summarized as follows.
- Elimination of Recurrence: Dramatically improved training speed with a parallelizable architecture
- Self-Attention: Directly models relationships between all tokens in a sequence with $O(1)$ path length
- Multi-Head Attention: Captures relationships simultaneously from multiple perspectives
- Scalability: A simple yet scalable architectural design that enables scaling up to billions and even trillions of parameters
Since its publication in 2017, the Transformer has spread from NLP to Vision, Audio, Biology, Robotics, and nearly every other domain of AI. True to the paper's title, Attention really was All You Need.
References
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017. https://arxiv.org/abs/1706.03762
- Full paper (HTML version): https://arxiv.org/html/1706.03762v7
- NeurIPS Official PDF: https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf
- Jay Alammar, The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
- Harvard NLP, The Annotated Transformer: http://nlp.seas.harvard.edu/2018/04/03/attention.html
- Devlin, J. et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers. https://arxiv.org/abs/1810.04805
- Radford, A. et al. (2018). Improving Language Understanding by Generative Pre-Training (GPT). https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
- UvA Deep Learning Tutorials - Transformers and Multi-Head Attention: https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html
- Wikipedia - Attention Is All You Need: https://en.wikipedia.org/wiki/Attention_Is_All_You_Need