- Authors
- Name
- 1. Paper Overview
- 2. Background and Motivation: The Limitations of RNN/LSTM
- 3. Self-Attention Mechanism
- 4. Scaled Dot-Product Attention
- 5. Multi-Head Attention
- 6. Positional Encoding
- 7. Full Encoder-Decoder Architecture
- 8. Feed-Forward Network, Layer Normalization, Residual Connection
- 9. Training Strategy
- 10. Key Experimental Results
- 11. Impact on Subsequent Research
- 12. Core PyTorch Code Examples
- 13. Conclusion
- References
1. Paper Overview
"Attention Is All You Need" is a paper presented at NeurIPS 2017, co-authored by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin from Google Brain and Google Research. This paper demonstrated that a Sequence-to-Sequence model could be constructed using only the Attention mechanism, completely eliminating traditional Recurrence and Convolution — a true turning point in the history of deep learning.
The Transformer architecture proposed in the paper achieved 28.4 BLEU on the WMT 2014 English-to-German translation task and 41.8 BLEU on English-to-French, surpassing all existing models. More importantly, this architecture subsequently became the foundation for virtually all major modern AI models, including BERT, GPT, T5, and ViT.
2. Background and Motivation: The Limitations of RNN/LSTM
2.1 The Bottleneck of Sequential Processing
Before the Transformer, the standard for Sequence Modeling was the RNN (Recurrent Neural Network) and its variants, LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit). These architectures process sequences in order, updating the hidden state at each step.
This sequential nature gave rise to two fundamental problems.
First, parallelization was impossible. Since the computation at each time step depends on the result of the previous step, the parallel processing capabilities of GPUs could not be effectively utilized. Training time increased linearly as sequence length grew.
Second, there was the Long-range Dependency problem. Although LSTM was theoretically capable of learning long-term dependencies, in practice it became increasingly difficult to capture relationships between distant tokens as sequences grew longer, because all past information must be compressed into a fixed-size vector called the hidden state.
2.2 The Emergence and Limitations of Attention
The Attention mechanism proposed by Bahdanau et al. (2014) greatly alleviated the Long-range Dependency problem by allowing the Decoder to directly access all of the Encoder's hidden states. However, since Attention was still added on top of RNNs, the bottleneck of Sequential Processing remained.
The paper's core question was precisely this: "Is Attention alone sufficient, without Recurrence?"
The answer was Yes, and the result was the Transformer.
3. Self-Attention Mechanism
3.1 Core Concept: Query, Key, Value
The key idea behind Self-Attention is that each token in a sequence directly computes its relationship with every other token. To achieve this, each input vector is transformed into three roles.
- Query (Q): "What information am I looking for?"
- Key (K): "What is the identifier of the information I can provide?"
- Value (V): "What is the actual information I convey?"
Given an input sequence $X \in \mathbb{R}^{n \times d_{model}}$, Q, K, and V are generated through learnable weight matrices:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

where $W^Q, W^K \in \mathbb{R}^{d_{model} \times d_k}$ and $W^V \in \mathbb{R}^{d_{model} \times d_v}$.
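These projections can be sketched in a few lines of PyTorch (the sizes and random weights here are purely illustrative; in a real model the `W` matrices are learned parameters):

```python
import torch

torch.manual_seed(0)
n, d_model, d_k, d_v = 4, 512, 64, 64   # illustrative sizes

X = torch.randn(n, d_model)             # input sequence of n token vectors
W_q = torch.randn(d_model, d_k)         # learnable in a real model
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_v)

Q, K, V = X @ W_q, X @ W_k, X @ W_v
print(Q.shape, K.shape, V.shape)        # each token now has a Query, Key, and Value
```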
3.2 Intuitive Understanding
An information retrieval analogy makes this easier to understand. Imagine searching for a book in a library: the Query is the search term "deep learning introductory book," the Key is each book's title or tag, and the Value is the actual content of the book. The essence of Self-Attention is retrieving more Value from books whose Key has higher similarity to the Query.
The decisive difference between Self-Attention and RNNs is that the path length between any two tokens in the sequence is always $O(1)$. For RNNs it is $O(n)$, and for CNNs it is $O(\log_k n)$ (dilated) or $O(n/k)$ (standard, with kernel size $k$). This short path length is what enables effective learning of Long-range Dependencies.
4. Scaled Dot-Product Attention
4.1 Formula
The exact formula for the Attention function proposed in the paper is as follows.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Let us decompose this formula step by step.
Step 1: Similarity Computation ($QK^\top$)
The Dot Product of every Query with every Key is computed. The result is an $n \times n$ Attention Score matrix whose $(i, j)$ element represents the similarity between the Query of the $i$-th token and the Key of the $j$-th token.
Step 2: Scaling ($1/\sqrt{d_k}$)
As $d_k$ grows larger, the variance of the Dot Product values increases, pushing the Softmax into regions where its gradients become extremely small. Specifically, if each component of $q$ and $k$ is an independent random variable with mean 0 and variance 1, the variance of $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ is $d_k$. Dividing by $\sqrt{d_k}$ normalizes the variance back to 1, allowing the Softmax to operate stably.
The paper also confirmed the importance of this Scaling experimentally: when $d_k$ was small, Additive Attention and Dot-Product Attention performed similarly, but when $d_k$ was large, Dot-Product Attention without Scaling degraded significantly.
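The variance argument above is easy to check numerically (sample sizes here are arbitrary; components are drawn i.i.d. from $N(0, 1)$):

```python
import torch

torch.manual_seed(0)
for d_k in (16, 64, 256):
    q = torch.randn(100_000, d_k)       # components ~ N(0, 1)
    k = torch.randn(100_000, d_k)
    dots = (q * k).sum(dim=-1)          # raw dot products: variance grows with d_k
    scaled = dots / d_k ** 0.5          # after scaling: variance stays near 1
    print(f"d_k={d_k:4d}  var(q.k)={dots.var():8.1f}  var(scaled)={scaled.var():.3f}")
```

The raw dot-product variance tracks $d_k$ almost exactly, while the scaled variance stays near 1 regardless of $d_k$.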
Step 3: Softmax
Softmax is applied to the scaled scores to obtain Attention Weights. Since each row sums to 1, these serve as weights for a weighted average over the Values.
Step 4: Weighted Sum with Values
Finally, multiplying the Attention Weights by the Values yields, for each token, an output vector that sums all tokens' Values weighted by their relevance.
4.2 Masking
In the Decoder's Self-Attention, information from future tokens must be prevented from leaking to the current token. To achieve this, Masked Attention sets the scores at future positions to $-\infty$ before Softmax:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$

Here, $M$ is a mask matrix with $-\infty$ at disallowed (strictly upper-triangular) positions and $0$ at allowed positions.
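A minimal sketch of building and applying such a causal mask (using the same `mask == 0` convention as the code in Section 12; the dummy scores are all zeros, so each unmasked row becomes a uniform distribution):

```python
import torch

seq_len = 4
# mask[i, j] = 1 if position j is visible from position i (j <= i), else 0
mask = torch.tril(torch.ones(seq_len, seq_len))
scores = torch.zeros(seq_len, seq_len)            # dummy pre-softmax scores
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(scores, dim=-1)
print(weights)
# Row i attends uniformly over positions 0..i; future positions get exactly 0
```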
5. Multi-Head Attention
5.1 Limitations of Single Attention
Using only a single Attention function forces the model to capture token relationships from only one perspective. For example, in the sentence "The cat sat on the mat because it was tired," it becomes difficult to simultaneously capture the syntactic relationship that "it" refers to "cat" and the semantic relationship that "tired" describes the state of "cat."
5.2 Multi-Head Attention Structure
The paper solved this problem by running multiple Attention functions in parallel.
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O$$

where each head is defined as follows.

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\ KW_i^K,\ VW_i^V)$$

The weight matrices for each head are $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, and the final output projection is $W^O \in \mathbb{R}^{hd_v \times d_{model}}$.
5.3 Paper's Configuration
The paper used $h = 8$ heads with $d_k = d_v = d_{model}/h = 64$. Because the total dimension is divided among the heads, the total computational cost of Multi-Head Attention is nearly identical to that of Single-Head Attention with full dimensionality.
According to the paper's Ablation Study, BLEU dropped by 0.9 points when only 1 head was used, and when there were too many heads (e.g., $h = 32$), $d_k$ became too small, actually hurting performance.
5.4 Three Usage Patterns
Multi-Head Attention is used in three places within the Transformer.
- Encoder Self-Attention: Within the Encoder, each token of the input sequence attends to all other tokens. Q, K, and V are all generated from the output of the previous Encoder layer.
- Decoder Self-Attention (Masked): Masked Attention within the Decoder that can only reference tokens generated so far.
- Encoder-Decoder Attention (Cross-Attention): The Decoder's Query attends to the Encoder's Key and Value. This is the component most similar to the Attention in conventional Seq2Seq models.
6. Positional Encoding
6.1 Necessity
Self-Attention is inherently order-agnostic (permutation invariant). Even if the order of input tokens is shuffled, the Attention output values remain the same (only their order changes). Since word order carries crucial information in natural language, positional information must be explicitly injected.
6.2 Sinusoidal Positional Encoding
The paper proposed Positional Encoding using sine and cosine functions.

$$PE_{(pos,\ 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\ 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Here, $pos$ is the position within the sequence and $i$ is the dimension index. This Encoding is added element-wise to the input Embedding and passed to the model.
6.3 Why Sinusoidal?
There are clear reasons for choosing this function.
Relative Position Representation: For any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear transformation of $PE_{pos}$. This enables the model to easily learn relative positional relationships.
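Written out for a single frequency $\omega_i = 1/10000^{2i/d_{model}}$, this follows directly from the angle-addition identities:

$$\begin{pmatrix} \sin(\omega_i (pos + k)) \\ \cos(\omega_i (pos + k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \sin(\omega_i pos) \\ \cos(\omega_i pos) \end{pmatrix}$$

The rotation matrix depends only on the offset $k$, not on $pos$, which is what makes the relative relationship linear.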
Generalization without Learning: The encoding can naturally extend to longer sequences not seen during training. When compared with learnable Positional Embeddings, the paper reported that both approaches yielded "nearly identical results," and ultimately chose the Sinusoidal approach for its generalization capability.
Frequency Spectrum: Lower dimensions (smaller $i$) have shorter wavelengths for fine-grained position distinction, while higher dimensions have longer wavelengths for encoding broader positional relationships.
7. Full Encoder-Decoder Architecture
7.1 Encoder Structure
The Encoder consists of $N = 6$ identical layers. Each layer has two Sub-layers.
- Multi-Head Self-Attention
- Position-wise Feed-Forward Network
Residual Connection and Layer Normalization are applied to each Sub-layer.
7.2 Decoder Structure
The Decoder also consists of $N = 6$ identical layers, but unlike the Encoder, each layer has three Sub-layers.
- Masked Multi-Head Self-Attention: Masks future positions to maintain the auto-regressive property.
- Multi-Head Cross-Attention: Uses the Encoder's output as Key and Value.
- Position-wise Feed-Forward Network
7.3 Overall Flow
The input sequence passes through Embedding + Positional Encoding and enters the Encoder, and the Encoder output after 6 layers is passed to the Decoder's Cross-Attention. The Decoder takes previously generated tokens as input and outputs a probability distribution over the next token, repeating this process until the end-of-sequence token is produced. The output dimension of all Sub-layers is unified at $d_{model} = 512$.
8. Feed-Forward Network, Layer Normalization, Residual Connection
8.1 Position-wise Feed-Forward Network (FFN)
A Position-wise FFN follows each Attention Sub-layer. "Position-wise" means it is applied independently to each position (token) with shared weights.
$$\mathrm{FFN}(x) = \max(0,\ xW_1 + b_1)W_2 + b_2$$

This is a structure with a ReLU activation function sandwiched between two Linear Transformations. The input and output dimensions are $d_{model} = 512$, and the inner dimension is $d_{ff} = 2048$. In other words, it expands the representation by a factor of 4 and then contracts it back to the original size.
This FFN is equivalent to two 1x1 Convolutions and performs nonlinear transformations on each token, converting the relational information captured by Attention into richer representations.
8.2 Residual Connection
This is a Skip Connection that adds the input of each Sub-layer to its output.

$$\mathrm{output} = x + \mathrm{Sublayer}(x)$$

This design, borrowed from ResNet, stabilizes training by allowing gradients to flow smoothly through deep networks. For Residual Connections to work properly, the dimensions of the two tensors being added must be identical, which is why the output dimensions of all Sub-layers and Embeddings are unified at $d_{model} = 512$.
8.3 Layer Normalization
Layer Normalization is applied to the output of each Sub-layer. Unlike Batch Normalization, Layer Normalization normalizes across all Features within a single sample, making it independent of batch size.
$$\mathrm{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Here, $\mu$ and $\sigma^2$ are the mean and variance across all dimensions of the layer, and $\gamma$ and $\beta$ are learnable parameters. The paper used the Post-Norm approach, i.e., $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$.
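As a quick sanity check of the formula (a sketch; note that PyTorch's `nn.LayerNorm` uses the biased variance, and $\gamma$, $\beta$ start at 1 and 0):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 5, 512)   # (batch, seq_len, d_model)

ln = nn.LayerNorm(512)       # gamma initialized to 1, beta to 0
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)  # biased variance, as in LayerNorm
manual = (x - mu) / torch.sqrt(var + ln.eps)

print(torch.allclose(ln(x), manual, atol=1e-5))  # True
```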
9. Training Strategy
9.1 Optimizer and Learning Rate Schedule
The paper used the Adam Optimizer with a distinctive Learning Rate schedule. This schedule later became widely known as the "Noam Scheduler."
$$lrate = d_{model}^{-0.5} \cdot \min\!\left(step^{-0.5},\ step \cdot warmup\_steps^{-1.5}\right)$$

The key feature of this schedule is Warmup. During the first $warmup\_steps$ (4,000 steps in the paper), the Learning Rate increases linearly, and afterwards it decreases proportionally to the inverse square root of the step number.
Warmup is necessary because Adam's second moment estimates are unstable in the early stages of training. Keeping the Learning Rate low initially prevents parameters from changing drastically and allows training to begin in earnest once moment estimates have stabilized.
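The schedule fits in a few lines (the helper name `noam_lr` is ours; the constants are the paper's Base-model values):

```python
def noam_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Rises linearly during warmup, peaks at step = warmup_steps, then decays as 1/sqrt(step)
print(noam_lr(1_000), noam_lr(4_000), noam_lr(100_000))
```

The peak value at step 4,000 works out to roughly $7 \times 10^{-4}$ for $d_{model} = 512$.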
The Adam Optimizer hyperparameters were $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$. It is noteworthy that $\beta_2$ was set to 0.98, lower than the typical 0.999, which is interpreted as adapting to the rapid changes in Attention Score distributions.
9.2 Regularization
Residual Dropout: Dropout (rate = 0.1) is applied to the output of each Sub-layer before the Residual Connection. Dropout is also applied to the sum of Embedding + Positional Encoding in both the Encoder and Decoder.
Label Smoothing: Label Smoothing with $\epsilon_{ls} = 0.1$ was applied. This technique sets the target probability of the correct class to $1 - \epsilon_{ls}$ rather than 1, and distributes the remaining $\epsilon_{ls}$ over the other classes. The paper reported that Label Smoothing worsens Perplexity but improves Accuracy and BLEU Score. This is because it prevents the model from becoming overconfident, thereby improving generalization performance.
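A sketch of one common formulation (the paper specifies only $\epsilon_{ls} = 0.1$; exactly how the mass is spread over non-target classes varies by implementation, and the helper name is ours):

```python
def smoothed_targets(correct: int, num_classes: int, eps: float = 0.1) -> list[float]:
    # Correct class gets 1 - eps; the remaining eps is spread over the other classes
    off = eps / (num_classes - 1)
    return [1.0 - eps if c == correct else off for c in range(num_classes)]

targets = smoothed_targets(correct=2, num_classes=5)
print(targets)  # [0.025, 0.025, 0.9, 0.025, 0.025]
```

The smoothed distribution still sums to 1, but the model is no longer pushed toward assigning probability 1 to a single token.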
9.3 Training Data and Hardware
- WMT 2014 English-German: Approximately 4.5 million sentence pairs, using Byte-Pair Encoding (BPE) with a shared vocabulary of approximately 37,000 tokens
- WMT 2014 English-French: Approximately 36 million sentence pairs, using a 32,000 Word-piece vocabulary
- Batch: Containing approximately 25,000 Source tokens + 25,000 Target tokens
- Hardware: 8 NVIDIA P100 GPUs
- Training Time: Approximately 12 hours for the Base model (100K steps), approximately 3.5 days for the Big model (300K steps)
10. Key Experimental Results
10.1 Machine Translation Performance
| Model | EN-DE BLEU | EN-FR BLEU | Training Cost (FLOPs) |
|---|---|---|---|
| Transformer (Base) | 27.3 | 38.1 | $3.3 \times 10^{18}$ |
| Transformer (Big) | 28.4 | 41.8 | $2.3 \times 10^{19}$ |
| Previous SOTA (including Ensemble) | 26.36 | 41.29 | - |
The Transformer Big model surpassed the previous best performance on EN-DE by more than 2 BLEU and set a new SOTA on EN-FR as well. What is even more remarkable is that this performance was achieved at a fraction of the training cost of existing models.
10.2 Model Size Comparison
| Config | $N$ | $d_{model}$ | $d_{ff}$ | $h$ | $d_k$ | Parameters |
|---|---|---|---|---|---|---|
| Base | 6 | 512 | 2048 | 8 | 64 | 65M |
| Big | 6 | 1024 | 4096 | 16 | 64 | 213M |
10.3 Key Ablation Study Results
The paper's Ablation Study clearly demonstrates the importance of each design decision.
- Number of Attention Heads: $h = 1$ resulted in a 0.9 BLEU drop; too many heads (e.g., $h = 32$) caused performance degradation because $d_k$ became too small
- $d_k$ (Key Dimension): Reducing it led to quality degradation. It directly affects the representational capacity of Dot-Product Attention
- $d_{model}$ (Model Dimension): Performance consistently improved with larger values
- Dropout: Without it, overfitting occurred with significant performance drops
- Positional Encoding: Learnable and Sinusoidal approaches achieved nearly identical performance
10.4 English Constituency Parsing
To verify generalization ability beyond translation, the model was also applied to English Constituency Parsing. It achieved 91.3 F1 using only WSJ data and 92.7 F1 in a semi-supervised setting, showing competitive performance with task-specific models. This demonstrated that the Transformer is a general-purpose sequence model not limited to machine translation.
11. Impact on Subsequent Research
The Transformer architecture has become the foundation for virtually all major advances in modern AI.
11.1 BERT (2018, Google)
Using only the Encoder portion of the Transformer, BERT performed bidirectional pre-training. Through two pre-training tasks — Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) — it achieved SOTA on 11 NLP benchmarks. BERT established the Transfer Learning paradigm in NLP.
11.2 GPT Series (2018~, OpenAI)
Using only the Decoder portion of the Transformer, GPT performed Auto-regressive Language Modeling. Scaling up from GPT-1 (117M) to GPT-2 (1.5B) to GPT-3 (175B) demonstrated the power of Scaling Laws. GPT-3 showcased Few-shot Learning capabilities, opening new possibilities for AI and becoming the starting point for the Large Language Model (LLM) revolution that led to ChatGPT and GPT-4.
11.3 Beyond
- T5 (2019): Unified all NLP tasks into a Text-to-Text format, using the full Encoder-Decoder structure
- ViT (2020): Applied the Transformer to Computer Vision, dividing images into patches and processing them as sequences
- DALL-E, Stable Diffusion: Leveraged Transformers for image generation
- AlphaFold 2: Utilized the Attention mechanism for protein structure prediction
A single paper has transformed nearly every field of AI, from NLP to Computer Vision, biology, music, and robotics.
12. Core PyTorch Code Examples
12.1 Scaled Dot-Product Attention
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math


def scaled_dot_product_attention(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    mask: torch.Tensor | None = None,
    dropout: nn.Dropout | None = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Scaled Dot-Product Attention implementation.

    Args:
        query: (batch, h, seq_len, d_k)
        key: (batch, h, seq_len, d_k)
        value: (batch, h, seq_len, d_v)
        mask: Attention mask (optional); positions where mask == 0 are blocked

    Returns:
        output: (batch, h, seq_len, d_v)
        attention_weights: (batch, h, seq_len, seq_len)
    """
    d_k = query.size(-1)

    # Step 1 & 2: QK^T / sqrt(d_k)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

    # Masking (for Decoder Self-Attention, etc.)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # Step 3: Softmax
    attention_weights = F.softmax(scores, dim=-1)
    if dropout is not None:
        attention_weights = dropout(attention_weights)

    # Step 4: Weighted sum of values
    output = torch.matmul(attention_weights, value)
    return output, attention_weights
```
12.2 Multi-Head Attention
```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, h: int = 8, dropout: float = 0.1):
        super().__init__()
        assert d_model % h == 0, "d_model must be divisible by h"
        self.d_model = d_model
        self.h = h
        self.d_k = d_model // h

        # Linear projections for Q, K, V, and Output
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        mask: torch.Tensor | None = None,
    ) -> torch.Tensor:
        batch_size = query.size(0)

        # 1) Linear projection then reshape to (batch, h, seq_len, d_k)
        Q = self.W_q(query).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)

        # 2) Scaled Dot-Product Attention (all heads in parallel)
        attn_output, attn_weights = scaled_dot_product_attention(
            Q, K, V, mask=mask, dropout=self.dropout
        )

        # 3) Concatenate head results: (batch, seq_len, d_model)
        attn_output = (
            attn_output.transpose(1, 2)
            .contiguous()
            .view(batch_size, -1, self.d_model)
        )

        # 4) Final linear projection
        return self.W_o(attn_output)
```
12.3 Positional Encoding
```python
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int = 512, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

        # Create Positional Encoding matrix of size (max_len, d_model)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )  # (d_model/2,)
        pe[:, 0::2] = torch.sin(position * div_term)  # Even dimensions: sin
        pe[:, 1::2] = torch.cos(position * div_term)  # Odd dimensions: cos
        pe = pe.unsqueeze(0)  # (1, max_len, d_model) - add batch dimension
        self.register_buffer('pe', pe)  # Register as buffer, not a learnable parameter

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: (batch, seq_len, d_model) - Embedding output

        Returns:
            (batch, seq_len, d_model) - Result with Positional Encoding added
        """
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)
```
12.4 Transformer Encoder Layer
```python
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, h: int = 8, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        # Sub-layer 1: Multi-Head Self-Attention
        self.self_attn = MultiHeadAttention(d_model, h, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)

        # Sub-layer 2: Position-wise FFN
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: torch.Tensor | None = None) -> torch.Tensor:
        # Sub-layer 1: Self-Attention + Residual + LayerNorm (Post-Norm)
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))

        # Sub-layer 2: FFN + Residual + LayerNorm
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))
        return x
```
In the code above, `self.self_attn(x, x, x, mask)` passes the same input `x` for Q, K, and V, which is why it is called "Self"-Attention. For Cross-Attention, the Encoder output would be passed for K and V.
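For comparison, PyTorch ships a built-in equivalent of this encoder layer; a minimal sketch (note that `nn.TransformerEncoderLayer` defaults to Post-Norm, matching the paper, and `batch_first=True` puts the batch dimension first as in the code above):

```python
import torch
import torch.nn as nn

# PyTorch's built-in counterpart of the hand-written encoder layer
layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048, dropout=0.1, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=6)  # N = 6, as in the paper

x = torch.randn(2, 10, 512)  # (batch, seq_len, d_model)
out = encoder(x)
print(out.shape)             # shape is preserved: (batch, seq_len, d_model)
```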
13. Conclusion
"Attention Is All You Need" is not merely a paper proposing a single machine translation model. It is a paper that broke the long-standing inertia of Recurrence and empirically proved the bold claim that Attention alone is sufficient.
The core contributions of this paper can be summarized as follows.
- Elimination of Recurrence: Dramatically improved training speed with a parallelizable architecture
- Self-Attention: Directly models relationships between all tokens in a sequence with $O(1)$ path length
- Multi-Head Attention: Captures relationships simultaneously from multiple perspectives
- Scalability: A simple yet scalable architectural design that enables scaling up to billions and even trillions of parameters
Since its publication in 2017, the Transformer has spread from NLP to Vision, Audio, Biology, Robotics, and nearly every other domain of AI. True to the paper's title, Attention really was All You Need.
References
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017. https://arxiv.org/abs/1706.03762
- Full paper (HTML version): https://arxiv.org/html/1706.03762v7
- NeurIPS Official PDF: https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf
- Jay Alammar, The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
- Harvard NLP, The Annotated Transformer: http://nlp.seas.harvard.edu/2018/04/03/attention.html
- Devlin, J. et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers. https://arxiv.org/abs/1810.04805
- Radford, A. et al. (2018). Improving Language Understanding by Generative Pre-Training (GPT). https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
- UvA Deep Learning Tutorials - Transformers and Multi-Head Attention: https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html
- Wikipedia - Attention Is All You Need: https://en.wikipedia.org/wiki/Attention_Is_All_You_Need