
1. Paper Overview

"Attention Is All You Need" is a paper presented at NeurIPS 2017, co-authored by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin from Google Brain and Google Research. This paper demonstrated that a Sequence-to-Sequence model could be constructed using **only the Attention mechanism**, completely eliminating traditional Recurrence and Convolution — a true turning point in the history of deep learning.

The **Transformer** architecture proposed in the paper achieved 28.4 BLEU on the WMT 2014 English-to-German translation task and 41.8 BLEU on English-to-French, surpassing the best previously reported results on both tasks. More importantly, this architecture subsequently became the foundation for most major modern AI models, including BERT, GPT, T5, and ViT.

2. Background and Motivation: The Limitations of RNN/LSTM

2.1 The Bottleneck of Sequential Processing

Before the Transformer, the standard for Sequence Modeling was RNN (Recurrent Neural Network) and its variants LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit). These architectures process sequences in order $t = 1, 2, ..., n$, updating the hidden state $h_t$ at each step.

$$h_t = f(h_{t-1}, x_t)$$

This sequential nature gave rise to two fundamental problems.

**First, parallelization across time steps was impossible.** Since the computation at each time step depends on the result of the previous step, the positions within a sequence cannot be processed in parallel, so the parallel processing capabilities of GPUs could not be fully utilized. The number of sequential operations grows linearly with sequence length.

**Second, the Long-range Dependency problem.** Although LSTM was theoretically capable of learning long-term dependencies, in practice it became increasingly difficult to capture relationships between distant tokens as sequences grew longer. This is because all past information must be compressed into a fixed-size vector called the hidden state.

2.2 The Emergence and Limitations of Attention

The Attention mechanism proposed by Bahdanau et al. (2014) greatly alleviated the Long-range Dependency problem by allowing the Decoder to directly access all of the Encoder's hidden states. However, since Attention was still **added on top of** RNNs, the bottleneck of Sequential Processing remained.

The paper's core question was precisely this: **"Is Attention alone sufficient, without Recurrence?"**

The answer was Yes, and the result was the Transformer.

3. Self-Attention Mechanism

3.1 Core Concept: Query, Key, Value

The key idea behind Self-Attention is that each token in a sequence directly computes its relationship with every other token. To achieve this, each input vector is transformed into three roles.

- **Query (Q)**: "What information am I looking for?"

- **Key (K)**: "What is the identifier of the information I can provide?"

- **Value (V)**: "What is the actual information I convey?"

Given an input sequence $X \in \mathbb{R}^{n \times d_{model}}$, Q, K, and V are generated through learnable weight matrices.

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

where $W^Q, W^K \in \mathbb{R}^{d_{model} \times d_k}$ and $W^V \in \mathbb{R}^{d_{model} \times d_v}$.

3.2 Intuitive Understanding

An information retrieval analogy makes this easier to understand. Imagine searching for a book in a library: the Query is the search term "deep learning introductory book," the Key is each book's title or tag, and the Value is the actual content of the book. The essence of Self-Attention is retrieving more Value from books whose Key has higher similarity to the Query.

The decisive difference between Self-Attention and RNNs is that the path length between any two tokens in the sequence is always $O(1)$. For RNNs it is $O(n)$, and for CNNs it is $O(\log_k n)$ (dilated) or $O(n/k)$ (standard). This short path length is what enables effective learning of Long-range Dependencies.

4. Scaled Dot-Product Attention

4.1 Formula

The exact formula for the Attention function proposed in the paper is as follows.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Let us decompose this formula step by step.

**Step 1: Similarity Computation ($QK^T$)**

The Dot Product of Query and Key is computed. The result is an $n \times n$ Attention Score matrix. Each element $(i, j)$ represents the similarity between the Query of the $i$-th token and the Key of the $j$-th token.

**Step 2: Scaling ($\frac{1}{\sqrt{d_k}}$)**

As $d_k$ grows larger, the variance of the Dot Product values increases, causing the gradients of Softmax to become extremely small. Specifically, if each component of $q$ and $k$ is an independent random variable with mean 0 and variance 1, the variance of $q \cdot k$ is $d_k$. Dividing by $\sqrt{d_k}$ normalizes the variance to 1, allowing the Softmax to operate stably.

The paper also motivates this Scaling empirically, citing prior work: when $d_k$ is small, Additive Attention and Dot-Product Attention perform similarly, but when $d_k$ is large, Additive Attention outperforms Dot-Product Attention without Scaling.
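The variance argument in Step 2 can be checked numerically. The sketch below is only an illustration: the helper names, the sample count, and the seed are assumptions, not anything from the paper. It draws unit-Gaussian vectors and compares the variance of $q \cdot k$ before and after dividing by $\sqrt{d_k}$.

```python
import math
import random


def dot(q, k):
    """Plain dot product of two equal-length vectors."""
    return sum(q_i * k_i for q_i, k_i in zip(q, k))


def sample_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def dot_product_variances(d_k, num_samples=2000, seed=0):
    """Return (variance of q.k, variance of q.k / sqrt(d_k)) for unit-Gaussian q and k."""
    rng = random.Random(seed)
    raw = []
    for _ in range(num_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        raw.append(dot(q, k))
    scaled = [r / math.sqrt(d_k) for r in raw]
    return sample_variance(raw), sample_variance(scaled)


raw_var, scaled_var = dot_product_variances(d_k=64)
print(f"d_k=64: raw variance ~ {raw_var:.1f}, scaled variance ~ {scaled_var:.2f}")
```

With $d_k = 64$, the raw variance should land near 64 and the scaled variance near 1, up to sampling noise.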

**Step 3: Softmax**

Softmax is applied to the scaled scores to obtain Attention Weights. Since each row sums to 1, these serve as weights for a weighted average over the Values.

**Step 4: Weighted Sum with Values**

Finally, the matrix multiplication of Attention Weights and Values yields an output for each token that is a vector summing all tokens' Values proportionally to their relevance.
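The four steps can be traced on a toy example. This is a minimal single-head sketch in plain Python, assuming that Q, K, and V are given directly as small hand-picked matrices (no learned projections, no batching); the function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for one head, using lists of row vectors."""
    d_k = len(K[0])
    d_v = len(V[0])
    # Steps 1 and 2: similarity scores QK^T, scaled by 1/sqrt(d_k)
    scores = [
        [sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d_k) for k in K]
        for q in Q
    ]
    # Step 3: each row becomes a probability distribution over the keys
    weights = [softmax(row) for row in scores]
    # Step 4: each output row is a weighted average of the value vectors
    output = [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(d_v)]
        for w_row in weights
    ]
    return output, weights


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = attention(Q, K, V)
```

For the first query, keys 0 and 2 have the same dot product, so they receive equal weight, and both outweigh key 1.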

4.2 Masking

In the Decoder's Self-Attention, information from future tokens must be prevented from leaking to the current token. To achieve this, **Masked Attention** sets the scores at future positions to $-\infty$ before Softmax.

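$$\text{MaskedAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$

Here, $M$ is an $n \times n$ mask matrix whose entries strictly above the diagonal (future positions, $j > i$) are $-\infty$ and whose remaining entries are $0$, so the masked positions receive zero weight after Softmax.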
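A small sketch of the masking step, assuming the scores are already scaled and using an additive mask in plain Python (the helper names and the score values are illustrative):

```python
import math

NEG_INF = float("-inf")


def causal_mask(n):
    """n x n additive mask: 0 where key j <= query i, -inf where j > i."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]


def softmax(row):
    """Stable softmax; the row maximum is finite because the diagonal is never masked."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_weights(scores):
    """Add the causal mask to already-scaled scores, then apply softmax row by row."""
    mask = causal_mask(len(scores))
    return [
        softmax([s + m for s, m in zip(s_row, m_row)])
        for s_row, m_row in zip(scores, mask)
    ]


scores = [[0.5, 1.0, -0.3], [0.2, 0.1, 0.9], [1.5, -0.7, 0.4]]
weights = masked_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

Every weight above the diagonal comes out as exactly 0, so the first token can only attend to itself, the second to the first two tokens, and so on.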

5. Multi-Head Attention

5.1 Limitations of Single Attention

Using only a single Attention function forces the model to capture token relationships from only one perspective. For example, in the sentence "The cat sat on the mat because it was tired," it becomes difficult to simultaneously capture the syntactic relationship that "it" refers to "cat" and the semantic relationship that "tired" describes the state of "cat."

5.2 Multi-Head Attention Structure

The paper solved this problem by **running multiple Attention functions in parallel**.

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, ..., \text{head}_h) W^O$$

where each head is defined as follows.

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

The weight matrices for each head are $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, and the final output projection is $W^O \in \mathbb{R}^{hd_v \times d_{model}}$.

5.3 Paper's Configuration

The paper used $h = 8$ heads with $d_k = d_v = d_{model} / h = 64$. Because the total dimension is divided by the number of heads, the total computational cost of Multi-Head Attention is nearly identical to that of Single-Head Attention.
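As a rough check on the cost claim, the sketch below counts projection weights for the paper's 8-head setting and for a hypothetical single head of full width. It is an approximation only: biases are ignored, and weight count is used as a proxy for compute. The function name is illustrative.

```python
def attention_projection_params(d_model, h, d_k, d_v):
    """Weight count (biases ignored) of the Q, K, V and output projections."""
    per_head = d_model * d_k + d_model * d_k + d_model * d_v  # W_i^Q, W_i^K, W_i^V
    output = h * d_v * d_model                                 # W^O
    return h * per_head + output


multi_head = attention_projection_params(d_model=512, h=8, d_k=64, d_v=64)
single_head = attention_projection_params(d_model=512, h=1, d_k=512, d_v=512)
print(multi_head, single_head)  # prints: 1048576 1048576
```

Both settings use $4 \cdot d_{model}^2$ projection weights, because $h \cdot d_k = d_{model}$ in each case.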

According to the paper's Ablation Study, single-head attention was 0.9 BLEU worse than the best setting, and quality also dropped off with too many heads (e.g., $h = 32$, where $d_k$ shrinks to 16).

5.4 Three Usage Patterns

Multi-Head Attention is used in three places within the Transformer.

1. **Encoder Self-Attention**: Within the Encoder, each token of the input sequence attends to all other tokens. Q, K, and V are all generated from the output of the previous Encoder layer.

2. **Decoder Self-Attention (Masked)**: Masked Attention within the Decoder that can only reference tokens generated so far.

3. **Encoder-Decoder Attention (Cross-Attention)**: The Decoder's Query attends to the Encoder's Key and Value. This is the component most similar to the Attention in conventional Seq2Seq models.

6. Positional Encoding

6.1 Necessity

Self-Attention is inherently order-agnostic (strictly speaking, permutation equivariant). If the order of input tokens is shuffled, each token's Attention output stays the same and only the order of the outputs changes. Since word order carries crucial information in natural language, positional information must be explicitly injected.
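This property can be demonstrated directly. The sketch below is a simplification that assumes identity projections ($Q = K = V = X$) and no positional signal; it shows that shuffling the input rows only shuffles the output rows in the same way.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Self-attention with identity projections (Q = K = V = X)."""
    d = len(X[0])
    out = []
    for q in X:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X])
        out.append([sum(w_j * x[c] for w_j, x in zip(w, X)) for c in range(d)])
    return out


X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
perm = [2, 0, 1]  # new order of the tokens
X_shuffled = [X[p] for p in perm]

out = self_attention(X)
out_shuffled = self_attention(X_shuffled)

# Row i of the shuffled output should equal row perm[i] of the original output
max_diff = max(
    abs(a - b)
    for p, row in zip(perm, out_shuffled)
    for a, b in zip(out[p], row)
)
print(max_diff)
```

The difference is zero up to floating-point rounding, so without positional information the model cannot tell the two orderings apart.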

6.2 Sinusoidal Positional Encoding

The paper proposed Positional Encoding using sine and cosine functions.

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i / d_{model}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i / d_{model}}}\right)$$

Here, $pos$ is the position within the sequence and $i$ is the dimension index. This Encoding is added element-wise to the input Embedding and passed to the model.
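The two formulas translate almost line for line into code. This is a minimal sketch for a single position, assuming an even $d_{model}$; the function name is illustrative.

```python
import math


def positional_encoding(pos, d_model):
    """Sinusoidal encoding vector for one position (d_model assumed even)."""
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe[2 * i] = math.sin(angle)      # even dimension 2i
        pe[2 * i + 1] = math.cos(angle)  # odd dimension 2i + 1
    return pe


pe0 = positional_encoding(0, d_model=8)
pe1 = positional_encoding(1, d_model=8)

# Element-wise addition to a (toy) token embedding
embedding = [0.1] * 8
model_input = [e + p for e, p in zip(embedding, pe1)]
```

At position 0 every sine term is 0 and every cosine term is 1, so `pe0` alternates between 0.0 and 1.0.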

6.3 Why Sinusoidal?

There are clear reasons for choosing this function.

**Relative Position Representation**: For any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear transformation of $PE_{pos}$. This enables the model to easily learn relative positional relationships.

$$\begin{bmatrix} \sin(pos \cdot \omega + k \cdot \omega) \\ \cos(pos \cdot \omega + k \cdot \omega) \end{bmatrix} = \begin{bmatrix} \cos(k\omega) & \sin(k\omega) \\ -\sin(k\omega) & \cos(k\omega) \end{bmatrix} \begin{bmatrix} \sin(pos \cdot \omega) \\ \cos(pos \cdot \omega) \end{bmatrix}$$
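The matrix identity above can be verified numerically for one sine/cosine pair. The values of `pos`, `k`, and the dimension index `i` below are arbitrary choices for illustration.

```python
import math


def rotate(sin_cos_pair, k, omega):
    """Apply the 2x2 matrix from the identity to [sin(pos*omega), cos(pos*omega)]."""
    s, c = sin_cos_pair
    return [
        math.cos(k * omega) * s + math.sin(k * omega) * c,
        -math.sin(k * omega) * s + math.cos(k * omega) * c,
    ]


d_model, i = 512, 3
omega = 1.0 / (10000 ** (2 * i / d_model))  # frequency of dimension pair (2i, 2i+1)
pos, k = 17, 5

shifted = rotate([math.sin(pos * omega), math.cos(pos * omega)], k, omega)
direct = [math.sin((pos + k) * omega), math.cos((pos + k) * omega)]
error = max(abs(a - b) for a, b in zip(shifted, direct))
print(error)
```

The matrix depends only on the offset `k` and the frequency, not on `pos`, which is what makes the shift a fixed linear map.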

**Generalization without Learning**: The encoding is a fixed function of position, so it can be computed for sequences longer than any seen during training. When compared with learnable Positional Embeddings, the paper reported that both approaches yielded "nearly identical results," and chose the Sinusoidal approach because it may allow the model to extrapolate to longer sequences.

**Frequency Spectrum**: Lower dimensions (smaller $i$) have shorter wavelengths for fine-grained position distinction, while higher dimensions have longer wavelengths for encoding broader positional relationships.

7. Full Encoder-Decoder Architecture

7.1 Encoder Structure

The Encoder consists of $N = 6$ identical layers. Each layer has two Sub-layers.

1. **Multi-Head Self-Attention**

2. **Position-wise Feed-Forward Network**

Residual Connection and Layer Normalization are applied to each Sub-layer.

$$\text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))$$

7.2 Decoder Structure

The Decoder also consists of $N = 6$ identical layers, but unlike the Encoder, each layer has three Sub-layers.

1. **Masked Multi-Head Self-Attention**: Masks future positions to maintain the auto-regressive property.

2. **Multi-Head Cross-Attention**: Uses the Encoder's output as Key and Value.

3. **Position-wise Feed-Forward Network**

7.3 Overall Flow

The input sequence passes through Embedding + Positional Encoding and enters the Encoder, and the Encoder output after 6 layers is passed to the Decoder's Cross-Attention. The Decoder takes previously generated tokens as input and outputs a probability distribution over the next token, repeating this process until the end-of-sequence token is produced. The output dimension of all Sub-layers is unified at $d_{model} = 512$.

8. Feed-Forward Network, Layer Normalization, Residual Connection

8.1 Position-wise Feed-Forward Network (FFN)

A Position-wise FFN follows each Attention Sub-layer. "Position-wise" means it is applied independently to each position (token) with shared weights.

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

This is a structure with a ReLU activation function sandwiched between two Linear Transformations. The input and output dimensions are $d_{model} = 512$, and the inner dimension is $d_{ff} = 2048$. In other words, the FFN expands the representation by a factor of 4 and then projects it back to the original size (the reverse of a Bottleneck, which would narrow in the middle).

This FFN is equivalent to two 1x1 Convolutions and performs nonlinear transformations on each token, converting the relational information captured by Attention into richer representations.
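To make "position-wise" concrete, here is a toy sketch of the FFN formula applied to each position with the same weights. The sizes ($d_{model} = 2$, $d_{ff} = 4$) and all weight values are made up for illustration.

```python
def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position vector x."""
    hidden = [
        max(0.0, sum(x_i * W1[i][j] for i, x_i in enumerate(x)) + b1[j])
        for j in range(len(b1))
    ]
    return [
        sum(h_j * W2[j][k] for j, h_j in enumerate(hidden)) + b2[k]
        for k in range(len(b2))
    ]


# Toy sizes: d_model = 2, d_ff = 4 (the paper uses 512 and 2048)
W1 = [[1.0, -1.0, 0.5, 0.0], [0.0, 1.0, -0.5, 2.0]]
b1 = [0.0, 0.0, 0.0, -1.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
b2 = [0.1, -0.1]

sequence = [[1.0, 2.0], [3.0, -1.0], [1.0, 2.0]]
outputs = [ffn(x, W1, b1, W2, b2) for x in sequence]  # same weights at every position
```

Positions 0 and 2 hold the same input vector, so they produce the same output: the FFN never looks at neighbouring tokens.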

8.2 Residual Connection

This is a Skip Connection that adds the input of each Sub-layer to its output.

$$\text{output} = x + \text{Sublayer}(x)$$

This design, borrowed from ResNet, stabilizes training by allowing gradients to flow smoothly through deep networks. For Residual Connections to work properly, the dimensions of the two tensors being added must be identical, which is why the output dimensions of all Sub-layers and Embeddings are unified at $d_{model} = 512$.

8.3 Layer Normalization

Layer Normalization is applied to the output of each Sub-layer. Unlike Batch Normalization, Layer Normalization normalizes across the Features of a single token vector, making it independent of batch size.

$$\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Here, $\mu$ and $\sigma$ are the mean and standard deviation over the $d_{model}$ features of one token, $\epsilon$ is a small constant for numerical stability, and $\gamma$ and $\beta$ are learnable parameters. This is the form computed by PyTorch's `nn.LayerNorm`, which the code in section 12.4 uses. The paper used the **Post-Norm** approach (applying LN after the Sublayer output + Residual).
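A minimal sketch of Layer Normalization and the Post-Norm wiring for a single token vector. It assumes the $\sqrt{\sigma^2 + \epsilon}$ denominator and the default $\epsilon = 10^{-5}$ used by PyTorch's `nn.LayerNorm`; the toy sub-layer is a placeholder, not anything from the paper.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token's feature vector, then apply the learned scale and shift."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)  # biased variance over the features
    return [g * (v - mu) / math.sqrt(var + eps) + b for v, g, b in zip(x, gamma, beta)]


def post_norm_sublayer(x, sublayer):
    """Post-Norm wiring: LayerNorm(x + Sublayer(x)), here with gamma = 1 and beta = 0."""
    residual = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(residual, gamma=[1.0] * len(x), beta=[0.0] * len(x))


x = [1.0, 2.0, 3.0, 6.0]
y = post_norm_sublayer(x, sublayer=lambda v: [0.5 * t for t in v])  # toy sub-layer

mean_y = sum(y) / len(y)
var_y = sum((v - mean_y) ** 2 for v in y) / len(y)
print(mean_y, var_y)
```

With $\gamma = 1$ and $\beta = 0$, the output has mean close to 0 and variance close to 1, regardless of the scale of the input.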

9. Training Strategy

9.1 Optimizer and Learning Rate Schedule

The paper used the Adam Optimizer with a distinctive Learning Rate schedule. This schedule later became widely known as the "Noam Scheduler."

$$lr = d_{model}^{-0.5} \cdot \min(step^{-0.5}, \; step \cdot warmup\_steps^{-1.5})$$

The key feature of this schedule is **Warmup**. During the first $warmup\_steps$ (4,000 steps in the paper), the Learning Rate increases linearly, and afterwards it decreases proportionally to the inverse square root of the step number.
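The schedule is a one-line function. The sketch below uses the paper's base settings as defaults ($d_{model} = 512$, 4,000 warmup steps); the function name is illustrative.

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given step (step >= 1) under the schedule above."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (1, 1000, 4000, 16000, 100000):
    print(step, f"{noam_lr(step):.2e}")
```

The two arguments of `min` are equal at `step == warmup_steps`, which is where the rate peaks before the inverse-square-root decay takes over.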

The paper states the schedule without justifying the Warmup. A common explanation is that Adam's second moment estimates are unreliable in the early stages of training, so keeping the Learning Rate low initially prevents parameters from changing drastically until those estimates have stabilized.

The Adam Optimizer hyperparameters were $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$. It is noteworthy that $\beta_2$ was set to 0.98, lower than the common default of 0.999. The paper gives no reason for this; one interpretation is that a shorter averaging window adapts faster to rapidly changing gradient statistics.

9.2 Regularization

**Residual Dropout**: Dropout (rate = 0.1) is applied to the output of each Sub-layer before the Residual Connection. Dropout is also applied to the sum of Embedding + Positional Encoding in both the Encoder and Decoder.

**Label Smoothing**: Label Smoothing with $\epsilon_{ls} = 0.1$ was applied. This technique sets the target probability of the correct class to $1 - \epsilon_{ls}$ rather than 1, and the target probability of each other class to $\epsilon_{ls} / (K - 1)$, where $K$ is the number of classes (here, the vocabulary size). The paper reported that Label Smoothing worsens Perplexity, because the model learns to be more unsure, but improves Accuracy and BLEU Score. The usual explanation is that it keeps the model from becoming overconfident, which helps generalization.
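The smoothed target distribution described above can be built in a few lines. This sketch follows the $1 - \epsilon_{ls}$ / $\epsilon_{ls} / (K - 1)$ variant given in the text; other formulations spread the mass slightly differently, and the function name is illustrative.

```python
def smoothed_targets(correct_index, num_classes, eps=0.1):
    """Target distribution: 1 - eps on the correct class, eps / (K - 1) on each other class."""
    off_value = eps / (num_classes - 1)
    targets = [off_value] * num_classes
    targets[correct_index] = 1.0 - eps
    return targets


targets = smoothed_targets(correct_index=2, num_classes=5)
print(targets)
```

The entries still sum to 1, so the result is a valid distribution that can be used as the target of a cross-entropy loss.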

9.3 Training Data and Hardware

- **WMT 2014 English-German**: Approximately 4.5 million sentence pairs, using Byte-Pair Encoding (BPE) with a shared vocabulary of approximately 37,000 tokens

- **WMT 2014 English-French**: Approximately 36 million sentence pairs, using a 32,000 Word-piece vocabulary

- **Batch**: Containing approximately 25,000 Source tokens + 25,000 Target tokens

- **Hardware**: 8 NVIDIA P100 GPUs

- **Training Time**: Approximately 12 hours for the Base model (100K steps), approximately 3.5 days for the Big model (300K steps)

10. Key Experimental Results

10.1 Machine Translation Performance

| Model | EN-DE BLEU | EN-FR BLEU | Training Cost (FLOPs) |
| ---------------------------------- | ---------- | ---------- | --------------------- |
| Transformer (Base) | 27.3 | 38.1 | $3.3 \times 10^{18}$ |
| Transformer (Big) | **28.4** | **41.8** | $2.3 \times 10^{19}$ |
| Previous SOTA (including Ensemble) | 26.36 | 41.29 | - |

The Transformer Big model surpassed the previous best performance on EN-DE by **more than 2 BLEU** and set a new SOTA on EN-FR as well. What is even more remarkable is that this performance was achieved at a **fraction** of the training cost of existing models.

10.2 Model Size Comparison

| Config | $N$ | $d_{model}$ | $d_{ff}$ | $h$ | $d_k$ | Parameters |
| ------ | --- | ----------- | -------- | --- | ----- | ---------- |
| Base | 6 | 512 | 2048 | 8 | 64 | 65M |
| Big | 6 | 1024 | 4096 | 16 | 64 | 213M |

10.3 Key Ablation Study Results

The paper's Ablation Study clearly demonstrates the importance of each design decision.

- **Number of Attention Heads**: $h = 1$ was 0.9 BLEU worse than the best setting; $h = 16$ matched the base model, while $h = 32$ lowered BLEU again as $d_k$ became very small

- **$d_k$ (Key Dimension)**: Reducing it led to quality degradation. It directly affects the representational capacity of Dot-Product Attention

- **$d_{model}$ (Model Dimension)**: Performance consistently improved with larger values

- **Dropout**: Without it, overfitting occurred with significant performance drops

- **Positional Encoding**: Learnable and Sinusoidal approaches achieved nearly identical performance

10.4 English Constituency Parsing

To verify generalization ability beyond translation, the model was also applied to English Constituency Parsing. It achieved 91.3 F1 using only WSJ data and 92.7 F1 in a semi-supervised setting, showing competitive performance with task-specific models. This demonstrated that the Transformer is a general-purpose sequence model not limited to machine translation.

11. Impact on Subsequent Research

The Transformer architecture has become the foundation for virtually all major advances in modern AI.

11.1 BERT (2018, Google)

Using only the **Encoder** portion of the Transformer, BERT performed bidirectional pre-training. Through two pre-training tasks — Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) — it achieved SOTA on 11 NLP benchmarks. BERT established the Transfer Learning paradigm in NLP.

11.2 GPT Series (2018~, OpenAI)

Using only the **Decoder** portion of the Transformer, GPT performed Auto-regressive Language Modeling. Scaling up from GPT-1 (117M) to GPT-2 (1.5B) to GPT-3 (175B) demonstrated the power of Scaling Laws. GPT-3 showcased Few-shot Learning capabilities, opening new possibilities for AI and becoming the starting point for the Large Language Model (LLM) revolution that led to ChatGPT and GPT-4.

11.3 Beyond

- **T5 (2019)**: Unified all NLP tasks into a Text-to-Text format, using the full Encoder-Decoder structure

- **ViT (2020)**: Applied the Transformer to Computer Vision, dividing images into patches and processing them as sequences

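- **DALL-E, Stable Diffusion**: Image generation models built on Transformer components: DALL-E generates image tokens with an autoregressive Transformer, and Stable Diffusion combines a Transformer text encoder with Cross-Attention layers in its denoising network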

- **AlphaFold 2**: Utilized the Attention mechanism for protein structure prediction

A single paper has transformed nearly every field of AI, from NLP to Computer Vision, biology, music, and robotics.

12. Core PyTorch Code Examples

12.1 Scaled Dot-Product Attention

```python
import math
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


def scaled_dot_product_attention(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    mask: Optional[torch.Tensor] = None,
    dropout: Optional[nn.Dropout] = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Scaled Dot-Product Attention implementation.

    Args:
        query: (batch, h, seq_len, d_k)
        key: (batch, h, seq_len, d_k)
        value: (batch, h, seq_len, d_v)
        mask: Attention mask (optional), broadcastable to
            (batch, h, seq_len, seq_len); positions where mask == 0 are blocked
        dropout: Dropout module applied to the attention weights (optional)

    Returns:
        output: (batch, h, seq_len, d_v)
        attention_weights: (batch, h, seq_len, seq_len)
    """
    d_k = query.size(-1)

    # Step 1 & 2: QK^T / sqrt(d_k)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

    # Masking (for Decoder Self-Attention, etc.)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # Step 3: Softmax
    attention_weights = F.softmax(scores, dim=-1)
    if dropout is not None:
        attention_weights = dropout(attention_weights)

    # Step 4: Weighted sum of values
    output = torch.matmul(attention_weights, value)
    return output, attention_weights
```

12.2 Multi-Head Attention

```python
from typing import Optional

import torch
import torch.nn as nn

# Relies on scaled_dot_product_attention from section 12.1.


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, h: int = 8, dropout: float = 0.1):
        super().__init__()
        assert d_model % h == 0, "d_model must be divisible by h"
        self.d_model = d_model
        self.h = h
        self.d_k = d_model // h  # per-head dimension (d_k = d_v)

        # Linear projections for Q, K, V, and Output
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        batch_size = query.size(0)

        # 1) Linear projection then reshape to (batch, h, seq_len, d_k)
        Q = self.W_q(query).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)

        # 2) Scaled Dot-Product Attention (all heads in parallel);
        #    the mask must broadcast to (batch, h, seq_len, seq_len)
        attn_output, _ = scaled_dot_product_attention(
            Q, K, V, mask=mask, dropout=self.dropout
        )

        # 3) Concatenate head results: (batch, seq_len, d_model)
        attn_output = (
            attn_output.transpose(1, 2)
            .contiguous()
            .view(batch_size, -1, self.d_model)
        )

        # 4) Final linear projection
        return self.W_o(attn_output)
```

12.3 Positional Encoding

```python
import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int = 512, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

        # Create Positional Encoding matrix of size (max_len, d_model)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )  # (d_model/2,)

        pe[:, 0::2] = torch.sin(position * div_term)  # Even dimensions: sin
        pe[:, 1::2] = torch.cos(position * div_term)  # Odd dimensions: cos

        pe = pe.unsqueeze(0)  # (1, max_len, d_model) - add batch dimension
        self.register_buffer('pe', pe)  # Register as buffer, not a learnable parameter

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: (batch, seq_len, d_model) - Embedding output

        Returns:
            (batch, seq_len, d_model) - Result with Positional Encoding added
        """
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)
```

12.4 Transformer Encoder Layer

```python
from typing import Optional

import torch
import torch.nn as nn

# Relies on MultiHeadAttention from section 12.2.


class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, h: int = 8, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()

        # Sub-layer 1: Multi-Head Self-Attention
        self.self_attn = MultiHeadAttention(d_model, h, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)

        # Sub-layer 2: Position-wise FFN
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # Sub-layer 1: Self-Attention + Residual + LayerNorm
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))

        # Sub-layer 2: FFN + Residual + LayerNorm
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))
        return x
```

In the code above, `self.self_attn(x, x, x, mask)` passes the same input `x` for Q, K, and V, which is why it is called "Self"-Attention. For Cross-Attention, the Encoder output would be passed for K and V.

13. Conclusion

"Attention Is All You Need" is not merely a paper proposing a single machine translation model. It is a paper that broke the long-standing inertia of Recurrence and empirically proved the bold claim that Attention alone is sufficient.

The core contributions of this paper can be summarized as follows.

1. **Elimination of Recurrence**: Dramatically improved training speed with a parallelizable architecture

2. **Self-Attention**: Directly models relationships between all tokens in a sequence with $O(1)$ path length

3. **Multi-Head Attention**: Captures relationships simultaneously from multiple perspectives

4. **Scalability**: A simple yet scalable architectural design that enables scaling up to billions and even trillions of parameters

Since its publication in 2017, the Transformer has spread from NLP to Vision, Audio, Biology, Robotics, and nearly every other domain of AI. True to the paper's title, Attention really was All You Need.

References

- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). _Attention Is All You Need_. NeurIPS 2017. [https://arxiv.org/abs/1706.03762](https://arxiv.org/abs/1706.03762)

- Full paper (HTML version): [https://arxiv.org/html/1706.03762v7](https://arxiv.org/html/1706.03762v7)

- NeurIPS Official PDF: [https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf](https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf)

- Jay Alammar, _The Illustrated Transformer_: [https://jalammar.github.io/illustrated-transformer/](https://jalammar.github.io/illustrated-transformer/)

- Harvard NLP, _The Annotated Transformer_: [http://nlp.seas.harvard.edu/2018/04/03/attention.html](http://nlp.seas.harvard.edu/2018/04/03/attention.html)

- Devlin, J. et al. (2018). _BERT: Pre-training of Deep Bidirectional Transformers_. [https://arxiv.org/abs/1810.04805](https://arxiv.org/abs/1810.04805)

- Radford, A. et al. (2018). _Improving Language Understanding by Generative Pre-Training_ (GPT). [https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)

- UvA Deep Learning Tutorials - Transformers and Multi-Head Attention: [https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html](https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html)

- Wikipedia - _Attention Is All You Need_: [https://en.wikipedia.org/wiki/Attention_Is_All_You_Need](https://en.wikipedia.org/wiki/Attention_Is_All_You_Need)