Attention Is All You Need - A Complete Analysis of the Transformer Paper

1. Paper Overview
2. Background and Motivation: The Limitations of RNN/LSTM
- 2.1 The Bottleneck of Sequential Processing
- 2.2 The Emergence and Limitations of Attention
3. Self-Attention Mechanism
- 3.1 Core Concept: Query, Key, Value
- 3.2 Intuitive Understanding
4. Scaled Dot-Product Attention
- 4.1 Formula
- 4.2 Masking
5. Multi-Head Attention
6. Positional Encoding
7. Full Encoder-Decoder Architecture
8. Feed-Forward Network, Layer Normalization, Residual Connection
9. Training Strategy
10. Key Experimental Results
11. Impact on Subsequent Research
12. Core PyTorch Code Examples
13. Conclusion
References

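1. Paper Overview

"Attention Is All You Need" was presented at NeurIPS 2017 (then called NIPS). It was co-authored by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, most of whom were at Google Brain or Google Research. The paper showed that a Sequence-to-Sequence model can be built from the Attention mechanism alone, with no Recurrence or Convolution, and it became a turning point in the history of deep learning.

The Transformer architecture proposed in the paper achieved 28.4 BLEU on the WMT 2014 English-to-German translation task and 41.8 BLEU on English-to-French, surpassing all existing models. More importantly, this architecture subsequently became the foundation for virtually all major modern AI models, including BERT, GPT, T5, and ViT.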

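2. Background and Motivation: The Limitations of RNN/LSTM

2.1 The Bottleneck of Sequential Processing

Before the Transformer, the standard approach to Sequence Modeling was the RNN (Recurrent Neural Network) and its variants, LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit). These architectures process a sequence in order, $t = 1, 2, ..., n$, updating a hidden state $h_t$ at each step.

$$h_t = f(h_{t-1}, x_t)$$

This sequential nature gave rise to two fundamental problems.

First, the computation cannot be parallelized across time steps. Because each step depends on the result of the previous one, a sequence of length $n$ requires $n$ sequential operations, so the parallel processing power of GPUs is poorly used and training time grows with sequence length.

Second, the Long-range Dependency problem. Although LSTM is in principle capable of learning long-term dependencies, in practice it becomes harder to capture relationships between distant tokens as sequences grow longer, because all past information must be compressed into a fixed-size hidden state vector.

2.2 The Emergence and Limitations of Attention

The Attention mechanism proposed by Bahdanau et al. (2014) greatly alleviated the Long-range Dependency problem by allowing the Decoder to directly access all of the Encoder's hidden states. However, since Attention was still added on top of RNNs, the bottleneck of Sequential Processing remained.

The paper's core question was precisely this: "Is Attention alone sufficient, without Recurrence?"

The answer was yes, and the result was the Transformer.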

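3. Self-Attention Mechanism

3.1 Core Concept: Query, Key, Value

The key idea behind Self-Attention is that each token in a sequence directly computes its relationship with every other token. To achieve this, each input vector is transformed into three roles.

Query (Q): "What information am I looking for?"
Key (K): "What is the identifier of the information I can provide?"
Value (V): "What is the actual information I convey?"

Given an input sequence $X \in \mathbb{R}^{n \times d_{model}}$, Q, K, and V are generated through learnable weight matrices.

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

where $W^Q, W^K \in \mathbb{R}^{d_{model} \times d_k}$ and $W^V \in \mathbb{R}^{d_{model} \times d_v}$.

3.2 Intuitive Understanding

An information retrieval analogy makes this easier to understand. Imagine searching for a book in a library: the Query is the search term "introduction to deep learning," the Key is each book's title or tag, and the Value is the actual content of the book. The essence of Self-Attention is retrieving more of the Value from books whose Key is more similar to the Query.

The decisive difference between Self-Attention and RNNs is that the path length between any two tokens in the sequence is always $O(1)$. For RNNs it is $O(n)$, and for CNNs it is $O(\log_k n)$ (dilated) or $O(n/k)$ (standard). This short path length is what enables effective learning of Long-range Dependencies.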

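4. Scaled Dot-Product Attention

4.1 Formula

The paper defines the Attention function as follows.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Let us decompose this formula step by step.

Step 1: Similarity Computation ($QK^T$)

The Dot Product of Query and Key is computed. The result is an $n \times n$ Attention Score matrix. Each element $(i, j)$ represents the similarity between the Query of the $i$-th token and the Key of the $j$-th token.

Step 2: Scaling ($\frac{1}{\sqrt{d_k}}$)

As $d_k$ grows, the variance of the Dot Product values grows with it, which pushes the Softmax into regions where its gradients are extremely small. Specifically, if the components of $q$ and $k$ are independent random variables with mean 0 and variance 1, then $q \cdot k$ has mean 0 and variance $d_k$. Dividing by $\sqrt{d_k}$ brings the variance back to 1, which keeps the Softmax in a well-behaved range.

The paper motivates the Scaling with an observation it cites from earlier work: for small $d_k$, Additive Attention and Dot-Product Attention perform similarly, but for large $d_k$, Additive Attention outperforms Dot-Product Attention without Scaling.

Step 3: Softmax

Softmax is applied to the scaled scores to obtain Attention Weights. Since each row sums to 1, these serve as weights for a weighted average over the Values.

Step 4: Weighted Sum with Values

Finally, multiplying the Attention Weights by the Values yields, for each token, a vector that sums all tokens' Values in proportion to their relevance.

4.2 Masking

In the Decoder's Self-Attention, information from future tokens must not leak into the current position. Masked Attention enforces this by setting the scores at future positions to $-\infty$ before the Softmax, so that those positions receive zero weight.

$$\text{MaskedAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$

Here, $M$ is an $n \times n$ mask with $-\infty$ strictly above the diagonal (positions $j > i$, which token $i$ may not attend to) and $0$ everywhere else.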

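5. Multi-Head Attention

5.1 Limitations of Single Attention

Using only a single Attention function forces the model to capture token relationships from only one perspective. For example, in the sentence "The cat sat on the mat because it was tired," it becomes difficult to simultaneously capture the syntactic relationship that "it" refers to "cat" and the semantic relationship that "tired" describes the state of "cat."

5.2 Multi-Head Attention Structure

The paper solved this problem by running multiple Attention functions in parallel.

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, ..., \text{head}_h) W^O$$

where each head is defined as follows.

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

The weight matrices for each head are $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, and $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, and the final output projection is $W^O \in \mathbb{R}^{hd_v \times d_{model}}$.

5.3 Paper's Configuration

The paper used $h = 8$ heads with $d_k = d_v = d_{model} / h = 64$. Because the total dimension is divided by the number of heads, the total computational cost of Multi-Head Attention is nearly identical to that of Single-Head Attention with full dimensionality.

According to the paper's Ablation Study, a single head was 0.9 BLEU worse than the best setting, and quality also dropped with too many heads: with the total computation held constant, $h = 32$ leaves only $d_k = 16$ per head.

5.4 Three Usage Patterns

Multi-Head Attention is used in three places within the Transformer.

Encoder Self-Attention: Within the Encoder, each token of the input sequence attends to all other tokens. Q, K, and V are all generated from the output of the previous Encoder layer.
Decoder Self-Attention (Masked): Masked Attention within the Decoder that can only reference tokens generated so far.
Encoder-Decoder Attention (Cross-Attention): The Decoder's Query attends to the Encoder's Key and Value. This is the component most similar to the Attention in conventional Seq2Seq models.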

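6. Positional Encoding

6.1 Necessity

Self-Attention by itself has no notion of token order: it is permutation equivariant. If the input tokens are reordered, each token still receives the same output vector, and only the order of the outputs changes. Since word order carries crucial information in natural language, positional information must be explicitly injected.

6.2 Sinusoidal Positional Encoding

The paper proposed Positional Encoding using sine and cosine functions.

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i / d_{model}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i / d_{model}}}\right)$$

Here, $pos$ is the position within the sequence and $i$ is the dimension index. This Encoding is added element-wise to the input Embedding and passed to the model.

6.3 Why Sinusoidal?

There are clear reasons for choosing this function.

Relative Position Representation: For any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear transformation of $PE_{pos}$, which makes it easy for the model to learn relative positional relationships. For a single frequency $\omega = 1 / 10000^{2i / d_{model}}$:

$$\begin{bmatrix} \sin(pos \cdot \omega + k \cdot \omega) \\ \cos(pos \cdot \omega + k \cdot \omega) \end{bmatrix} = \begin{bmatrix} \cos(k\omega) & \sin(k\omega) \\ -\sin(k\omega) & \cos(k\omega) \end{bmatrix} \begin{bmatrix} \sin(pos \cdot \omega) \\ \cos(pos \cdot \omega) \end{bmatrix}$$

Extrapolation to Longer Sequences: The authors chose the Sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than those seen during training. Compared with learned Positional Embeddings, the paper reported that the two approaches produced "nearly identical results."

Frequency Spectrum: The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. Lower dimensions (smaller $i$) have shorter wavelengths for fine-grained position distinction, while higher dimensions have longer wavelengths for encoding broader positional relationships.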

7. Full Encoder-Decoder Architecture

7.1 Encoder Structure

The Encoder consists of $N = 6$ identical layers. Each layer has two Sub-layers.

Multi-Head Self-Attention
Position-wise Feed-Forward Network

Residual Connection and Layer Normalization are applied to each Sub-layer.

$$\text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))$$

7.2 Decoder Structure

The Decoder also consists of $N = 6$ identical layers, but unlike the Encoder, each layer has three Sub-layers.

Masked Multi-Head Self-Attention: Masks future positions to maintain the auto-regressive property.
Multi-Head Cross-Attention: Uses the Encoder's output as Key and Value.
Position-wise Feed-Forward Network

7.3 Overall Flow

The input sequence passes through Embedding + Positional Encoding and enters the Encoder, and the Encoder output after 6 layers is passed to the Cross-Attention of every Decoder layer. At inference time, the Decoder takes the previously generated tokens as input and outputs a probability distribution over the next token, repeating this process until the end-of-sequence token is produced. The output dimension of all Sub-layers is unified at $d_{model} = 512$.

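8. Feed-Forward Network, Layer Normalization, Residual Connection

8.1 Position-wise Feed-Forward Network (FFN)

Each Encoder and Decoder layer ends with a Position-wise FFN, placed after the Attention Sub-layer(s). "Position-wise" means it is applied independently to each position (token) with shared weights.

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

This is a ReLU activation function sandwiched between two Linear Transformations. The input and output dimensions are $d_{model} = 512$, and the inner dimension is $d_{ff} = 2048$. In other words, the layer expands the representation by a factor of 4 and then projects it back to the original size (the opposite of a narrowing Bottleneck).

The paper notes that this FFN can also be described as two convolutions with kernel size 1. It performs a nonlinear transformation on each token, converting the relational information captured by Attention into richer representations.

8.2 Residual Connection

This is a Skip Connection that adds the input of each Sub-layer to its output.

$$\text{output} = x + \text{Sublayer}(x)$$

This design, borrowed from ResNet, stabilizes training by allowing gradients to flow smoothly through deep networks. For Residual Connections to work, the two tensors being added must have identical dimensions, which is why the output dimensions of all Sub-layers and Embeddings are unified at $d_{model} = 512$.

8.3 Layer Normalization

Layer Normalization is applied to the output of each Sub-layer after the residual addition. Unlike Batch Normalization, it normalizes over the features of a single token rather than over the batch, so it does not depend on batch size.

$$\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sigma + \epsilon} + \beta$$

Here, $\mu$ and $\sigma$ are the mean and standard deviation over the $d_{model}$ features of one token, and $\gamma$ and $\beta$ are learnable parameters. The paper used the Post-Norm ordering, in which LN is applied after the Sub-layer output has been added to the residual.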

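9. Training Strategy

9.1 Optimizer and Learning Rate Schedule

The paper used the Adam Optimizer with a distinctive Learning Rate schedule. This schedule later became widely known as the "Noam Scheduler."

$$lr = d_{model}^{-0.5} \cdot \min(step^{-0.5}, \; step \cdot warmup\_steps^{-1.5})$$

The key feature of this schedule is Warmup. During the first $warmup\_steps$ (4,000 steps in the paper), the Learning Rate increases linearly, and afterwards it decreases proportionally to the inverse square root of the step number.

The paper itself does not explain why Warmup is needed. A common explanation is that Adam's second moment estimates are unreliable in the early stages of training, so keeping the Learning Rate low at first avoids drastic parameter updates until those estimates have stabilized.

The Adam Optimizer hyperparameters were $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$. Setting $\beta_2$ to 0.98, lower than the usual default of 0.999, shortens the averaging window of the second moment estimate. The paper does not give a reason; one interpretation is that it lets the optimizer adapt more quickly to changing gradient statistics, such as those caused by shifting Attention Score distributions.

9.2 Regularization

Residual Dropout: Dropout (rate = 0.1) is applied to the output of each Sub-layer before the Residual Connection. Dropout is also applied to the sum of Embedding + Positional Encoding in both the Encoder and Decoder.

Label Smoothing: Label Smoothing with $\epsilon_{ls} = 0.1$ was applied. Instead of a one-hot target, the correct class receives a target probability of $1 - \epsilon_{ls}$ and the remaining $\epsilon_{ls}$ is spread over the other classes, for example $\epsilon_{ls} / (K - 1)$ each for a vocabulary of $K$ classes (implementations differ in exactly how the mass is spread). The paper reported that Label Smoothing hurts Perplexity, because the model learns to be less certain, but improves Accuracy and BLEU Score. Discouraging overconfident predictions in this way tends to help generalization.

9.3 Training Data and Hardware

WMT 2014 English-German: about 4.5 million sentence pairs, encoded with Byte-Pair Encoding (BPE) into a shared vocabulary of about 37,000 tokens
WMT 2014 English-French: about 36 million sentence pairs, with a 32,000 Word-piece vocabulary
Batch: about 25,000 Source tokens and 25,000 Target tokens per batch
Hardware: 8 NVIDIA P100 GPUs
Training time: about 12 hours for the Base model (100K steps) and about 3.5 days for the Big model (300K steps)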

10. Key Experimental Results

10.1 Machine Translation Performance

Model	EN-DE BLEU	EN-FR BLEU	Training Cost (FLOPs)
Transformer (Base)	27.3	38.1	$3.3 \times 10^{18}$
Transformer (Big)	28.4	41.8	$2.3 \times 10^{19}$
Previous SOTA (including Ensembles)	26.36	41.29	-

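The Transformer Big model beat the previous best EN-DE result by more than 2 BLEU and also set a new SOTA on EN-FR. Even more striking, it reached this quality at a fraction of the training cost of the earlier models.

10.2 Model Size Comparison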

Config	$N$	$d_{model}$	$d_{ff}$	$h$	$d_k$	Parameters
Base	6	512	2048	8	64	65M
Big	6	1024	4096	16	64	213M

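10.3 Key Ablation Study Results

The paper's Ablation Study shows how much each design decision matters.

Number of Attention Heads: with total computation held constant, $h = 1$ is 0.9 BLEU worse than the best setting; $h = 16$ matches the base model, while $h = 32$ (where $d_k$ shrinks to 16) drops again
$d_k$ (Key dimension): reducing it hurts quality, since it directly limits how well Dot-Product Attention can score compatibility
$d_{model}$ (model dimension): larger models performed better
Dropout: removing it leads to overfitting and a clear drop in quality
Positional Encoding: learned and Sinusoidal variants perform nearly identically

10.4 English Constituency Parsing

To test generalization beyond translation, the authors also applied the model to English Constituency Parsing. It reached 91.3 F1 when trained only on the WSJ data and 92.7 F1 in the semi-supervised setting, which is competitive with task-specific models. This indicated that the Transformer is a general-purpose sequence model rather than one limited to machine translation.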

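11. Impact on Subsequent Research

The Transformer architecture became the foundation for almost every major advance in modern AI.

11.1 BERT (2018, Google)

BERT uses only the Encoder part of the Transformer and performs bidirectional pre-training. With two pre-training tasks, Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), it achieved SOTA on 11 NLP benchmarks. BERT established the Transfer Learning paradigm in NLP.

11.2 GPT Series (2018~, OpenAI)

GPT uses only the Decoder part of the Transformer and performs auto-regressive Language Modeling. Scaling up from GPT-1 (117M) to GPT-2 (1.5B) to GPT-3 (175B) demonstrated the power of Scaling Laws. GPT-3 showed Few-shot Learning ability and opened up new possibilities for AI, and it became the starting point of the large language model (LLM) wave that continued with ChatGPT and GPT-4.

11.3 Beyond

T5 (2019): Casts every NLP task in a Text-to-Text format and uses the full Encoder-Decoder structure
ViT (2020): Applies the Transformer to Computer Vision by splitting an image into patches and treating them as a sequence
DALL-E, Stable Diffusion: Image generation built on Transformer components (an auto-regressive Transformer in DALL-E; Attention layers and a Transformer text encoder inside a diffusion model in Stable Diffusion)
AlphaFold 2: Uses Attention mechanisms for protein structure prediction

A single paper ended up transforming almost every area of AI, from NLP to Computer Vision, biology, music, and robotics.

12. Core PyTorch Code Examples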

12.1 Scaled Dot-Product Attention

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

def scaled_dot_product_attention(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    mask: torch.Tensor = None,
    dropout: nn.Dropout = None
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Scaled Dot-Product Attention implementation.
    Args:
        query: (batch, h, seq_len, d_k)
        key:   (batch, h, seq_len, d_k)
        value: (batch, h, seq_len, d_v)
        mask:  Attention mask (optional)
    Returns:
        output: (batch, h, seq_len, d_v)
        attention_weights: (batch, h, seq_len, seq_len)
    """
    d_k = query.size(-1)

    # Step 1 & 2: QK^T / sqrt(d_k)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

    # Masking (e.g., Decoder Self-Attention): positions where mask == 0 are blocked
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # Step 3: Softmax
    attention_weights = F.softmax(scores, dim=-1)

    if dropout is not None:
        attention_weights = dropout(attention_weights)

    # Step 4: Weighted sum of values
    output = torch.matmul(attention_weights, value)
    return output, attention_weights

12.2 Multi-Head Attention

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, h: int = 8, dropout: float = 0.1):
        super().__init__()
        assert d_model % h == 0, "d_model must be divisible by h"

        self.d_model = d_model
        self.h = h
        self.d_k = d_model // h

        # Linear projections for Q, K, V, and the output
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        mask: torch.Tensor = None
    ) -> torch.Tensor:
        batch_size = query.size(0)

        # 1) Apply the linear projections, then reshape to (batch, h, seq_len, d_k)
        Q = self.W_q(query).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)

        # 2) Scaled Dot-Product Attention (all heads computed in parallel);
        #    mask must broadcast against (batch, h, seq_len, seq_len)
        attn_output, attn_weights = scaled_dot_product_attention(
            Q, K, V, mask=mask, dropout=self.dropout
        )

        # 3) Concatenate the heads: (batch, seq_len, d_model)
        attn_output = (
            attn_output.transpose(1, 2)
            .contiguous()
            .view(batch_size, -1, self.d_model)
        )

        # 4) Final linear projection
        return self.W_o(attn_output)

12.3 Positional Encoding

class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int = 512, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

        # Build the (max_len, d_model) Positional Encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )  # (d_model/2,)

        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions: sin
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions: cos

        pe = pe.unsqueeze(0)  # (1, max_len, d_model) - add a batch dimension
        self.register_buffer('pe', pe)  # registered as a buffer, not a trainable parameter

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: (batch, seq_len, d_model) - Embedding output
        Returns:
            (batch, seq_len, d_model) - input with the Positional Encoding added
        """
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)

12.4 Transformer Encoder Layer

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, h: int = 8, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()

        # Sub-layer 1: Multi-Head Self-Attention
        self.self_attn = MultiHeadAttention(d_model, h, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)

        # Sub-layer 2: Position-wise FFN
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        # Sub-layer 1: Self-Attention + Residual + LayerNorm
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))

        # Sub-layer 2: FFN + Residual + LayerNorm
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))

        return x

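In the code above, self.self_attn(x, x, x, mask) derives Q, K, and V from the same input x, which is why it is called "Self"-Attention. For Cross-Attention, pass the Encoder output as K and V instead.

13. Conclusion

"Attention Is All You Need" is not just a paper that proposed one more machine translation model. It broke with the long-standing reliance on Recurrence and showed empirically that Attention alone is sufficient.

The paper's core contributions can be summarized as follows.

Removing Recurrence: A parallelizable architecture that dramatically speeds up training
Self-Attention: Directly models the relationship between every pair of tokens in a sequence with $O(1)$ path length
Multi-Head Attention: Captures relationships from several perspectives at once
Scalability: A simple yet extensible design that later scaled to models with billions to trillions of parameters

Since the paper appeared in 2017, the Transformer has spread beyond NLP to Vision, Audio, Biology, Robotics, and almost every other area of AI. As the title says, Attention turned out to be all you need.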

References

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017. https://arxiv.org/abs/1706.03762
Full paper (HTML version): https://arxiv.org/html/1706.03762v7
Official NeurIPS PDF: https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf
Jay Alammar, The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
Harvard NLP, The Annotated Transformer: http://nlp.seas.harvard.edu/2018/04/03/attention.html
Devlin, J. et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers. https://arxiv.org/abs/1810.04805
Radford, A. et al. (2018). Improving Language Understanding by Generative Pre-Training (GPT). https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
UvA Deep Learning Tutorials - Transformers and Multi-Head Attention: https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html
Wikipedia - Attention Is All You Need: https://en.wikipedia.org/wiki/Attention_Is_All_You_Need

Attention Is All You Need - A Complete Analysis of the Transformer Paper

1. Paper Overview
2. Background and Motivation: The Limitations of RNN/LSTM
- 2.1 The Bottleneck of Sequential Processing
- 2.2 The Emergence and Limitations of Attention
3. Self-Attention Mechanism
- 3.1 Core Concept: Query, Key, Value
- 3.2 Intuitive Understanding
4. Scaled Dot-Product Attention
- 4.1 Formula
- 4.2 Masking
5. Multi-Head Attention
6. Positional Encoding
7. Full Encoder-Decoder Architecture
8. Feed-Forward Network, Layer Normalization, Residual Connection
9. Training Strategy
10. Key Experimental Results
11. Impact on Subsequent Research
12. Core PyTorch Code Examples
13. Conclusion
References

1. Paper Overview

"Attention Is All You Need" was presented at NeurIPS 2017 (then called NIPS). It was co-authored by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, most of whom were at Google Brain or Google Research. The paper showed that a Sequence-to-Sequence model can be built from the Attention mechanism alone, with no Recurrence or Convolution, and it became a turning point in the history of deep learning.

The Transformer architecture proposed in the paper achieved 28.4 BLEU on the WMT 2014 English-to-German translation task and 41.8 BLEU on English-to-French, surpassing all existing models. More importantly, this architecture subsequently became the foundation for virtually all major modern AI models, including BERT, GPT, T5, and ViT.

2. Background and Motivation: The Limitations of RNN/LSTM

2.1 The Bottleneck of Sequential Processing

Before the Transformer, the standard approach to Sequence Modeling was the RNN (Recurrent Neural Network) and its variants, LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit). These architectures process a sequence in order, $t = 1, 2, ..., n$, updating a hidden state $h_t$ at each step.

$$h_t = f(h_{t-1}, x_t)$$

This sequential nature gave rise to two fundamental problems.

First, the computation cannot be parallelized across time steps. Because each step depends on the result of the previous one, a sequence of length $n$ requires $n$ sequential operations, so the parallel processing power of GPUs is poorly used and training time grows with sequence length.

Second, the Long-range Dependency problem. Although LSTM was theoretically capable of learning long-term dependencies, in practice it became increasingly difficult to capture relationships between distant tokens as sequences grew longer. This is because all past information must be compressed into a fixed-size vector called the hidden state.

2.2 The Emergence and Limitations of Attention

The Attention mechanism proposed by Bahdanau et al. (2014) greatly alleviated the Long-range Dependency problem by allowing the Decoder to directly access all of the Encoder's hidden states. However, since Attention was still added on top of RNNs, the bottleneck of Sequential Processing remained.

The paper's core question was precisely this: "Is Attention alone sufficient, without Recurrence?"

The answer was Yes, and the result was the Transformer.

3. Self-Attention Mechanism

3.1 Core Concept: Query, Key, Value

The key idea behind Self-Attention is that each token in a sequence directly computes its relationship with every other token. To achieve this, each input vector is transformed into three roles.

Query (Q): "What information am I looking for?"
Key (K): "What is the identifier of the information I can provide?"
Value (V): "What is the actual information I convey?"

Given an input sequence $X \in \mathbb{R}^{n \times d_{model}}$, Q, K, and V are generated through learnable weight matrices.

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

where $W^Q, W^K \in \mathbb{R}^{d_{model} \times d_k}$ and $W^V \in \mathbb{R}^{d_{model} \times d_v}$.

3.2 Intuitive Understanding

An information retrieval analogy makes this easier to understand. Imagine searching for a book in a library: the Query is the search term "deep learning introductory book," the Key is each book's title or tag, and the Value is the actual content of the book. The essence of Self-Attention is retrieving more Value from books whose Key has higher similarity to the Query.

The decisive difference between Self-Attention and RNNs is that the path length between any two tokens in the sequence is always $O(1)$. For RNNs it is $O(n)$, and for CNNs it is $O(\log_k n)$ (dilated) or $O(n/k)$ (standard). This short path length is what enables effective learning of Long-range Dependencies.

4. Scaled Dot-Product Attention

4.1 Formula

The exact formula for the Attention function proposed in the paper is as follows.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Let us decompose this formula step by step.

Step 1: Similarity Computation ($QK^T$)

The Dot Product of Query and Key is computed. The result is an $n \times n$ Attention Score matrix. Each element $(i, j)$ represents the similarity between the Query of the $i$-th token and the Key of the $j$-th token.

Step 2: Scaling ($\frac{1}{\sqrt{d_k}}$)

As $d_k$ grows larger, the variance of the Dot Product values increases, causing the gradients of Softmax to become extremely small. Specifically, if each component of $q$ and $k$ is an independent random variable with mean 0 and variance 1, the variance of $q \cdot k$ is $d_k$. Dividing by $\sqrt{d_k}$ normalizes the variance to 1, allowing the Softmax to operate stably.
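The variance argument can be checked empirically. The following is a minimal sketch in plain Python (standard library only, rather than PyTorch) that samples random $q$ and $k$ vectors under the stated assumption of independent components with mean 0 and variance 1. The function name `dot_product_variance`, the sample count, and the seed are illustrative choices, not anything taken from the paper.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def dot_product_variance(d_k: int, samples: int = 5000) -> float:
    """Empirical variance of q . k when every component is drawn from N(0, 1)."""
    dots = []
    for _ in range(samples):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(qi * ki for qi, ki in zip(q, k)))
    return statistics.pvariance(dots)

d_k = 64
raw_var = dot_product_variance(d_k)   # expected to land close to d_k
scaled_var = raw_var / d_k            # Var(x / sqrt(d_k)) = Var(x) / d_k
print(f"raw variance ~ {raw_var:.1f}, scaled variance ~ {scaled_var:.2f}")
```

With $d_k = 64$ the raw variance should come out near 64 and the scaled variance near 1, up to sampling noise.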

The paper motivates the Scaling with an observation it cites from earlier work: for small $d_k$, Additive Attention and Dot-Product Attention perform similarly, but for large $d_k$, Additive Attention outperforms Dot-Product Attention without Scaling.

Step 3: Softmax

Softmax is applied to the scaled scores to obtain Attention Weights. Since each row sums to 1, these serve as weights for a weighted average over the Values.

Step 4: Weighted Sum with Values

Finally, the matrix multiplication of Attention Weights and Values yields an output for each token that is a vector summing all tokens' Values proportionally to their relevance.
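The four steps can be traced on a toy example. This is a sketch in plain Python with lists of rows standing in for matrices; the helper names `softmax` and `attention` and the input values are assumptions made for illustration, and $Q$, $K$, $V$ are given directly instead of being produced by learned projections.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for small matrices given as lists of rows."""
    d_k = len(K[0])
    # Steps 1 and 2: similarity scores QK^T, scaled by 1 / sqrt(d_k)
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    # Step 3: row-wise softmax turns scores into attention weights
    weights = [softmax(row) for row in scores]
    # Step 4: each output row is a weighted average of the rows of V
    d_v = len(V[0])
    output = [[sum(w * v_row[c] for w, v_row in zip(w_row, V)) for c in range(d_v)]
              for w_row in weights]
    return output, weights

# Toy input: 3 tokens, d_k = d_v = 2 (values chosen only for illustration)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

output, weights = attention(Q, K, V)
for w_row in weights:
    print([round(w, 3) for w in w_row])
```

Each printed row sums to 1, and the first token puts equal weight on the first and third keys because both have the same dot product with its query.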

4.2 Masking

In the Decoder's Self-Attention, information from future tokens must not leak into the current position. Masked Attention enforces this by setting the scores at future positions to $-\infty$ before the Softmax, so that those positions receive zero weight.

$$\text{MaskedAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$

Here, $M$ is an $n \times n$ mask with $-\infty$ strictly above the diagonal (positions $j > i$, which token $i$ may not attend to) and $0$ everywhere else.
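As a small illustration of the mask, the sketch below builds an additive causal mask for a 4-token sequence and applies it to a set of equal scores, so the effect of the mask is easy to read off. It is plain Python with hypothetical helper names (`causal_mask`, `softmax`); the scores are made up and assumed to be already scaled.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """n x n additive mask: 0 on and below the diagonal, -inf strictly above it."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical, already-scaled scores for a 4-token sequence (all equal for clarity)
scores = [[0.0] * 4 for _ in range(4)]
mask = causal_mask(4)
masked = [[s + m for s, m in zip(s_row, m_row)] for s_row, m_row in zip(scores, mask)]
weights = [softmax(row) for row in masked]

for row in weights:
    print(row)
```

Token $i$ ends up spreading its weight uniformly over positions $0$ to $i$ and assigning exactly zero to every later position.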

5. Multi-Head Attention

5.1 Limitations of Single Attention

Using only a single Attention function forces the model to capture token relationships from only one perspective. For example, in the sentence "The cat sat on the mat because it was tired," it becomes difficult to simultaneously capture the syntactic relationship that "it" refers to "cat" and the semantic relationship that "tired" describes the state of "cat."

5.2 Multi-Head Attention Structure

The paper solved this problem by running multiple Attention functions in parallel.

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, ..., \text{head}_h) W^O$$

where each head is defined as follows.

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

The weight matrices for each head are $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, and $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, and the final output projection is $W^O \in \mathbb{R}^{hd_v \times d_{model}}$.

5.3 Paper's Configuration

The paper used $h = 8$ heads with $d_k = d_v = d_{model} / h = 64$. Because the total dimension is divided by the number of heads, the total computational cost of Multi-Head Attention is nearly identical to that of Single-Head Attention with full dimensionality.
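A quick way to see why the cost stays comparable is to count projection weights, used here only as a rough proxy for the cost of the linear projections. The sketch below is plain arithmetic with the base-model sizes; biases are ignored, and the "single head" it compares against is assumed to keep the full width $d_k = d_v = d_{model}$.

```python
d_model, h = 512, 8
d_k = d_v = d_model // h          # 64 per head in the base model

# Weight counts (biases ignored): per-head W_i^Q, W_i^K, W_i^V plus the shared W^O
per_head = d_model * d_k + d_model * d_k + d_model * d_v
multi_head_params = h * per_head + (h * d_v) * d_model

# A single head that keeps the full width, d_k = d_v = d_model
single_head_params = 3 * d_model * d_model + d_model * d_model

print(d_k, multi_head_params, single_head_params)
```

Both counts come to $4 \cdot d_{model}^2$, since splitting the width across heads only partitions the same projection matrices.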

According to the paper's Ablation Study, a single head was 0.9 BLEU worse than the best setting, and quality also dropped with too many heads: with the total computation held constant, $h = 32$ leaves only $d_k = 16$ per head.

5.4 Three Usage Patterns

Multi-Head Attention is used in three places within the Transformer.

Encoder Self-Attention: Within the Encoder, each token of the input sequence attends to all other tokens. Q, K, and V are all generated from the output of the previous Encoder layer.
Decoder Self-Attention (Masked): Masked Attention within the Decoder that can only reference tokens generated so far.
Encoder-Decoder Attention (Cross-Attention): The Decoder's Query attends to the Encoder's Key and Value. This is the component most similar to the Attention in conventional Seq2Seq models.

6. Positional Encoding

6.1 Necessity

Self-Attention by itself has no notion of token order: it is permutation equivariant. If the input tokens are reordered, each token still receives the same output vector, and only the order of the outputs changes. Since word order carries crucial information in natural language, positional information must be explicitly injected.
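This order-agnostic behavior can be demonstrated directly. The sketch below runs a stripped-down Self-Attention (identity projections, so $Q = K = V = X$, and no Positional Encoding) on a toy sequence and on a reordered copy of it. The function name `self_attention`, the input vectors, and the permutation are illustrative assumptions.

```python
import math

def self_attention(X):
    """Self-attention with identity projections (Q = K = V = X), no positional encoding."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[c] for w, v in zip(weights, X)) for c in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]   # three toy token vectors
perm = [2, 0, 1]                            # an arbitrary reordering of the tokens

out = self_attention(X)
out_perm = self_attention([X[i] for i in perm])

# Each token gets the same output vector; only the row order follows the permutation
same = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for row_p, i in zip(out_perm, perm)
    for a, b in zip(row_p, out[i])
)
print(same)
```

Without positional information the model therefore cannot tell the two orderings apart, which is the gap Positional Encoding fills.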

6.2 Sinusoidal Positional Encoding

The paper proposed Positional Encoding using sine and cosine functions.

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i / d_{model}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i / d_{model}}}\right)$$

Here, $pos$ is the position within the sequence and $i$ is the dimension index. This Encoding is added element-wise to the input Embedding and passed to the model.

6.3 Why Sinusoidal?

There are clear reasons for choosing this function.

Relative Position Representation: For any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear transformation of $PE_{pos}$, which makes it easy for the model to learn relative positional relationships. For a single frequency $\omega = 1 / 10000^{2i / d_{model}}$:

$$\begin{bmatrix} \sin(pos \cdot \omega + k \cdot \omega) \\ \cos(pos \cdot \omega + k \cdot \omega) \end{bmatrix} = \begin{bmatrix} \cos(k\omega) & \sin(k\omega) \\ -\sin(k\omega) & \cos(k\omega) \end{bmatrix} \begin{bmatrix} \sin(pos \cdot \omega) \\ \cos(pos \cdot \omega) \end{bmatrix}$$

Extrapolation to Longer Sequences: The authors chose the Sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than those seen during training. Compared with learned Positional Embeddings, the paper reported that the two approaches produced "nearly identical results."

Frequency Spectrum: The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. Lower dimensions (smaller $i$) have shorter wavelengths for fine-grained position distinction, while higher dimensions have longer wavelengths for encoding broader positional relationships.
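The relative-position property can be verified numerically. The sketch below computes the Sinusoidal encoding for one position in plain Python and checks that applying the $2 \times 2$ rotation for offset $k$ to each (sin, cos) pair reproduces the encoding of position $pos + k$. The function names `positional_encoding` and `shift` and the small values of `d_model`, `pos`, and `k` are assumptions chosen for illustration.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding of one position, following the two PE formulas."""
    pe = []
    for i in range(d_model // 2):
        omega = 1.0 / (10000 ** (2 * i / d_model))   # angular frequency of pair i
        pe.extend([math.sin(pos * omega), math.cos(pos * omega)])
    return pe

def shift(pe, k, d_model):
    """Apply the 2x2 rotation for offset k to every (sin, cos) pair of pe."""
    shifted = []
    for i in range(d_model // 2):
        omega = 1.0 / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        shifted.append(math.cos(k * omega) * s + math.sin(k * omega) * c)
        shifted.append(-math.sin(k * omega) * s + math.cos(k * omega) * c)
    return shifted

d_model, pos, k = 8, 5, 3            # small, arbitrary values for illustration
direct = positional_encoding(pos + k, d_model)
rotated = shift(positional_encoding(pos, d_model), k, d_model)
max_error = max(abs(a - b) for a, b in zip(direct, rotated))
print(max_error)                     # should be at floating-point noise level

# Wavelength of pair i is 2*pi / omega_i, growing geometrically with i
wavelengths = [2 * math.pi * 10000 ** (2 * i / d_model) for i in range(d_model // 2)]
```

Note that the rotation depends only on the offset $k$ and the frequency, not on $pos$, which is what makes the transformation linear in $PE_{pos}$.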

7. Full Encoder-Decoder Architecture

7.1 Encoder Structure

The Encoder consists of $N = 6$ identical layers. Each layer has two Sub-layers.

Multi-Head Self-Attention
Position-wise Feed-Forward Network

Residual Connection and Layer Normalization are applied to each Sub-layer.

$$\text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))$$

7.2 Decoder Structure

The Decoder also consists of $N = 6$ identical layers, but unlike the Encoder, each layer has three Sub-layers.

Masked Multi-Head Self-Attention: Masks future positions to maintain the auto-regressive property.
Multi-Head Cross-Attention: Uses the Encoder's output as Key and Value.
Position-wise Feed-Forward Network

7.3 Overall Flow

The input sequence passes through Embedding + Positional Encoding and enters the Encoder, and the Encoder output after 6 layers is passed to the Cross-Attention of every Decoder layer. At inference time, the Decoder takes the previously generated tokens as input and outputs a probability distribution over the next token, repeating this process until the end-of-sequence token is produced. The output dimension of all Sub-layers is unified at $d_{model} = 512$.

8. Feed-Forward Network, Layer Normalization, Residual Connection

8.1 Position-wise Feed-Forward Network (FFN)

Each Encoder and Decoder layer ends with a Position-wise FFN, placed after the Attention Sub-layer(s). "Position-wise" means it is applied independently to each position (token) with shared weights.

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

This is a ReLU activation function sandwiched between two Linear Transformations. The input and output dimensions are $d_{model} = 512$, and the inner dimension is $d_{ff} = 2048$. In other words, the layer expands the representation by a factor of 4 and then projects it back to the original size (the opposite of a narrowing Bottleneck).

The paper notes that this FFN can also be described as two convolutions with kernel size 1. It performs a nonlinear transformation on each token, converting the relational information captured by Attention into richer representations.
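To make "position-wise" concrete, the sketch below applies one FFN, with one set of weights, to every position of a toy sequence. It is plain Python with toy sizes ($d_{model} = 2$, $d_{ff} = 4$) and made-up weights; the function name `ffn` is an assumption. The last lines count the weights and biases of a single FFN at the paper's base sizes.

```python
def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position vector x."""
    hidden = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    return [sum(hj * W2[j][k] for j, hj in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]

# Toy sizes d_model = 2, d_ff = 4 with made-up weights (the paper uses 512 and 2048)
W1 = [[1.0, -1.0, 0.5, 0.0],
      [0.0, 1.0, -0.5, 2.0]]          # shape (d_model, d_ff)
b1 = [0.0, 0.0, 0.1, -1.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0],
      [0.5, -0.5]]                    # shape (d_ff, d_model)
b2 = [0.0, 0.0]

sequence = [[1.0, 2.0], [0.5, -1.0], [1.0, 2.0]]      # positions 0 and 2 hold the same vector
outputs = [ffn(x, W1, b1, W2, b2) for x in sequence]  # same weights at every position

# Weights and biases of one FFN at the paper's base sizes
d_model, d_ff = 512, 2048
ffn_params = d_model * d_ff + d_ff + d_ff * d_model + d_model
print(outputs, ffn_params)
```

Because no information flows between positions inside the FFN, identical input vectors produce identical outputs regardless of where they sit in the sequence.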

8.2 Residual Connection

This is a Skip Connection that adds the input of each Sub-layer to its output.

$$\text{output} = x + \text{Sublayer}(x)$$

This design, borrowed from ResNet, stabilizes training by allowing gradients to flow smoothly through deep networks. For Residual Connections to work properly, the dimensions of the two tensors being added must be identical, which is why the output dimensions of all Sub-layers and Embeddings are unified at $d_{model} = 512$.

8.3 Layer Normalization

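Layer Normalization is applied after each Residual Connection, so every Sub-layer produces $\text{LayerNorm}(x + \text{Sublayer}(x))$ . Unlike Batch Normalization, which computes statistics across the batch, Layer Normalization normalizes across the Features of each individual token, making it independent of batch size.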

\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sigma + \epsilon} + \beta

Here, $\mu$ and $\sigma$ are the mean and standard deviation across all dimensions of the layer, and $\gamma$ and $\beta$ are learnable parameters. The paper used the Post-Norm approach (applying LN after the Sublayer output + Residual).
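A hand-written version makes the normalization axis explicit. This is a sketch under one stated assumption: it follows PyTorch's convention of adding $\epsilon$ inside the square root, $\sqrt{\sigma^2 + \epsilon}$ , which differs slightly from the $\sigma + \epsilon$ form written above. The function name `layer_norm_manual` is hypothetical.

```python
import torch
import torch.nn.functional as F

def layer_norm_manual(
    x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor, eps: float = 1e-5
) -> torch.Tensor:
    """Normalize over the last (feature) dimension, separately for every token."""
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)  # biased variance, as in PyTorch
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta

torch.manual_seed(0)
x = torch.randn(2, 10, 512)                  # (batch, seq_len, d_model)
gamma, beta = torch.ones(512), torch.zeros(512)

manual = layer_norm_manual(x, gamma, beta)
reference = F.layer_norm(x, (512,), weight=gamma, bias=beta, eps=1e-5)

# Normalizing one sample on its own gives the same values: no batch statistics are involved.
alone = layer_norm_manual(x[:1], gamma, beta)

print(torch.allclose(manual, reference, atol=1e-5))
print(torch.allclose(manual[:1], alone))
```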

9. Training Strategy

9.1 Optimizer and Learning Rate Schedule

The paper used the Adam Optimizer with a distinctive Learning Rate schedule. This schedule later became widely known as the "Noam Scheduler."

lr = d_{model}^{-0.5} \cdot \min(step^{-0.5}, \; step \cdot warmup\_steps^{-1.5})

The key feature of this schedule is Warmup. During the first $warmup\_steps$ (4,000 steps in the paper), the Learning Rate increases linearly, and afterwards it decreases proportionally to the inverse square root of the step number.
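The schedule is easy to evaluate directly. Below is a minimal sketch of the formula above; the function name `noam_lr` is an assumption, and steps are counted from 1 so that $step^{-0.5}$ is defined.

```python
def noam_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate from the schedule above; step counts from 1."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 1000, 4000, 8000, 100_000):
    print(f"step {s:>6}: lr = {noam_lr(s):.6f}")
```

With the paper's settings the two terms inside $\min$ are equal at $step = warmup\_steps$ , which is where the Learning Rate peaks, at roughly $7 \times 10^{-4}$ for $d_{model} = 512$ .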

A common explanation for Warmup is that Adam's second moment estimates are noisy in the early stages of training, when they are based on only a few gradients. Keeping the Learning Rate low initially prevents parameters from changing drastically and allows training to begin in earnest once the moment estimates have stabilized.

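The Adam Optimizer hyperparameters were $\beta_1 = 0.9$ , $\beta_2 = 0.98$ , $\epsilon = 10^{-9}$ . It is noteworthy that $\beta_2$ was set to 0.98, lower than the common default of 0.999. A lower $\beta_2$ shortens the window over which squared gradients are averaged; one interpretation, not stated in the paper itself, is that this helps the optimizer adapt to rapid changes in Attention Score distributions.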
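One way to wire these settings together in PyTorch is sketched below. This is an assumed setup, not the paper's training code: the `nn.Linear` model is a stand-in for a full Transformer, and the base Learning Rate is set to 1.0 so that `LambdaLR`, which multiplies the base rate by the returned factor, yields the schedule value directly.

```python
import torch
import torch.nn as nn

d_model, warmup_steps = 512, 4000
model = nn.Linear(d_model, d_model)  # stand-in for a full Transformer

# Adam with the paper's hyperparameters; lr=1.0 lets the scheduler set the actual rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)

def noam_factor(step: int) -> float:
    step = max(step, 1)  # LambdaLR starts counting at 0; avoid 0 ** -0.5
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_factor)

for _ in range(3):
    loss = model(torch.randn(8, d_model)).pow(2).mean()  # dummy objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # update the Learning Rate once per training step

current_lr = optimizer.param_groups[0]["lr"]
print(current_lr)
```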

9.2 Regularization

Residual Dropout: Dropout (rate = 0.1) is applied to the output of each Sub-layer before the Residual Connection. Dropout is also applied to the sum of Embedding + Positional Encoding in both the Encoder and Decoder.

Label Smoothing: Label Smoothing with $\epsilon_{ls} = 0.1$ was applied. This technique sets the target probability of the correct class to $1 - \epsilon_{ls}$ rather than 1, and the target probability of other classes to $\epsilon_{ls} / (K - 1)$ . The paper reported that Label Smoothing worsens Perplexity but improves Accuracy and BLEU Score. This is because it prevents the model from becoming overconfident, thereby improving generalization performance.
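The smoothed target distribution described above can be built by hand. This is a minimal sketch following the $\epsilon_{ls} / (K - 1)$ formulation in the text; the function names are hypothetical, and padding tokens are not handled. Note that the `label_smoothing` argument of PyTorch's built-in `F.cross_entropy` spreads $\epsilon$ uniformly over all $K$ classes including the correct one, so its values differ slightly from this formulation.

```python
import torch

def smooth_labels(targets: torch.Tensor, num_classes: int, eps: float = 0.1) -> torch.Tensor:
    """Return (batch, num_classes) soft targets: 1 - eps on the true class,
    eps / (num_classes - 1) on every other class."""
    dist = torch.full((targets.size(0), num_classes), eps / (num_classes - 1))
    dist.scatter_(1, targets.unsqueeze(1), 1.0 - eps)
    return dist

def smoothed_cross_entropy(logits: torch.Tensor, targets: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    """Cross-entropy between the smoothed targets and the model's predicted distribution."""
    log_probs = torch.log_softmax(logits, dim=-1)
    soft = smooth_labels(targets, logits.size(-1), eps)
    return -(soft * log_probs).sum(dim=-1).mean()

targets = torch.tensor([2, 0])
soft = smooth_labels(targets, num_classes=5)
print(soft)

torch.manual_seed(0)
loss = smoothed_cross_entropy(torch.randn(2, 5), targets)
print(loss.item())
```

With $K = 5$ and $\epsilon_{ls} = 0.1$ , the correct class receives 0.9 and each of the other four classes receives 0.025.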

9.3 Training Data and Hardware

WMT 2014 English-German: Approximately 4.5 million sentence pairs, using Byte-Pair Encoding (BPE) with a shared vocabulary of approximately 37,000 tokens
WMT 2014 English-French: Approximately 36 million sentence pairs, using a 32,000 Word-piece vocabulary
Batch: Containing approximately 25,000 Source tokens + 25,000 Target tokens
Hardware: 8 NVIDIA P100 GPUs
Training Time: Approximately 12 hours for the Base model (100K steps), approximately 3.5 days for the Big model (300K steps)

10. Key Experimental Results

10.1 Machine Translation Performance

Model	EN-DE BLEU	EN-FR BLEU	Training Cost (FLOPs)
Transformer (Base)	27.3	38.1	$3.3 \times 10^{18}$
Transformer (Big)	28.4	41.8	$2.3 \times 10^{19}$
Previous SOTA (including Ensemble)	26.36	41.29	-

The Transformer Big model surpassed the previous best performance on EN-DE by more than 2 BLEU and set a new SOTA on EN-FR as well. What is even more remarkable is that this performance was achieved at a fraction of the training cost of existing models.

10.2 Model Size Comparison

Config	$N$	$d_{model}$	$d_{ff}$	$h$	$d_k$	Parameters
Base	6	512	2048	8	64	65M
Big	6	1024	4096	16	64	213M
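The parameter counts in the table can be roughly reproduced from the dimensions alone. The sketch below is back-of-the-envelope arithmetic, not an exact accounting: it ignores biases and Layer Normalization parameters, and it assumes a single embedding matrix of about 37,000 tokens shared between the source embedding, target embedding, and output projection. The function name `approx_params` is hypothetical.

```python
def approx_params(d_model: int, d_ff: int, n_layers: int = 6, vocab: int = 37_000) -> int:
    """Rough weight count; ignores biases and LayerNorm parameters."""
    attn = 4 * d_model * d_model           # W_q, W_k, W_v, W_o
    ffn = 2 * d_model * d_ff               # two linear maps
    encoder = n_layers * (attn + ffn)      # self-attention + FFN
    decoder = n_layers * (2 * attn + ffn)  # self-attention + cross-attention + FFN
    embedding = vocab * d_model            # assumed shared embedding matrix
    return encoder + decoder + embedding

base = approx_params(512, 2048)
big = approx_params(1024, 4096)
print(f"Base ~{base / 1e6:.0f}M, Big ~{big / 1e6:.0f}M")

# d_k = d_model / h stays at 64 in both configurations
print(512 // 8, 1024 // 16)
```

This estimate gives about 63M for Base and 214M for Big, close to the reported 65M and 213M; the remaining gap comes from the terms the sketch leaves out and from the vocabulary assumption.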

10.3 Key Ablation Study Results

The paper's Ablation Study clearly demonstrates the importance of each design decision.

Number of Attention Heads: Single-head attention ($h = 1$) was 0.9 BLEU worse than the best setting, and quality also dropped with too many heads ($h = 32$), where $d_k$ shrinks to 16
$d_k$ (Key Dimension): Reducing it led to quality degradation. It directly affects the representational capacity of Dot-Product Attention
$d_{model}$ (Model Dimension): Performance consistently improved with larger values
Dropout: Without it, overfitting occurred with significant performance drops
Positional Encoding: Learnable and Sinusoidal approaches achieved nearly identical performance

10.4 English Constituency Parsing

To verify generalization ability beyond translation, the model was also applied to English Constituency Parsing. It achieved 91.3 F1 using only WSJ data and 92.7 F1 in a semi-supervised setting, showing competitive performance with task-specific models. This demonstrated that the Transformer is a general-purpose sequence model not limited to machine translation.

11. Impact on Subsequent Research

The Transformer architecture has become the foundation for virtually all major advances in modern AI.

11.1 BERT (2018, Google)

Using only the Encoder portion of the Transformer, BERT performed bidirectional pre-training. Through two pre-training tasks — Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) — it achieved SOTA on 11 NLP benchmarks. BERT established the Transfer Learning paradigm in NLP.

11.2 GPT Series (2018~, OpenAI)

Using only the Decoder portion of the Transformer, GPT performed Auto-regressive Language Modeling. Scaling up from GPT-1 (117M) to GPT-2 (1.5B) to GPT-3 (175B) demonstrated the power of Scaling Laws. GPT-3 showcased Few-shot Learning capabilities, opening new possibilities for AI and becoming the starting point for the Large Language Model (LLM) revolution that led to ChatGPT and GPT-4.

11.3 Beyond

T5 (2019): Unified all NLP tasks into a Text-to-Text format, using the full Encoder-Decoder structure
ViT (2020): Applied the Transformer to Computer Vision, dividing images into patches and processing them as sequences
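DALL-E, Stable Diffusion: Leveraged Transformer components for image generation (DALL-E as an Auto-regressive Transformer, Stable Diffusion through Attention layers and a Transformer text encoder)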
AlphaFold 2: Utilized the Attention mechanism for protein structure prediction

A single paper has transformed nearly every field of AI, from NLP to Computer Vision, biology, music, and robotics.

12. Core PyTorch Code Examples

12.1 Scaled Dot-Product Attention

import math
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    mask: Optional[torch.Tensor] = None,
    dropout: Optional[nn.Dropout] = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Scaled Dot-Product Attention implementation.
    Args:
        query: (batch, h, len_q, d_k)
        key:   (batch, h, len_k, d_k)
        value: (batch, h, len_k, d_v)
        mask:  optional; broadcastable to (batch, h, len_q, len_k), 0 = masked position
        dropout: optional nn.Dropout applied to the attention weights
    Returns:
        output: (batch, h, len_q, d_v)
        attention_weights: (batch, h, len_q, len_k)
    """
    d_k = query.size(-1)

    # Step 1 & 2: QK^T / sqrt(d_k)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

    # Masking (for Decoder Self-Attention, etc.)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # Step 3: Softmax
    attention_weights = F.softmax(scores, dim=-1)

    if dropout is not None:
        attention_weights = dropout(attention_weights)

    # Step 4: Weighted sum of values
    output = torch.matmul(attention_weights, value)
    return output, attention_weights

12.2 Multi-Head Attention

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, h: int = 8, dropout: float = 0.1):
        super().__init__()
        assert d_model % h == 0, "d_model must be divisible by h"

        self.d_model = d_model
        self.h = h
        self.d_k = d_model // h

        # Linear projections for Q, K, V, and Output
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        mask: torch.Tensor = None
    ) -> torch.Tensor:
        batch_size = query.size(0)

        # 1) Linear projection then reshape to (batch, h, seq_len, d_k)
        Q = self.W_q(query).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)

        # 2) Scaled Dot-Product Attention (all heads in parallel)
        attn_output, attn_weights = scaled_dot_product_attention(
            Q, K, V, mask=mask, dropout=self.dropout
        )

        # 3) Concatenate head results: (batch, seq_len, d_model)
        attn_output = (
            attn_output.transpose(1, 2)
            .contiguous()
            .view(batch_size, -1, self.d_model)
        )

        # 4) Final linear projection
        return self.W_o(attn_output)

12.3 Positional Encoding

class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int = 512, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

        # Create Positional Encoding matrix of size (max_len, d_model)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )  # (d_model/2,)

        pe[:, 0::2] = torch.sin(position * div_term)  # Even dimensions: sin
        pe[:, 1::2] = torch.cos(position * div_term)  # Odd dimensions: cos

        pe = pe.unsqueeze(0)  # (1, max_len, d_model) - add batch dimension
        self.register_buffer('pe', pe)  # Register as buffer, not a learnable parameter

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: (batch, seq_len, d_model) - Embedding output
        Returns:
            (batch, seq_len, d_model) - Result with Positional Encoding added
        """
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)

12.4 Transformer Encoder Layer

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, h: int = 8, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()

        # Sub-layer 1: Multi-Head Self-Attention
        self.self_attn = MultiHeadAttention(d_model, h, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)

        # Sub-layer 2: Position-wise FFN
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        # Sub-layer 1: Self-Attention + Residual + LayerNorm
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))

        # Sub-layer 2: FFN + Residual + LayerNorm
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_output))

        return x

In the code above, self.self_attn(x, x, x, mask) passes the same input x for Q, K, and V, which is why it is called "Self"-Attention. For Cross-Attention, the Encoder output would be passed for K and V.
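The Cross-Attention call pattern can be sketched as follows. To keep the example self-contained it uses PyTorch's built-in `nn.MultiheadAttention` rather than the `MultiHeadAttention` class defined above; the tensor names `memory` and `dec_x` and the sequence lengths are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, h = 512, 8

# Built-in module used as a stand-in for the custom MultiHeadAttention class.
cross_attn = nn.MultiheadAttention(d_model, h, batch_first=True)

memory = torch.randn(2, 12, d_model)  # Encoder output: (batch, src_len, d_model)
dec_x = torch.randn(2, 7, d_model)    # Decoder states: (batch, tgt_len, d_model)

# Query comes from the Decoder; Key and Value both come from the Encoder output.
out, weights = cross_attn(query=dec_x, key=memory, value=memory)

print(out.shape)      # one output vector per Decoder position
print(weights.shape)  # one distribution over source positions per Decoder position
```

The output keeps the Decoder's sequence length, while each row of `weights` is a distribution over the source positions. This is how every Decoder position can look at the entire input sentence.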

13. Conclusion

"Attention Is All You Need" is not merely a paper proposing a single machine translation model. It is a paper that broke the long-standing inertia of Recurrence and empirically proved the bold claim that Attention alone is sufficient.

The core contributions of this paper can be summarized as follows.

Elimination of Recurrence: Dramatically improved training speed with a parallelizable architecture
Self-Attention: Directly models relationships between all tokens in a sequence with $O(1)$ path length
Multi-Head Attention: Captures relationships simultaneously from multiple perspectives
Scalability: A simple yet scalable architectural design that enables scaling up to billions and even trillions of parameters

Since its publication in 2017, the Transformer has spread from NLP to Vision, Audio, Biology, Robotics, and nearly every other domain of AI. True to the paper's title, Attention really was All You Need.

References

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017. https://arxiv.org/abs/1706.03762
Full paper (HTML version): https://arxiv.org/html/1706.03762v7
NeurIPS Official PDF: https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf
Jay Alammar, The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
Harvard NLP, The Annotated Transformer: http://nlp.seas.harvard.edu/2018/04/03/attention.html
Devlin, J. et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers. https://arxiv.org/abs/1810.04805
Radford, A. et al. (2018). Improving Language Understanding by Generative Pre-Training (GPT). https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
UvA Deep Learning Tutorials - Transformers and Multi-Head Attention: https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html
Wikipedia - Attention Is All You Need: https://en.wikipedia.org/wiki/Attention_Is_All_You_Need