Vision Transformer (ViT) Paper In-Depth Analysis: An Image is Worth 16x16 Words


1. Paper Overview

"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" was published in October 2020 by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, and others from Google Research, and was officially accepted at ICLR 2021. This paper demonstrated that the Transformer architecture, which achieved overwhelming success in NLP, could be applied directly to the image classification task in Computer Vision with virtually no modifications.

The core idea is remarkably simple: divide an image into fixed-size patches, treat each patch as a token (analogous to tokens in NLP), and feed them into a standard Transformer Encoder. The paper's title itself intuitively captures this idea -- an image is made up of 16x16-sized words.

The results were groundbreaking. The ViT-H/14 model, pre-trained on the large-scale JFT-300M dataset, achieved 88.55% top-1 accuracy on ImageNet, surpassing the existing CNN-based SOTA (State-of-the-Art) models BiT (Big Transfer) and Noisy Student. More importantly, the training cost required to achieve this was significantly lower than existing methods.

This paper triggered a paradigm shift in the field of Computer Vision. After ViT, numerous follow-up studies emerged -- DeiT, Swin Transformer, BEiT, MAE, DINO, and many more -- giving birth to an entirely new research field called "Vision Transformer."


2. Background: Computer Vision Dominated by CNNs

2.1 The Golden Age of Convolutional Neural Networks

After AlexNet's ImageNet Challenge victory in 2012, CNNs were effectively the only architecture in Computer Vision. The lineage of VGGNet (2014), GoogLeNet/Inception (2014), ResNet (2015), DenseNet (2017), and EfficientNet (2019) all shared convolution operations as their core.

The fundamental reasons CNNs succeeded in vision tasks lie in two inductive biases.

First, Locality. Convolution filters process only local regions of the input. This architecturally encodes the fact that adjacent pixels in images have strong correlations.

Second, Translation Equivariance. The same filter is shared and applied across all positions in the image, enabling the extraction of identical features regardless of where an object is located.

Thanks to these two biases, CNNs could effectively learn visual patterns even with relatively small amounts of data.

2.2 The Transformer Revolution in NLP

After the publication of "Attention Is All You Need" in 2017, the NLP field was completely transformed. Transformer-based models including BERT (2018), GPT-2 (2019), T5 (2019), and GPT-3 (2020) dominated virtually every NLP benchmark.

The key strengths of the Transformer are:

  • Global Receptive Field through Self-Attention: Direct interaction between any two positions in the sequence is possible
  • Excellent Parallelism: Unlike RNNs, all positions in the sequence can be processed simultaneously
  • Outstanding Scalability: Performance steadily improves as model size and data increase, exhibiting Scaling Laws

2.3 Previous Attempts to Apply Transformers to Vision

Prior to ViT, there were attempts to introduce attention mechanisms into Vision:

  • Non-local Neural Networks (Wang et al., 2018): Inserted Self-Attention blocks within CNNs to capture long-range dependencies
  • Stand-Alone Self-Attention (Ramachandran et al., 2019): Replaced convolutions with local self-attention
  • DETR (Carion et al., 2020): Leveraged Transformer Decoder for object detection

However, all of these either hybridized CNN and Attention or used specially designed attention mechanisms. ViT's innovation lies in directly applying the standard Transformer to Vision with minimal modifications, without such compromises.


3. Core Idea: Converting Images into Patch Sequences

3.1 Why Images Cannot Be Fed Directly into Transformers

The standard Transformer takes a 1D token sequence as input. Suppose we feed an image into a Transformer pixel by pixel. A 224x224 resolution image has 50,176 pixels. Since the computational complexity of Self-Attention is $O(N^2)$ in the sequence length $N$, Self-Attention over 50,176 tokens would require approximately 2.5 billion pairwise interactions. This is practically infeasible.

3.2 Reducing Sequence Length by Splitting into Patches

ViT's solution is elegant yet intuitive: divide the image into fixed-size patches and treat each patch as a single token.

When splitting a 224x224 image into 16x16 patches, the number of patches is:

$$N = \frac{H \times W}{P^2} = \frac{224 \times 224}{16 \times 16} = 196$$

Instead of 50,176 pixels, only 196 patch tokens need to be processed. The Self-Attention cost drops to $196^2 = 38{,}416$ pairwise interactions, a reduction of approximately 65,000x compared to the pixel-level approach.

The key premise of this transformation is: a single 16x16 patch can function as a semantic unit equivalent to a single word in NLP. Just as natural language is a sequence of discrete tokens called words, images can be represented as a sequence of visual tokens called patches.
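The arithmetic above can be checked in a few lines; a minimal sketch:

```python
# Patch tokenization arithmetic for a 224x224 RGB image and 16x16 patches.
H = W = 224
P = 16

num_pixels = H * W                  # 50,176 pixel-level tokens
num_patches = (H * W) // (P * P)    # 196 patch-level tokens

# Self-Attention cost scales with the square of the sequence length.
pixel_attn = num_pixels ** 2
patch_attn = num_patches ** 2       # 196^2 = 38,416

print(num_patches)                  # 196
print(pixel_attn // patch_attn)     # 65536 -- the ~65,000x reduction
```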

3.3 Correspondence with NLP

| NLP | Vision (ViT) |
| --- | --- |
| Sentence | Image |
| Word/Token | Image Patch |
| Vocabulary | Space of possible patch patterns |
| Token Embedding | Patch Embedding (Linear Projection) |
| Position Embedding | Position Embedding (1D Learnable) |
| [CLS] Token | [CLS] Token |
| Transformer Encoder | Transformer Encoder (identical) |

This correspondence is essentially all there is to ViT. The essence of ViT is bringing the proven NLP Transformer structure over with minimal changes.


4. Detailed Architecture Analysis

4.1 Overall Pipeline Overview

The complete processing pipeline of ViT is as follows:

  1. Split the input image into fixed-size patches
  2. Flatten each patch into a 1D vector
  3. Transform patch vectors into D-dimensional embeddings via Linear Projection
  4. Prepend a learnable [CLS] token to the sequence
  5. Add Position Embeddings to inject positional information
  6. Feed into a standard Transformer Encoder
  7. Perform classification using the final output of the [CLS] token
Input Image (224x224x3)
     |
     v
Patch Split (196 patches of 16x16x3)
     |
     v
Flatten (196 vectors of 768 dimensions)
     |
     v
Linear Projection (196 D-dimensional Embeddings)
     |
     v
[CLS] Token Prepend (197 D-dimensional vectors)
     |
     v
Add Position Embedding (197 D-dimensional vectors)
     |
     v
Transformer Encoder (L blocks)
     |
     v
Extract [CLS] Token Output
     |
     v
Classification Head (MLP) -> Class Prediction

4.2 Patch Embedding: Turning Images into Tokens

4.2.1 Patch Splitting and Flattening

The input image $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$ is split into $P \times P$ patches. Each patch is flattened to $\mathbf{x}_p^i \in \mathbb{R}^{P^2 \cdot C}$.

For example, splitting a 224x224x3 (RGB) image into 16x16 patches:

  • Number of patches: $N = 224^2 / 16^2 = 196$
  • Dimension of each flattened patch vector: $P^2 \cdot C = 16 \times 16 \times 3 = 768$

4.2.2 Linear Projection

A learnable Linear Projection $\mathbf{E} \in \mathbb{R}^{(P^2 \cdot C) \times D}$ is applied to each flattened patch vector, mapping it to a D-dimensional embedding space.

$$\mathbf{z}_0^i = \mathbf{x}_p^i \mathbf{E}, \quad i = 1, 2, \ldots, N$$

Here $D$ is the Transformer's Hidden Dimension. For ViT-Base, where $D = 768$, the Linear Projection is a 768x768 matrix.

An interesting point is that this Linear Projection is mathematically equivalent to a convolution with stride equal to the patch size. That is, it is identical to Conv2d(in_channels=3, out_channels=D, kernel_size=P, stride=P). In practice, implementations typically use convolution as it is more efficient.
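This conv/linear equivalence is easy to verify numerically. The sketch below (dimensions chosen arbitrarily for illustration) compares a strided convolution against explicit patch extraction followed by a matrix multiply:

```python
import torch
import torch.nn.functional as F

B, C, H, W, P, D = 2, 3, 32, 32, 16, 8
x = torch.randn(B, C, H, W)
weight = torch.randn(D, C, P, P)  # shared conv kernel / projection matrix

# Path 1: convolution with kernel_size = stride = patch size
conv_out = F.conv2d(x, weight, stride=P)           # (B, D, H/P, W/P)
conv_tokens = conv_out.flatten(2).transpose(1, 2)  # (B, N, D)

# Path 2: extract flattened patches, then apply the same weights as a Linear
patches = F.unfold(x, kernel_size=P, stride=P)     # (B, C*P*P, N)
lin_tokens = patches.transpose(1, 2) @ weight.view(D, -1).T  # (B, N, D)

assert torch.allclose(conv_tokens, lin_tokens, atol=1e-5)
```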

4.3 [CLS] Token and Classification Head

4.3.1 [CLS] Token

Borrowed from BERT, a learnable special token $\mathbf{z}_0^0 = \mathbf{x}_{\text{class}}$ is prepended to the patch embedding sequence.

$$\mathbf{z}_0 = [\mathbf{x}_{\text{class}};\, \mathbf{z}_0^1;\, \mathbf{z}_0^2;\, \ldots;\, \mathbf{z}_0^N]$$

This [CLS] token interacts with all patch tokens through the Transformer's Self-Attention, learning a global representation of the entire image. The output $\mathbf{z}_L^0$ of this [CLS] token from the final layer of the Transformer Encoder serves as the vector representing the entire image.

4.3.2 Classification Head

During pre-training, an MLP Head with one hidden layer is attached to the [CLS] token output. During fine-tuning, only a single Linear Layer is used.

$$\hat{y} = \text{Linear}(\text{LN}(\mathbf{z}_L^0))$$

The paper also experimented with Global Average Pooling (GAP) of the patch tokens instead of the [CLS] token, reporting similar performance. The [CLS] token was adopted as the default to maintain consistency with the NLP Transformer.
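A minimal sketch of the two pooling options, with a random tensor standing in for the encoder output:

```python
import torch

# Encoder output for a batch of 2 images: [CLS] token + 196 patch tokens
tokens = torch.randn(2, 197, 768)

cls_repr = tokens[:, 0]               # [CLS]-token pooling (ViT default)
gap_repr = tokens[:, 1:].mean(dim=1)  # Global Average Pooling over patch tokens

print(cls_repr.shape, gap_repr.shape)  # both torch.Size([2, 768])
```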

4.4 Position Embedding: Injecting Positional Information

4.4.1 Why Position Embedding Is Needed

Self-Attention is inherently permutation invariant. Reordering the input tokens simply reorders the output accordingly. However, the spatial location of patches in images carries important information. A patch in the top-left and one in the bottom-right of an image have different spatial meanings.

4.4.2 1D Learnable Position Embedding

ViT uses learnable 1D Position Embeddings $\mathbf{E}_{pos} \in \mathbb{R}^{(N+1) \times D}$, where $N+1$ is the total sequence length including the [CLS] token.

$$\mathbf{z}_0 = [\mathbf{x}_{\text{class}};\, \mathbf{x}_p^1\mathbf{E};\, \mathbf{x}_p^2\mathbf{E};\, \ldots;\, \mathbf{x}_p^N\mathbf{E}] + \mathbf{E}_{pos}$$

The paper conducted comparison experiments with 2D Position Embedding (encoding patch row/column positions separately), finding no significant performance difference between 1D and 2D Position Embeddings. This suggests that ViT can learn 2D spatial structure from 1D Position Embeddings on its own.

In fact, when visualizing the learned Position Embeddings (Figure 7 in the paper), spatially adjacent patches show high cosine similarity in their Position Embeddings, with row and column structures naturally emerging. This is an impressive result demonstrating that the model automatically learned 2D positional relationships from data.

4.5 Transformer Encoder

4.5.1 Architecture

ViT uses the standard Transformer Encoder as-is. Each block has the following structure:

$$\mathbf{z}_\ell' = \text{MSA}(\text{LN}(\mathbf{z}_{\ell-1})) + \mathbf{z}_{\ell-1}, \quad \ell = 1, \ldots, L$$

$$\mathbf{z}_\ell = \text{MLP}(\text{LN}(\mathbf{z}_\ell')) + \mathbf{z}_\ell', \quad \ell = 1, \ldots, L$$

Where:

  • LN: Layer Normalization (Pre-norm style, different from the original Transformer's Post-norm)
  • MSA: Multi-head Self-Attention
  • MLP: Feed-Forward Network
  • Residual Connection: Input is added to each sub-block's output

What distinguishes ViT from the original Transformer is the adoption of Pre-norm. In the original "Attention Is All You Need," Layer Normalization was applied after the sub-layer output (Post-norm), but ViT applies it before the sub-layer input (Pre-norm). Pre-norm is known to provide better training stability.

4.5.2 Multi-head Self-Attention (MSA)

MSA splits the input into hh heads, performs Self-Attention independently, and then combines them.

$$\text{MSA}(\mathbf{z}) = [\text{head}_1;\, \text{head}_2;\, \ldots;\, \text{head}_h] \mathbf{W}^O$$

$$\text{head}_i = \text{Attention}(\mathbf{z}\mathbf{W}_i^Q,\, \mathbf{z}\mathbf{W}_i^K,\, \mathbf{z}\mathbf{W}_i^V)$$

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Each head has dimension $d_k = D / h$. For ViT-Base, where $D = 768$ and $h = 12$, $d_k = 64$.

4.5.3 MLP (Feed-Forward Network)

The MLP consists of two Linear Layers with a GELU activation function.

$$\text{MLP}(\mathbf{z}) = \text{GELU}(\mathbf{z}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2$$

The hidden dimension is typically set to four times the Embedding Dimension $D$ (i.e., $4D$). For ViT-Base, this is $768 \times 4 = 3072$.

The use of GELU (Gaussian Error Linear Unit) instead of the ReLU used in the original Transformer follows BERT's design.


5. Mathematical Analysis: From Patches to Predictions

The entire process can be summarized mathematically as follows.

5.1 Input Processing

Given an image $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$:

Step 1. Patch splitting and flattening:

$$[\mathbf{x}_p^1, \mathbf{x}_p^2, \ldots, \mathbf{x}_p^N], \quad \mathbf{x}_p^i \in \mathbb{R}^{P^2 C}, \quad N = \frac{HW}{P^2}$$

Step 2. Patch Embedding + [CLS] Token + Position Embedding:

$$\mathbf{z}_0 = [\mathbf{x}_{\text{class}};\, \mathbf{x}_p^1\mathbf{E};\, \ldots;\, \mathbf{x}_p^N\mathbf{E}] + \mathbf{E}_{pos}$$

where $\mathbf{E} \in \mathbb{R}^{(P^2 C) \times D}$ and $\mathbf{E}_{pos} \in \mathbb{R}^{(N+1) \times D}$.

5.2 Transformer Encoder

Step 3. Repeat through $L$ Transformer blocks:

$$\mathbf{z}_\ell' = \text{MSA}(\text{LN}(\mathbf{z}_{\ell-1})) + \mathbf{z}_{\ell-1}$$

$$\mathbf{z}_\ell = \text{MLP}(\text{LN}(\mathbf{z}_\ell')) + \mathbf{z}_\ell'$$

5.3 Output

Step 4. Classification via [CLS] token output:

$$\mathbf{y} = \text{LN}(\mathbf{z}_L^0)$$

$$\hat{y} = \text{Softmax}(\mathbf{y}\mathbf{W}_{\text{head}} + \mathbf{b}_{\text{head}})$$

5.4 Computational Complexity Analysis

For sequence length $N$ and embedding dimension $D$, the complexities of the key operations are:

| Operation | Complexity |
| --- | --- |
| Patch Embedding | $O(N \cdot P^2 C \cdot D)$ |
| Self-Attention (QKV generation) | $O(N \cdot D^2)$ |
| Self-Attention (Attention computation) | $O(N^2 \cdot D)$ |
| MLP | $O(N \cdot D^2)$ |
| Total per Transformer Block | $O(N^2 \cdot D + N \cdot D^2)$ |

The $O(N^2)$ complexity in sequence length $N$ is the key bottleneck limiting ViT's resolution scalability. This motivated the introduction of Windowed Attention in subsequent works like Swin Transformer.
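To get a feel for the two terms, the complexities above can be turned into a rough multiply counter. This is a back-of-the-envelope sketch that ignores constant factors, softmax, and LayerNorm; `block_flops` is our own helper, not something from the paper:

```python
def block_flops(n_tokens: int, dim: int, mlp_ratio: int = 4) -> int:
    """Approximate multiply count for one Transformer block."""
    qkv_out = 4 * n_tokens * dim * dim            # Q, K, V + output projections: O(N * D^2)
    attn = 2 * n_tokens * n_tokens * dim          # QK^T scores + weighted sum of V: O(N^2 * D)
    mlp = 2 * n_tokens * dim * (mlp_ratio * dim)  # two Linear layers: O(N * D^2)
    return qkv_out + attn + mlp

# ViT-B at 224px (N = 197, with [CLS]) vs. fine-tuning at 384px (N = 577)
ratio = block_flops(577, 768) / block_flops(197, 768)
print(ratio)  # exceeds the token-count ratio 577/197 ~= 2.93: the N^2 term grows faster
```

At ViT-scale dimensions the $O(N \cdot D^2)$ projections still dominate; the quadratic attention term takes over only as resolution (and hence $N$) grows.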


6. Model Variants: ViT-Base, Large, Huge

Following BERT's design conventions, ViT defines three model sizes. The number after the slash in the model name indicates the patch size (e.g., ViT-B/16 uses 16x16 patches).

| Specification | ViT-Base (ViT-B) | ViT-Large (ViT-L) | ViT-Huge (ViT-H) |
| --- | --- | --- | --- |
| Layers ($L$) | 12 | 24 | 32 |
| Hidden Dim ($D$) | 768 | 1024 | 1280 |
| MLP Dim | 3072 | 4096 | 5120 |
| Attention Heads ($h$) | 12 | 16 | 16 |
| Head Dim ($d_k = D/h$) | 64 | 64 | 80 |
| Parameters | ~86M | ~307M | ~632M |

6.1 Variants by Patch Size

Even with the same architecture, performance and computational cost vary significantly depending on patch size.

| Model | Patch Size | Sequence Length (224x224) | Sequence Length (384x384) |
| --- | --- | --- | --- |
| ViT-B/32 | 32x32 | 49 | 144 |
| ViT-B/16 | 16x16 | 196 | 576 |
| ViT-L/16 | 16x16 | 196 | 576 |
| ViT-H/14 | 14x14 | 256 | 784 |

Smaller patch sizes capture more fine-grained visual information, but the increased sequence length causes computational costs to grow as $O(N^2)$. At 224x224, ViT-H/14's 14x14 patches yield a sequence roughly 5x longer than ViT-B/32's 32x32 patches (256 vs. 49 tokens), which translates to roughly 27x the attention computation.

6.2 Hybrid Model

The paper also experimented with Hybrid models combining CNN and Transformer. Intermediate feature maps from ResNet are used instead of patch embeddings. For example, the stage 4 output (14x14 feature map) of ResNet-50 is treated as 1x1 patches and fed into the Transformer.

$$\mathbf{z}_0^i = \text{ResNet\_feature\_map}_{(i)}\, \mathbf{E}_{\text{hybrid}} + \mathbf{E}_{pos}^i$$

Experimental results showed that the Hybrid model outperformed pure ViT when pre-training data was limited, but pure ViT caught up when data scale was sufficiently large.


7. Training Strategy

7.1 Pre-training

ViT's training strategy follows the NLP paradigm of "pre-training + fine-tuning."

Datasets:

  • ImageNet-1K: Approximately 1.3 million images, 1,000 classes
  • ImageNet-21K: Approximately 14 million images, 21,843 classes
  • JFT-300M: Google's internal dataset, approximately 300 million images, 18,291 classes

Pre-training configuration:

  • Optimizer: Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$)
  • Batch Size: 4,096
  • Weight Decay: 0.1
  • Linear Learning Rate Warmup + Cosine Decay
  • Resolution: 224x224

One of the paper's key findings is that the scale of pre-training data is critical for ViT's performance. With ImageNet-1K alone, ViT falls behind CNNs, but at the JFT-300M scale, it surpasses them.

7.2 Fine-tuning

During fine-tuning, the pre-trained Classification Head is removed and a new Linear Layer matching the target task is attached.

Key technique for high-resolution fine-tuning:

Fine-tuning at a higher resolution (384x384 or 512x512) than pre-training (224x224) improves performance. However, when resolution changes, the number of patches (sequence length) changes, so pre-trained Position Embeddings cannot be used directly.

To solve this, 2D Interpolation is applied. The pre-trained Position Embeddings are restructured into their original 2D grid form, then resized to the new resolution using Bicubic Interpolation.

Example: 224x224 pre-training (14x14 grid) -> 384x384 fine-tuning (24x24 grid)

Pre-trained Position Embedding (14x14 = 196)
     |
     v
Restructure to 2D grid (14 x 14)
     |
     v
Bicubic Interpolation -> (24 x 24)
     |
     v
Flatten back to 1D (24x24 = 576)

This Position Embedding Interpolation has been adopted as a standard technique in virtually all Vision Transformers after ViT.
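A minimal sketch of this interpolation, assuming the [CLS] embedding is stored first in the sequence (the function name is ours, not from the paper):

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Resize ViT Position Embeddings of shape (1, 1 + N, D) to a new patch grid."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]  # keep [CLS] as-is
    old_grid = int(patch_pe.shape[1] ** 0.5)
    dim = patch_pe.shape[-1]
    # (1, N, D) -> (1, D, grid, grid) so F.interpolate sees a 2D "image"
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    # back to (1, new_N, D)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pe, patch_pe], dim=1)

pe_224 = torch.randn(1, 1 + 14 * 14, 768)   # 224/16 -> 14x14 grid
pe_384 = interpolate_pos_embed(pe_224, 24)  # 384/16 -> 24x24 grid
print(pe_384.shape)                         # torch.Size([1, 577, 768])
```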

7.3 Training Cost

| Model | Pre-train Data | TPUv3-core-days |
| --- | --- | --- |
| ViT-B/16 | JFT-300M | Not provided |
| ViT-L/16 | JFT-300M | Not provided |
| ViT-H/14 | JFT-300M | 2,500 |
| BiT-L (ResNet152x4) | JFT-300M | 9,900 |
| Noisy Student (EfficientNet-L2) | JFT-300M + ImageNet | 12,300 |

ViT-H/14 achieved higher performance at approximately 1/4 the training cost of BiT-L and 1/5 that of Noisy Student. This demonstrates the Transformer's excellent scaling efficiency.


8. Experimental Results

8.1 Main Benchmark Results

The key results reported in Table 2 of the paper are summarized below.

| Model | Pre-train | ImageNet | ImageNet-ReaL | CIFAR-10 | CIFAR-100 | Oxford Pets | Oxford Flowers |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-H/14 | JFT-300M | 88.55 | 90.72 | 99.50 | 94.55 | 97.56 | 99.68 |
| ViT-L/16 | JFT-300M | 87.76 | 90.54 | 99.42 | 93.90 | 97.32 | 99.74 |
| ViT-L/16 | ImageNet-21K | 85.30 | 88.62 | 99.15 | 93.25 | 94.67 | 99.61 |
| BiT-L (ResNet152x4) | JFT-300M | 87.54 | 90.54 | 99.37 | 93.51 | 96.62 | 99.63 |
| Noisy Student | JFT-300M | 88.4 | 90.55 | - | - | - | - |

8.2 VTAB Benchmark

VTAB (Visual Task Adaptation Benchmark) categorizes 19 diverse vision tasks into Natural, Specialized, and Structured categories. It evaluates model generalization ability using only 1,000 training samples per task.

| Model | Natural | Specialized | Structured | Overall |
| --- | --- | --- | --- | --- |
| ViT-H/14 (JFT) | 79.39 | 84.23 | 69.27 | 77.63 |
| ViT-L/16 (JFT) | 76.28 | 83.36 | 64.72 | 74.78 |
| BiT-L (JFT) | 76.29 | 84.92 | 66.51 | 75.90 |

ViT-H/14 achieved the best performance in the Natural and Structured categories and was nearly on par with BiT-L in the Specialized category. The overall VTAB score of 77.63 was the highest at the time.

8.3 Relationship Between Pre-training Data Scale and Performance

One of the paper's most important experiments analyzed performance changes with pre-training data scale (Figures 3, 4).

| Pre-training Data | ViT-L/16 ImageNet Acc | BiT-L ImageNet Acc | Winner |
| --- | --- | --- | --- |
| ImageNet-1K (~1.3M) | ~76.5% (scratch) | ~80% (scratch) | BiT (CNN) |
| ImageNet-21K (~14M) | 85.30% | 84.02% | ViT |
| JFT-300M (~303M) | 87.76% | 87.54% | ViT |

These results reveal a clear pattern:

  • Small-scale data: CNN's inductive biases (Locality, Translation Equivariance) work advantageously, giving CNNs the edge
  • Large-scale data: Transformers learn these patterns directly from data, surpassing CNNs

9. Key Findings and Insights

9.1 The Double-Edged Sword of Inductive Bias

CNN's inductive biases -- Locality and Translation Equivariance -- serve as effective regularization with small-scale data, guiding the model to learn correct features even with limited data.

However, when data is sufficiently abundant, these inductive biases become shackles that limit the model's representational capacity. Transformers, without special structural assumptions, can learn more general and flexible patterns including Locality and Translation Equivariance through Self-Attention.

This reaffirms a longstanding lesson in AI: given sufficient data, a more general (less biased) model will outperform a more specialized (more biased) model.

9.2 Attention Map Visualization

9.2.1 Position Embedding Similarity

Visualizing the cosine similarity of learned Position Embeddings (Figure 7, left), each patch position's embedding shows high similarity with spatially adjacent patches. Furthermore, distinct similarity patterns appear between patches in the same row or column.

This proves that despite using only 1D Position Embeddings, the model automatically learned 2D spatial structure.

9.2.2 Attention Distance

Figure 7 (right) analyzes the mean Attention Distance per Attention Head across Transformer layers. Attention Distance is the average pixel distance between Query-Key patches, weighted by Attention Weights.

Key findings:

  • Lower Layers: Some heads attend to adjacent patches, others to distant patches -- local and global patterns coexist, similar to early CNN layers
  • Higher Layers: Most heads distribute attention across wide ranges -- integrating global information

This demonstrates that ViT is fundamentally different from CNNs in that it can leverage global information from the very first layer. In CNNs, the receptive field gradually expands across layers, but in ViT, attention over the entire image is possible from the first layer.

9.3 Representation Quality Analysis

In Linear Probing experiments (training only a Linear Classifier on frozen features), ViT showed relatively lower performance compared to CNNs, but achieved higher performance in fine-tuning. This suggests ViT learns a different type of feature representation -- ViT's features may contain richer information that is activated through fine-tuning.


10. PyTorch Core Implementation

Below is a PyTorch implementation of ViT's core structure, faithfully reflecting the paper's implementation while being written for clarity.

10.1 Patch Embedding

import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Splits images into patches and embeds them via Linear Projection."""

    def __init__(
        self,
        img_size: int = 224,
        patch_size: int = 16,
        in_channels: int = 3,
        embed_dim: int = 768,
    ):
        super().__init__()
        self.img_size = img_size
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2

        # Patch Embedding using Conv2d
        # Setting stride = patch_size makes it equivalent to Linear Projection
        self.projection = nn.Conv2d(
            in_channels,
            embed_dim,
            kernel_size=patch_size,
            stride=patch_size,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        # projection output: (B, embed_dim, H/P, W/P)
        x = self.projection(x)
        # flatten spatial dims and transpose: (B, num_patches, embed_dim)
        x = x.flatten(2).transpose(1, 2)
        return x

10.2 Multi-head Self-Attention

class MultiHeadSelfAttention(nn.Module):
    """Multi-Head Self-Attention module."""

    def __init__(self, embed_dim: int = 768, num_heads: int = 12, dropout: float = 0.0):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5  # 1/sqrt(d_k)

        # Generate Q, K, V all at once
        self.qkv = nn.Linear(embed_dim, embed_dim * 3)
        self.proj = nn.Linear(embed_dim, embed_dim)
        self.attn_dropout = nn.Dropout(dropout)
        self.proj_dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape

        # QKV generation: (B, N, 3*D) -> (B, N, 3, num_heads, head_dim)
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)  # (3, B, num_heads, N, head_dim)
        q, k, v = qkv.unbind(0)  # each (B, num_heads, N, head_dim)

        # Scaled Dot-Product Attention
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, num_heads, N, N)
        attn = attn.softmax(dim=-1)
        attn = self.attn_dropout(attn)

        # Weighted sum with Values
        x = (attn @ v).transpose(1, 2).reshape(B, N, D)  # (B, N, D)
        x = self.proj(x)
        x = self.proj_dropout(x)
        return x

10.3 Transformer Encoder Block

class TransformerBlock(nn.Module):
    """ViT Transformer Encoder Block (Pre-norm)."""

    def __init__(
        self,
        embed_dim: int = 768,
        num_heads: int = 12,
        mlp_ratio: float = 4.0,
        dropout: float = 0.0,
    ):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = MultiHeadSelfAttention(embed_dim, num_heads, dropout)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, int(embed_dim * mlp_ratio)),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(int(embed_dim * mlp_ratio), embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-norm + MSA + Residual
        x = x + self.attn(self.norm1(x))
        # Pre-norm + MLP + Residual
        x = x + self.mlp(self.norm2(x))
        return x

10.4 Full ViT Model

class VisionTransformer(nn.Module):
    """Vision Transformer (ViT) full model."""

    def __init__(
        self,
        img_size: int = 224,
        patch_size: int = 16,
        in_channels: int = 3,
        num_classes: int = 1000,
        embed_dim: int = 768,
        depth: int = 12,
        num_heads: int = 12,
        mlp_ratio: float = 4.0,
        dropout: float = 0.0,
    ):
        super().__init__()
        self.patch_embed = PatchEmbedding(img_size, patch_size, in_channels, embed_dim)
        num_patches = self.patch_embed.num_patches

        # Learnable [CLS] Token
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learnable Position Embedding ([CLS] + patches)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        self.pos_dropout = nn.Dropout(dropout)

        # Transformer Encoder
        self.blocks = nn.Sequential(
            *[
                TransformerBlock(embed_dim, num_heads, mlp_ratio, dropout)
                for _ in range(depth)
            ]
        )
        self.norm = nn.LayerNorm(embed_dim)

        # Classification Head
        self.head = nn.Linear(embed_dim, num_classes)

        # Initialization
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        nn.init.trunc_normal_(self.cls_token, std=0.02)
        self.apply(self._init_weights)

    def _init_weights(self, m: nn.Module):
        if isinstance(m, nn.Linear):
            nn.init.trunc_normal_(m.weight, std=0.02)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, nn.LayerNorm):
            nn.init.ones_(m.weight)
            nn.init.zeros_(m.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B = x.shape[0]

        # Step 1: Patch Embedding
        x = self.patch_embed(x)  # (B, num_patches, embed_dim)

        # Step 2: [CLS] Token Prepend
        cls_tokens = self.cls_token.expand(B, -1, -1)  # (B, 1, embed_dim)
        x = torch.cat([cls_tokens, x], dim=1)  # (B, num_patches + 1, embed_dim)

        # Step 3: Position Embedding
        x = x + self.pos_embed
        x = self.pos_dropout(x)

        # Step 4: Transformer Encoder
        x = self.blocks(x)
        x = self.norm(x)

        # Step 5: Classification via [CLS] Token
        cls_output = x[:, 0]  # (B, embed_dim)
        logits = self.head(cls_output)  # (B, num_classes)
        return logits


# Model variant creation functions
def vit_base_patch16_224(**kwargs):
    return VisionTransformer(
        img_size=224, patch_size=16, embed_dim=768,
        depth=12, num_heads=12, **kwargs,
    )

def vit_large_patch16_224(**kwargs):
    return VisionTransformer(
        img_size=224, patch_size=16, embed_dim=1024,
        depth=24, num_heads=16, **kwargs,
    )

def vit_huge_patch14_224(**kwargs):
    return VisionTransformer(
        img_size=224, patch_size=14, embed_dim=1280,
        depth=32, num_heads=16, **kwargs,
    )

10.5 Usage Example

# Create ViT-Base/16 model
model = vit_base_patch16_224(num_classes=1000)

# Check parameter count
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")
# Output: Total parameters: 86,567,656

# Inference
dummy_input = torch.randn(1, 3, 224, 224)
output = model(dummy_input)
print(f"Output shape: {output.shape}")
# Output: Output shape: torch.Size([1, 1000])

11. Comprehensive Survey of Follow-up Research

ViT became the starting point for the massive research wave of Vision Transformers. The major follow-up works are summarized chronologically below.

11.1 DeiT: Data-efficient Image Transformers (2020.12)

Paper: "Training data-efficient image transformers & distillation through attention" (Touvron et al., Facebook AI)

Key Contribution: Overcame ViT's greatest limitation -- dependence on large-scale data. Presented methods to effectively train ViT using only ImageNet-1K.

Key Techniques:

  • Knowledge Distillation Token: Added a separate Distillation Token alongside the [CLS] token to effectively transfer knowledge from a teacher model (CNN)
  • Hard Distillation: Trained the student using the teacher's hard labels (argmax predictions)
  • Interestingly, CNN teachers were more effective than Transformer teachers -- CNN's inductive biases were transferred to the Transformer through distillation

Results:

  • DeiT-B: 83.1% top-1 accuracy on ImageNet-1K (without external data)
  • DeiT-B distilled: 85.2% (using RegNetY-16GF teacher)
  • Surpassed ViT-B/16's JFT-300M pre-trained 84.15% without external data

11.2 Swin Transformer (2021.03)

Paper: "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" (Liu et al., Microsoft Research)

Key Contribution: Simultaneously resolved two fundamental limitations of ViT -- $O(N^2)$ attention complexity and single-scale features.

Key Techniques:

  • Window-based Self-Attention: Splits the image into fixed-size windows (7x7 patches) and performs Self-Attention only within each window. Complexity drops from $O(N^2)$ to $O(N)$ (linear with respect to image size)
  • Shifted Window: Shifts windows in consecutive layers to enable information exchange between windows
  • Hierarchical Feature Map: Merges patches (Patch Merging) as layers deepen, halving resolution and doubling channels. Provides multi-scale features similar to CNN's Feature Pyramid

Results:

  • Swin-L: 87.3% top-1 accuracy on ImageNet-1K (ImageNet-22K pre-training)
  • COCO Object Detection: 58.7 box AP, 51.1 mask AP
  • ADE20K Semantic Segmentation: 53.5 mIoU

Swin Transformer became a more versatile backbone than ViT for dense prediction tasks such as Object Detection and Semantic Segmentation.

11.3 BEiT: BERT Pre-Training of Image Transformers (2021.06)

Paper: "BEiT: BERT Pre-Training of Image Transformers" (Bao et al., Microsoft Research)

Key Contribution: Successfully applied NLP's BERT-style Masked Language Modeling to Vision for the first time.

Key Techniques:

  • Masked Image Modeling (MIM): Masks a portion (approximately 40%) of image patches and predicts the Visual Tokens of masked patches
  • Visual Tokenizer: Uses a dVAE (discrete Variational Autoencoder) to convert images into discrete Visual Tokens -- corresponding to NLP's vocabulary
  • Two views: Original image patches (input) and Visual Tokens (prediction target)
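The masking step can be sketched in a few lines of NumPy. The ~40% ratio is from the paper; BEiT actually uses blockwise masking, which this uniform-random sketch simplifies away:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 196                                   # 14x14 patch grid
n_mask = int(0.4 * N)                     # ~40% of patches -> 78

mask = np.zeros(N, dtype=bool)
mask[rng.choice(N, size=n_mask, replace=False)] = True
# Input: patch embeddings, with masked positions replaced by a learned
# [MASK] embedding. Target: the dVAE visual-token id at each masked position.
print(int(mask.sum()))                    # 78
```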

Results:

  • BEiT-B: 83.2% on ImageNet-1K (self-supervised pre-training + fine-tuning)
  • BEiT-L: 86.3% on ImageNet-1K (using only ImageNet-1K data)
  • +1.4% improvement over DeiT-B's 81.8% (same Base model size)

11.4 MAE: Masked Autoencoders (2021.11)

Paper: "Masked Autoencoders Are Scalable Vision Learners" (He et al., Facebook AI Research)

Key Contribution: Maximized the efficiency and scalability of self-supervised vision pre-training. Presented a simple yet powerful pre-training framework.

Key Techniques:

  • High masking ratio: Masks 75% of input patches -- much higher than NLP (15%). This is due to the high information redundancy in images
  • Asymmetric Encoder-Decoder: Encoder processes only unmasked patches (25% of total), maximizing efficiency. A lightweight decoder reconstructs masked pixels
  • Pixel-level reconstruction: Unlike BEiT, directly reconstructs raw pixel values of masked patches without a Visual Tokenizer
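The asymmetry comes down to one gather operation: the encoder never sees the masked patches at all. A NumPy sketch of that step, with shapes assuming ViT-B's 196 patches of dimension 768 (variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 196, 768                     # ViT-B: 14x14 = 196 patches, dim 768
patches = rng.normal(size=(N, D))   # embedded patches for one image
mask_ratio = 0.75

n_keep = int(N * (1 - mask_ratio))  # 49 visible patches
perm = rng.permutation(N)
keep_idx = perm[:n_keep]            # indices of visible (unmasked) patches

visible = patches[keep_idx]         # the encoder sees ONLY these 25% of patches
print(visible.shape)                # (49, 768)
# The lightweight decoder later re-inserts learned [MASK] tokens at the
# remaining positions and reconstructs their raw pixel values.
```

Since Self-Attention cost is quadratic in sequence length, running the heavy encoder on 49 instead of 196 tokens is what buys the 3x+ training speedup.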

Results:

  • MAE (ViT-H): 87.8% on ImageNet-1K (using only ImageNet-1K data)
  • Training efficiency: Encoder processes only 25% of total patches, reducing training time by more than 3x
  • Far superior to training ViT from scratch without pre-training

11.5 DINO / DINOv2 (2021.04 / 2023.04)

DINO Paper: "Emerging Properties in Self-Supervised Vision Transformers" (Caron et al., Facebook AI Research)

DINOv2 Paper: "DINOv2: Learning Robust Visual Features without Supervision" (Oquab et al., Meta AI)

DINO Key Techniques:

  • Self-Distillation: Teacher and Student share the same network architecture, with the Teacher being an Exponential Moving Average (EMA) of the Student
  • Multi-crop strategy: Cross-feeds Global Views (full image) and Local Views (small crops)
  • Discovery: Self-supervised ViT's Self-Attention Maps acquire object segmentation capability without explicit training
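The EMA teacher update is the core of the self-distillation loop; a toy NumPy sketch, where the "weights" are stand-in arrays and the momentum value is a typical DINO setting:

```python
import numpy as np

rng = np.random.default_rng(0)
student = rng.normal(size=(4, 4))   # stand-in for student weights
teacher = student.copy()            # teacher starts as a copy of the student
m = 0.996                           # momentum coefficient (typical DINO value)

for _ in range(10):                 # pretend optimization steps
    # fake SGD update on the student
    student = student + 0.01 * rng.normal(size=student.shape)
    # EMA update of the teacher -- no gradients ever flow into it
    teacher = m * teacher + (1 - m) * student

# the teacher lags behind but smoothly tracks the student
gap = float(np.abs(teacher - student).mean())
```

The student is trained to match the (centered, sharpened) outputs of this slowly-moving teacher on different crops of the same image, which is what "self-distillation" means here.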

DINOv2 Key Techniques:

  • Automated pipeline for building a large-scale curated dataset (LVD-142M)
  • Trains a 1B parameter teacher model, then distills to smaller student models via Knowledge Distillation
  • Learns universal visual features without text, without labels

Results:

  • DINO (ViT-B): 80.1% on ImageNet Linear Probing
  • DINOv2: Universal Visual Features surpassing OpenCLIP on most benchmarks

11.6 EVA / EVA-02 (2022 / 2023)

EVA Paper: "EVA: Exploring the Limits of Masked Visual Representation Learning at Scale" (Fang et al., BAAI)

EVA-02 Paper: "EVA-02: A Visual Representation for Neon Genesis" (Fang et al., BAAI)

Key Techniques:

  • Uses Masked Image Modeling + CLIP Features as reconstruction targets
  • Learns language-aligned visual features, strong in vision-language tasks
  • Presents techniques for efficiently training large-scale ViTs

Results:

  • EVA (ViT-g): 89.6% on ImageNet (336x336)
  • EVA-02: Achieved 90.0% on ImageNet with 304M parameters (using only public data)
  • EVA-02-CLIP: Zero-shot ImageNet 80.4% (1/6 the parameters of the previous best CLIP)

11.7 ConvNeXt: The CNN Strikes Back (2022.01)

Paper: "A ConvNet for the 2020s" (Liu et al., Facebook AI Research / UC Berkeley)

Key Contribution: Proved that a pure CNN, when Transformer design principles are applied to it systematically, can match Transformer performance -- a compelling rebuttal to the assumption that CNNs are inherently inferior.

Transformer design elements applied to ResNet:

  1. Macro Design: Adjusted the stage compute ratio (ResNet-50's 3:4:6:3 -> 3:3:9:3, matching Swin-T's 1:1:3:1)
  2. Replaced Stem with Patchify (4x4 Conv, stride 4)
  3. ResNeXt-style Grouped Convolution -> Depthwise Convolution
  4. Inverted Bottleneck (MobileNetV2 style)
  5. 7x7 Large Kernel (corresponding to Swin Transformer's 7x7 window)
  6. BN -> LayerNorm, ReLU -> GELU and other activation/normalization changes
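Several of these elements combine into the ConvNeXt block: 7x7 depthwise conv -> LayerNorm -> 1x1 expansion to 4C (inverted bottleneck) -> GELU -> 1x1 projection -> residual. A toy NumPy sketch with tiny sizes and naive loops (the real implementation is in PyTorch; names here are mine):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def convnext_block(x, dw, w1, b1, w2, b2):
    """x: (C, H, W). dw: (C, 7, 7) depthwise filters.
    w1: (C, 4C) expand, w2: (4C, C) project."""
    C, H, W = x.shape
    # 7x7 depthwise conv, 'same' padding: one filter per channel
    xp = np.pad(x, ((0, 0), (3, 3), (3, 3)))
    y = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            y[:, i, j] = (xp[:, i:i+7, j:j+7] * dw).sum(axis=(1, 2))
    # LayerNorm over channels at each spatial position
    y = (y - y.mean(axis=0)) / np.sqrt(y.var(axis=0) + 1e-6)
    # pointwise MLP with 4x expansion: the "inverted bottleneck"
    t = np.einsum('chw,cd->dhw', y, w1) + b1[:, None, None]   # C -> 4C
    t = gelu(t)
    t = np.einsum('dhw,dc->chw', t, w2) + b2[:, None, None]   # 4C -> C
    return x + t                                              # residual

rng = np.random.default_rng(0)
C = 8
x = rng.normal(size=(C, 14, 14))
out = convnext_block(x,
                     rng.normal(size=(C, 7, 7)) * 0.1,
                     rng.normal(size=(C, 4 * C)) * 0.1, np.zeros(4 * C),
                     rng.normal(size=(4 * C, C)) * 0.1, np.zeros(C))
print(out.shape)  # (8, 14, 14)
```

Note how the structure mirrors a Transformer block: the depthwise conv plays the role of (local) token mixing, and the 4x pointwise MLP mirrors the Transformer's FFN.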

Results:

  • ConvNeXt-B: 85.1% on ImageNet-1K (+0.6% over Swin-B's 84.5%, 12.5% faster inference)
  • ConvNeXt-L: 87.8% with ImageNet-22K pre-training
  • Matched or surpassed Swin Transformer not only in performance but also in throughput

12. ViT vs CNN vs Hybrid Comparison

12.1 Comprehensive Comparison Table

| Characteristic | CNN (ResNet family) | ViT (Pure Transformer) | Hybrid (CNN + Transformer) |
| --- | --- | --- | --- |
| Inductive Bias | Strong (Locality, Translation Equivariance) | Nearly none | Moderate (some from CNN) |
| Small-scale Data Performance | Excellent | Inferior | Excellent |
| Large-scale Data Performance | Good | Best | Very good |
| Complexity (vs. Resolution) | O(N) | O(N²) | O(N) ~ O(N²) |
| Multi-scale Features | Natural (Feature Pyramid) | Absent (Single-scale) | Varies |
| Global Receptive Field | Requires stacking layers | Available from first layer | Available after CNN |
| Dense Prediction Suitability | High | Low (post-processing needed) | Medium to high |
| Efficiency (Performance/FLOPs) | Good | Best at large scale | Good |
| Implementation Maturity | Very high | Rapidly maturing | Medium |
| Representative Models | ResNet, EfficientNet, ConvNeXt | ViT, DeiT, BEiT | Swin Transformer, CoAtNet |
Recommended architecture by task:

| Task | Recommended Architecture | Rationale |
| --- | --- | --- |
| Image Classification (Large-scale) | ViT + MAE/DINO pre-training | Best performance at scale |
| Image Classification (Small-scale) | DeiT (Distillation) or ConvNeXt | Data efficiency |
| Object Detection | Swin Transformer + FPN family | Multi-scale features required |
| Semantic Segmentation | Swin / SegFormer / DINOv2 | Suited for dense prediction |
| Vision-Language | ViT + CLIP-style pre-training | Language-aligned features |
| Edge/Mobile Deployment | EfficientNet / MobileViT | Lightweight models required |
| Self-supervised Pre-training | MAE / DINOv2 | No labels needed, scalable |

13. The Future of Computer Vision: Foundation Models

13.1 The Emergence of Vision Foundation Models

The ultimate outcome of the paradigm shift triggered by ViT is the emergence of Vision Foundation Models. Just as foundation models like GPT-3 and GPT-4 in NLP handle diverse tasks with a single model, the same trend is underway in Vision.

Major Vision Foundation Models:

  • SAM (Segment Anything Model): Based on ViT-H, handles all types of segmentation with a single model
  • DINOv2: Self-supervised ViT, a universal Visual Feature Extractor
  • CLIP/SigLIP: Vision-Language alignment, Zero-shot Classification and Retrieval
  • Florence/Intern: Large-scale multi-task Vision-Language models

13.2 Future Research Directions

Efficiency Improvements:

  • Mitigating the O(N²) attention cost with FlashAttention (IO-aware exact attention), Linear Attention variants, etc.
  • Removing unnecessary patches with Token Pruning/Merging
  • Creating lightweight models through Knowledge Distillation

Learning Paradigms:

  • Proliferation of Self-supervised Pre-training (MAE, DINO families)
  • Vision-Language Alignment (CLIP family)
  • Reinforcement Learning-based visual decision-making (VLM + RL)

Architectural Innovation:

  • Applying Mamba / State Space Models to Vision (Vision Mamba, VMamba)
  • Efficient scaling using Mixture of Experts (MoE)
  • Continued research on hybrid architectures combining CNN and Transformer strengths

13.3 Historical Significance of ViT

The most important lesson ViT left behind is the universality of architecture. The fact that a single architecture (Transformer) can be applied to text, images, audio, video, code, and every modality is a truly remarkable event in AI history.

Before ViT, NLP and Vision had completely different architectural ecosystems. After ViT, the Transformer established itself as a true Universal Architecture, which is the technical foundation enabling today's Multimodal Foundation Models (GPT-4V, Gemini, Claude, etc.).

"An Image is Worth 16x16 Words" -- this title was not merely a metaphor, but a profound declaration that Vision and Language can be unified within the same framework.


14. References

  1. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021. arXiv:2010.11929

  2. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017. arXiv:1706.03762

  3. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-efficient image transformers & distillation through attention (DeiT). ICML 2021. arXiv:2012.12877

  4. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., ... & Guo, B. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV 2021. arXiv:2103.14030

  5. Bao, H., Dong, L., Piao, S., & Wei, F. (2021). BEiT: BERT Pre-Training of Image Transformers. ICLR 2022. arXiv:2106.08254

  6. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners (MAE). CVPR 2022. arXiv:2111.06377

  7. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging Properties in Self-Supervised Vision Transformers (DINO). ICCV 2021. arXiv:2104.14294

  8. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., ... & Bojanowski, P. (2023). DINOv2: Learning Robust Visual Features without Supervision. arXiv:2304.07193

  9. Fang, Y., Wang, W., Xie, B., Sun, Q., Wu, L., Wang, X., ... & Cao, Y. (2023). EVA: Exploring the Limits of Masked Visual Representation Learning at Scale. CVPR 2023. arXiv:2211.07636

  10. Fang, Y., Sun, Q., Wang, X., Huang, T., Wang, X., & Cao, Y. (2023). EVA-02: A Visual Representation for Neon Genesis. arXiv:2303.11331

  11. Liu, Z., Mao, H., Wu, C. Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A ConvNet for the 2020s (ConvNeXt). CVPR 2022. arXiv:2201.03545

  12. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019. arXiv:1810.04805

  13. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition (ResNet). CVPR 2016. arXiv:1512.03385

  14. Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., & Houlsby, N. (2020). Big Transfer (BiT): General Visual Representation Learning. ECCV 2020. arXiv:1912.11370

  15. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). ICML 2021. arXiv:2103.00020

  16. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., ... & Girshick, R. (2023). Segment Anything (SAM). ICCV 2023. arXiv:2304.02643