- 1. Paper Overview
- 2. Background: Computer Vision Dominated by CNNs
- 3. Core Idea: Converting Images into Patch Sequences
- 4. Detailed Architecture Analysis
- 5. Mathematical Analysis: From Patches to Predictions
- 6. Model Variants: ViT-Base, Large, Huge
- 7. Training Strategy
- 8. Experimental Results
- 9. Key Findings and Insights
- 10. PyTorch Core Implementation
- 11. Comprehensive Survey of Follow-up Research
- 12. ViT vs CNN vs Hybrid Comparison
- 13. The Future of Computer Vision: Foundation Models
- 14. References
1. Paper Overview
"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" was published in October 2020 by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, and others from Google Research, and was officially accepted at ICLR 2021. This paper demonstrated that the Transformer architecture, which achieved overwhelming success in NLP, could be applied directly to the image classification task in Computer Vision with virtually no modifications.
The core idea is remarkably simple: divide an image into fixed-size patches, treat each patch as a token (analogous to tokens in NLP), and feed them into a standard Transformer Encoder. The paper's title itself intuitively captures this idea -- an image is made up of 16x16-sized words.
The results were groundbreaking. The ViT-H/14 model, pre-trained on the large-scale JFT-300M dataset, achieved 88.55% top-1 accuracy on ImageNet, surpassing the existing CNN-based SOTA (State-of-the-Art) models BiT (Big Transfer) and Noisy Student. More importantly, the training cost required to achieve this was significantly lower than existing methods.
This paper triggered a paradigm shift in the field of Computer Vision. After ViT, numerous follow-up studies emerged -- DeiT, Swin Transformer, BEiT, MAE, DINO, and many more -- giving birth to an entirely new research field called "Vision Transformer."
2. Background: Computer Vision Dominated by CNNs
2.1 The Golden Age of Convolutional Neural Networks
After AlexNet's ImageNet Challenge victory in 2012, CNNs were effectively the only architecture in Computer Vision. The lineage of VGGNet (2014), GoogLeNet/Inception (2014), ResNet (2015), DenseNet (2017), and EfficientNet (2019) all shared convolution operations as their core.
The fundamental reasons CNNs succeeded in vision tasks lie in two inductive biases.
First, Locality. Convolution filters process only local regions of the input. This architecturally encodes the fact that adjacent pixels in images have strong correlations.
Second, Translation Equivariance. The same filter is shared and applied across all positions in the image, enabling the extraction of identical features regardless of where an object is located.
Thanks to these two biases, CNNs could effectively learn visual patterns even with relatively small amounts of data.
2.2 The Transformer Revolution in NLP
After the publication of "Attention Is All You Need" in 2017, the NLP field was completely transformed. Transformer-based models including BERT (2018), GPT-2 (2019), T5 (2019), and GPT-3 (2020) dominated virtually every NLP benchmark.
The key strengths of the Transformer are:
- Global Receptive Field through Self-Attention: Direct interaction between any two positions in the sequence is possible
- Excellent Parallelism: Unlike RNNs, all positions in the sequence can be processed simultaneously
- Outstanding Scalability: Performance steadily improves as model size and data increase, exhibiting Scaling Laws
2.3 Previous Attempts to Apply Transformers to Vision
Prior to ViT, there were attempts to introduce attention mechanisms into Vision:
- Non-local Neural Networks (Wang et al., 2018): Inserted Self-Attention blocks within CNNs to capture long-range dependencies
- Stand-Alone Self-Attention (Ramachandran et al., 2019): Replaced convolutions with local self-attention
- DETR (Carion et al., 2020): Leveraged Transformer Decoder for object detection
However, all of these either hybridized CNN and Attention or used specially designed attention mechanisms. ViT's innovation lies in directly applying the standard Transformer to Vision with minimal modifications, without such compromises.
3. Core Idea: Converting Images into Patch Sequences
3.1 Why Images Cannot Be Fed Directly into Transformers
The standard Transformer takes a 1D token sequence as input. Suppose we feed an image pixel by pixel into a Transformer. A 224x224 resolution image has 50,176 pixels. Since the computational complexity of Self-Attention is O(N^2) with respect to sequence length N, Self-Attention over 50,176 tokens would require approximately 2.5 billion pairwise interactions. This is practically infeasible.
3.2 Reducing Sequence Length by Splitting into Patches
ViT's solution is elegant yet intuitive: divide the image into fixed-size patches and treat each patch as a single token.
When splitting a 224x224 image into 16x16 patches, the number of patches is N = (224/16)^2 = 14 x 14 = 196.
Instead of 50,176 pixels, only 196 patch tokens need to be processed. Self-Attention complexity becomes O(196^2), a reduction of approximately 65,000x (256^2 = 65,536) compared to the pixel-level approach.
The key premise of this transformation is: a single 16x16 patch can function as a semantic unit equivalent to a single word in NLP. Just as natural language is a sequence of discrete tokens called words, images can be represented as a sequence of visual tokens called patches.
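As a quick sanity check on these numbers, the patch-count and attention-cost arithmetic can be reproduced in a few lines (an illustrative sketch, not code from the paper):

```python
# Sequence lengths and self-attention pair counts for pixel vs. patch tokens
img_size = 224
patch_size = 16

num_pixels = img_size * img_size                # 50,176 tokens if one per pixel
num_patches = (img_size // patch_size) ** 2     # (224/16)^2 = 196 patch tokens

# Self-attention pairwise interactions scale as N^2
pixel_pairs = num_pixels ** 2                   # ~2.5 billion
patch_pairs = num_patches ** 2                  # 38,416

print(num_pixels, num_patches)                  # 50176 196
print(pixel_pairs // patch_pairs)               # 65536, the ~65,000x reduction
```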
3.3 Correspondence with NLP
| NLP | Vision (ViT) |
|---|---|
| Sentence | Image |
| Word/Token | Image Patch |
| Vocabulary | Space of possible patch patterns |
| Token Embedding | Patch Embedding (Linear Projection) |
| Position Embedding | Position Embedding (1D Learnable) |
| [CLS] Token | [CLS] Token |
| Transformer Encoder | Transformer Encoder (identical) |
This correspondence is essentially all there is to ViT. The essence of ViT is bringing the proven NLP Transformer structure over with minimal changes.
4. Detailed Architecture Analysis
4.1 Overall Pipeline Overview
The complete processing pipeline of ViT is as follows:
- Split the input image into fixed-size patches
- Flatten each patch into a 1D vector
- Transform patch vectors into D-dimensional embeddings via Linear Projection
- Prepend a learnable [CLS] token to the sequence
- Add Position Embeddings to inject positional information
- Feed into a standard Transformer Encoder
- Perform classification using the final output of the [CLS] token
Input Image (224x224x3)
|
v
Patch Split (196 patches of 16x16x3)
|
v
Flatten (196 vectors of 768 dimensions)
|
v
Linear Projection (196 D-dimensional Embeddings)
|
v
[CLS] Token Prepend (197 D-dimensional vectors)
|
v
Add Position Embedding (197 D-dimensional vectors)
|
v
Transformer Encoder (L blocks)
|
v
Extract [CLS] Token Output
|
v
Classification Head (MLP) -> Class Prediction
4.2 Patch Embedding: Turning Images into Tokens
4.2.1 Patch Splitting and Flattening
The input image x ∈ R^(H x W x C) is split into N = HW/P^2 patches of size P x P. Each patch is flattened into a vector of dimension P^2 · C.
For example, splitting a 224x224x3 (RGB) image into 16x16 patches:
- Number of patches: N = (224/16)^2 = 196
- Dimension of each flattened patch vector: 16 x 16 x 3 = 768
4.2.2 Linear Projection
A learnable Linear Projection is applied to each flattened patch vector, mapping it to a D-dimensional embedding space.
Here D is the Transformer's Hidden Dimension, and for ViT-Base D = 768. Since the flattened patch dimension is also 16 x 16 x 3 = 768, the Linear Projection becomes a 768x768 matrix.
An interesting point is that this Linear Projection is mathematically equivalent to a convolution with stride equal to the patch size. That is, it is identical to Conv2d(in_channels=3, out_channels=D, kernel_size=P, stride=P). In practice, implementations typically use convolution as it is more efficient.
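This equivalence is easy to verify numerically. The sketch below (illustrative, not the paper's code) checks that a stride-P Conv2d produces the same output as explicitly flattening each P x P patch and multiplying by the flattened convolution kernel:

```python
import torch
import torch.nn as nn

P, D = 16, 768
conv = nn.Conv2d(3, D, kernel_size=P, stride=P)

x = torch.randn(1, 3, 224, 224)
out_conv = conv(x).flatten(2).transpose(1, 2)          # (1, 196, 768)

# Same computation via explicit patch extraction + matmul with the conv kernel
patches = x.unfold(2, P, P).unfold(3, P, P)            # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 196, 3 * P * P)
w = conv.weight.reshape(D, 3 * P * P)                  # flatten the kernel
out_lin = patches @ w.T + conv.bias

print(torch.allclose(out_conv, out_lin, atol=1e-4))    # True
```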
4.3 [CLS] Token and Classification Head
4.3.1 [CLS] Token
Borrowed from BERT, a learnable special token is prepended to the patch embedding sequence.
This [CLS] token interacts with all patch tokens through the Transformer's Self-Attention, learning a global representation of the entire image. The output of this [CLS] token from the final layer of the Transformer Encoder serves as the vector representing the entire image.
4.3.2 Classification Head
During pre-training, an MLP Head with one hidden layer is attached to the [CLS] token output. During fine-tuning, only a single Linear Layer is used.
The paper also experimented with Global Average Pooling (GAP) of the patch tokens instead of the [CLS] token, reporting similar performance. The [CLS] token was adopted as the default to maintain consistency with the NLP Transformer.
4.4 Position Embedding: Injecting Positional Information
4.4.1 Why Position Embedding Is Needed
Self-Attention is inherently permutation invariant. Reordering the input tokens simply reorders the output accordingly. However, the spatial location of patches in images carries important information. A patch in the top-left and one in the bottom-right of an image have different spatial meanings.
4.4.2 1D Learnable Position Embedding
ViT uses learnable 1D Position Embeddings E_pos ∈ R^((N+1) x D), where N + 1 is the total sequence length including the [CLS] token (197 for ViT-B/16 at 224x224 resolution).
The paper conducted comparison experiments with 2D Position Embedding (encoding patch row/column positions separately), finding no significant performance difference between 1D and 2D Position Embeddings. This suggests that ViT can learn 2D spatial structure from 1D Position Embeddings on its own.
In fact, when visualizing the learned Position Embeddings (Figure 7 in the paper), spatially adjacent patches show high cosine similarity in their Position Embeddings, with row and column structures naturally emerging. This is an impressive result demonstrating that the model automatically learned 2D positional relationships from data.
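The same kind of similarity analysis is straightforward to reproduce on any trained checkpoint. In this sketch the embedding tensor is random, standing in for a real trained `pos_embed` parameter of shape (1, 197, 768):

```python
import torch
import torch.nn.functional as F

pos_embed = torch.randn(1, 197, 768)      # stand-in for a trained parameter
patch_pos = pos_embed[0, 1:]              # drop the [CLS] slot -> (196, 768)

# Cosine similarity between the top-left patch and every patch position
query = patch_pos[0]
sim = F.cosine_similarity(query.unsqueeze(0), patch_pos, dim=-1)
sim_grid = sim.reshape(14, 14)            # similarity map over the 14x14 grid

print(sim_grid.shape)                     # torch.Size([14, 14])
```

On a trained model, plotting `sim_grid` for each query position reproduces the row/column structure described above.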
4.5 Transformer Encoder
4.5.1 Architecture
ViT uses the standard Transformer Encoder as-is. Each block has the following structure:

z' = MSA(LN(z)) + z
z_out = MLP(LN(z')) + z'

Where:
- LN: Layer Normalization (Pre-norm style, different from the original Transformer's Post-norm)
- MSA: Multi-head Self-Attention
- MLP: Feed-Forward Network
- Residual Connection: Input is added to each sub-block's output
What distinguishes ViT from the original Transformer is the adoption of Pre-norm. In the original "Attention Is All You Need," Layer Normalization was applied after the sub-layer output (Post-norm), but ViT applies it before the sub-layer input (Pre-norm). Pre-norm is known to provide better training stability.
4.5.2 Multi-head Self-Attention (MSA)
MSA splits the input into h heads, performs Self-Attention independently in each head, and then concatenates the results:

Attention(Q, K, V) = softmax(QK^T / sqrt(D_h)) V

Each head dimension is D_h = D / h. For ViT-Base, where D = 768 and h = 12, D_h = 64.
4.5.3 MLP (Feed-Forward Network)
The MLP consists of two Linear Layers with a GELU activation function.
The hidden dimension is typically set to 4 times the Embedding Dimension (D_mlp = 4D). For ViT-Base, this is 4 x 768 = 3072.
The use of GELU (Gaussian Error Linear Unit) instead of the ReLU used in the original Transformer follows BERT's design.
5. Mathematical Analysis: From Patches to Predictions
The entire process can be summarized mathematically as follows.
5.1 Input Processing
Given an image x ∈ R^(H x W x C):
Step 1. Patch splitting and flattening: x -> x_p ∈ R^(N x (P^2 · C)), with N = HW / P^2
Step 2. Patch Embedding + [CLS] Token + Position Embedding:
z_0 = [x_class; x_p^1 E; x_p^2 E; ...; x_p^N E] + E_pos
Where E ∈ R^((P^2 · C) x D), E_pos ∈ R^((N+1) x D)
5.2 Transformer Encoder
Step 3. Repeat for blocks l = 1, ..., L:
z'_l = MSA(LN(z_(l-1))) + z_(l-1)
z_l = MLP(LN(z'_l)) + z'_l
5.3 Output
Step 4. Classification via the [CLS] token output: y = LN(z_L^0), which is fed to the classification head.
5.4 Computational Complexity Analysis
For sequence length N and embedding dimension D, the complexities of key operations are:
| Operation | Complexity |
|---|---|
| Patch Embedding | O(N · P^2 · C · D) |
| Self-Attention (QKV generation) | O(N · D^2) |
| Self-Attention (Attention computation) | O(N^2 · D) |
| MLP | O(N · D^2) |
| Total Transformer Block | O(N^2 · D + N · D^2) |
The O(N^2) complexity with respect to sequence length is the key bottleneck limiting ViT's resolution scalability. This motivated the introduction of Windowed Attention in subsequent works like Swin Transformer.
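To make the scaling concrete, here is a back-of-envelope cost model for one encoder block, following the O(N^2 · D + N · D^2) terms above (a rough sketch; constant factors are approximate):

```python
def block_flops(N: int, D: int) -> int:
    """Approximate multiply-accumulate count for one Transformer block."""
    qkv = 3 * N * D * D          # Q, K, V projections
    attn = 2 * N * N * D         # QK^T scores plus attention-weighted sum of V
    proj = N * D * D             # output projection
    mlp = 2 * N * D * (4 * D)    # two linear layers with a 4D hidden dim
    return qkv + attn + proj + mlp

# Quadrupling the sequence length (e.g. halving the patch size) multiplies
# the attention term by 16 but the linear terms only by 4
base = block_flops(196, 768)     # ViT-B/16 at 224x224
hires = block_flops(784, 768)    # same width, 4x the tokens
print(round(hires / base, 2))    # 4.49
```

At N = 196 the linear O(N · D^2) terms still dominate; only at much longer sequences does the quadratic attention term take over.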
6. Model Variants: ViT-Base, Large, Huge
Following BERT's design conventions, ViT defines three model sizes. The number after the slash in the model name indicates the patch size (e.g., ViT-B/16 uses 16x16 patches).
| Specification | ViT-Base (ViT-B) | ViT-Large (ViT-L) | ViT-Huge (ViT-H) |
|---|---|---|---|
| Layers () | 12 | 24 | 32 |
| Hidden Dim () | 768 | 1024 | 1280 |
| MLP Dim | 3072 | 4096 | 5120 |
| Attention Heads () | 12 | 16 | 16 |
| Parameters | ~86M | ~307M | ~632M |
| Head Dim () | 64 | 64 | 80 |
6.1 Variants by Patch Size
Even with the same architecture, performance and computational cost vary significantly depending on patch size.
| Model | Patch Size | Sequence Length (224x224) | Sequence Length (384x384) |
|---|---|---|---|
| ViT-B/32 | 32x32 | 49 | 144 |
| ViT-B/16 | 16x16 | 196 | 576 |
| ViT-L/16 | 16x16 | 196 | 576 |
| ViT-H/14 | 14x14 | 256 | 784 |
Smaller patch sizes capture more fine-grained visual information, but the increased sequence length causes attention costs to grow as O(N^2). At 224x224, ViT-H/14's 14x14 patches yield a sequence about 5 times longer than ViT-B/32's 32x32 patches (256 vs. 49 tokens), making the attention computation roughly 27 times more expensive.
6.2 Hybrid Model
The paper also experimented with Hybrid models combining CNN and Transformer. Intermediate feature maps from ResNet are used instead of patch embeddings. For example, the stage 4 output (14x14 feature map) of ResNet-50 is treated as 1x1 patches and fed into the Transformer.
Experimental results showed that the Hybrid model outperformed pure ViT when pre-training data was limited, but pure ViT caught up when data scale was sufficiently large.
7. Training Strategy
7.1 Pre-training
ViT's training strategy follows the NLP paradigm of "pre-training + fine-tuning."
Datasets:
- ImageNet-1K: Approximately 1.3 million images, 1,000 classes
- ImageNet-21K: Approximately 14 million images, 21,843 classes
- JFT-300M: Google's internal dataset, approximately 300 million images, 18,291 classes
Pre-training configuration:
- Optimizer: Adam (beta_1 = 0.9, beta_2 = 0.999)
- Batch Size: 4,096
- Weight Decay: 0.1
- Linear Learning Rate Warmup + Cosine Decay
- Resolution: 224x224
One of the paper's key findings is that the scale of pre-training data is critical for ViT's performance. With ImageNet-1K alone, ViT falls behind CNNs, but at the JFT-300M scale, it surpasses them.
7.2 Fine-tuning
During fine-tuning, the pre-trained Classification Head is removed and a new Linear Layer matching the target task is attached.
Key technique for high-resolution fine-tuning:
Fine-tuning at a higher resolution (384x384 or 512x512) than pre-training (224x224) improves performance. However, when resolution changes, the number of patches (sequence length) changes, so pre-trained Position Embeddings cannot be used directly.
To solve this, 2D Interpolation is applied. The pre-trained Position Embeddings are restructured into their original 2D grid form, then resized to the new resolution using Bicubic Interpolation.
Example: 224x224 pre-training (14x14 grid) -> 384x384 fine-tuning (24x24 grid)
Pre-trained Position Embedding (14x14 = 196)
|
v
Restructure to 2D grid (14 x 14)
|
v
Bicubic Interpolation -> (24 x 24)
|
v
Flatten back to 1D (24x24 = 576)
This Position Embedding Interpolation has been adopted as a standard technique in virtually all Vision Transformers after ViT.
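A minimal version of this interpolation can be written directly with `F.interpolate` (a sketch following the diagram above; `pos_embed` here is random, standing in for a pre-trained parameter):

```python
import torch
import torch.nn.functional as F

pos_embed = torch.randn(1, 197, 768)          # stand-in for a trained parameter
cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]

# (1, 196, 768) -> (1, 768, 14, 14) so we can interpolate in 2D
grid = patch_pos.transpose(1, 2).reshape(1, 768, 14, 14)
grid = F.interpolate(grid, size=(24, 24), mode="bicubic", align_corners=False)

# Back to sequence form (1, 576, 768), then re-attach the [CLS] embedding,
# which is position-independent and is kept as-is
patch_pos_new = grid.reshape(1, 768, 24 * 24).transpose(1, 2)
pos_embed_new = torch.cat([cls_pos, patch_pos_new], dim=1)
print(pos_embed_new.shape)                    # torch.Size([1, 577, 768])
```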
7.3 Training Cost
| Model | Pre-train Data | TPUv3-core-days |
|---|---|---|
| ViT-B/16 | JFT-300M | Not provided |
| ViT-L/16 | JFT-300M | Not provided |
| ViT-H/14 | JFT-300M | 2,500 |
| BiT-L (ResNet152x4) | JFT-300M | 9,900 |
| Noisy Student (EfficientNet-L2) | JFT-300M + ImageNet | 12,300 |
ViT-H/14 achieved higher performance at approximately 1/4 the training cost of BiT-L and 1/5 that of Noisy Student. This demonstrates the Transformer's excellent scaling efficiency.
8. Experimental Results
8.1 Main Benchmark Results
The key results reported in Table 2 of the paper are summarized below.
| Model | Pre-train | ImageNet | ImageNet-ReaL | CIFAR-10 | CIFAR-100 | Oxford Pets | Oxford Flowers |
|---|---|---|---|---|---|---|---|
| ViT-H/14 | JFT-300M | 88.55 | 90.72 | 99.50 | 94.55 | 97.56 | 99.68 |
| ViT-L/16 | JFT-300M | 87.76 | 90.54 | 99.42 | 93.90 | 97.32 | 99.74 |
| ViT-L/16 | ImageNet-21K | 85.30 | 88.62 | 99.15 | 93.25 | 94.67 | 99.61 |
| BiT-L (ResNet152x4) | JFT-300M | 87.54 | 90.54 | 99.37 | 93.51 | 96.62 | 99.63 |
| Noisy Student | JFT-300M | 88.4 | 90.55 | - | - | - | - |
8.2 VTAB Benchmark
VTAB (Visual Task Adaptation Benchmark) categorizes 19 diverse vision tasks into Natural, Specialized, and Structured categories. It evaluates model generalization ability using only 1,000 training samples per task.
| Model | Natural | Specialized | Structured | Overall |
|---|---|---|---|---|
| ViT-H/14 (JFT) | 79.39 | 84.23 | 69.27 | 77.63 |
| ViT-L/16 (JFT) | 76.28 | 83.36 | 64.72 | 74.78 |
| BiT-L (JFT) | 76.29 | 84.92 | 66.51 | 75.90 |
ViT-H/14 achieved the best performance in the Natural and Structured categories and was nearly on par with BiT-L in the Specialized category. The overall VTAB score of 77.63 was the highest at the time.
8.3 Relationship Between Pre-training Data Scale and Performance
One of the paper's most important experiments analyzed performance changes with pre-training data scale (Figures 3, 4).
| Pre-training Data | ViT-L/16 ImageNet Acc | BiT-L ImageNet Acc | Winner |
|---|---|---|---|
| ImageNet-1K (~1.3M) | ~76.5% (scratch) | ~80% (scratch) | BiT (CNN) |
| ImageNet-21K (~14M) | 85.30% | 84.02% | ViT |
| JFT-300M (~303M) | 87.76% | 87.54% | ViT |
These results reveal a clear pattern:
- Small-scale data: CNN's inductive biases (Locality, Translation Equivariance) work advantageously, giving CNNs the edge
- Large-scale data: Transformers learn these patterns directly from data, surpassing CNNs
9. Key Findings and Insights
9.1 The Double-Edged Sword of Inductive Bias
CNN's inductive biases -- Locality and Translation Equivariance -- serve as effective regularization with small-scale data, guiding the model to learn correct features even with limited data.
However, when data is sufficiently abundant, these inductive biases become shackles that limit the model's representational capacity. Transformers, without special structural assumptions, can learn more general and flexible patterns including Locality and Translation Equivariance through Self-Attention.
This reaffirms a longstanding lesson in AI: given sufficient data, a more general (less biased) model will outperform a more specialized (more biased) model.
9.2 Attention Map Visualization
9.2.1 Position Embedding Similarity
Visualizing the cosine similarity of learned Position Embeddings (Figure 7, left), each patch position's embedding shows high similarity with spatially adjacent patches. Furthermore, distinct similarity patterns appear between patches in the same row or column.
This proves that despite using only 1D Position Embeddings, the model automatically learned 2D spatial structure.
9.2.2 Attention Distance
Figure 7 (right) analyzes the mean Attention Distance per Attention Head across Transformer layers. Attention Distance is the average pixel distance between Query-Key patches, weighted by Attention Weights.
Key findings:
- Lower Layers: Some heads attend to adjacent patches, others to distant patches -- local and global patterns coexist, similar to early CNN layers
- Higher Layers: Most heads distribute attention across wide ranges -- integrating global information
This demonstrates that ViT is fundamentally different from CNNs in that it can leverage global information from the very first layer. In CNNs, the receptive field gradually expands across layers, but in ViT, attention over the entire image is possible from the first layer.
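The metric itself is simple to compute given a head's attention matrix. In this sketch the weights are random, standing in for one real head's (N, N) softmax output on a 14x14 grid of 16-pixel patches:

```python
import torch

grid_size, patch_size = 14, 16

# Pixel-space coordinates of each of the 196 patch positions
coords = torch.stack(torch.meshgrid(
    torch.arange(grid_size), torch.arange(grid_size), indexing="ij"), dim=-1)
coords = coords.reshape(-1, 2).float() * patch_size          # (196, 2)

dist = torch.cdist(coords, coords)                           # (196, 196) pixel distances

attn = torch.softmax(torch.randn(196, 196), dim=-1)          # stand-in attention weights
# Attention-weighted query-key distance, averaged over all queries
mean_attn_distance = (attn * dist).sum(dim=-1).mean()
print(float(mean_attn_distance) > 0)                         # True
```

Computing this per head and per layer on a trained model reproduces the local-to-global trend described above.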
9.3 Representation Quality Analysis
In Linear Probing experiments (training only a Linear Classifier on frozen features), ViT showed relatively lower performance compared to CNNs, but achieved higher performance in fine-tuning. This suggests ViT learns a different type of feature representation -- ViT's features may contain richer information that is activated through fine-tuning.
10. PyTorch Core Implementation
Below is a PyTorch implementation of ViT's core structure, faithfully reflecting the paper's implementation while being written for clarity.
10.1 Patch Embedding
```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Splits images into patches and embeds them via Linear Projection."""

    def __init__(
        self,
        img_size: int = 224,
        patch_size: int = 16,
        in_channels: int = 3,
        embed_dim: int = 768,
    ):
        super().__init__()
        self.img_size = img_size
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2

        # Patch Embedding using Conv2d
        # Setting stride = patch_size makes it equivalent to Linear Projection
        self.projection = nn.Conv2d(
            in_channels,
            embed_dim,
            kernel_size=patch_size,
            stride=patch_size,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        # projection output: (B, embed_dim, H/P, W/P)
        x = self.projection(x)
        # flatten spatial dims and transpose: (B, num_patches, embed_dim)
        x = x.flatten(2).transpose(1, 2)
        return x
```
10.2 Multi-head Self-Attention
```python
class MultiHeadSelfAttention(nn.Module):
    """Multi-Head Self-Attention module."""

    def __init__(self, embed_dim: int = 768, num_heads: int = 12, dropout: float = 0.0):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5  # 1/sqrt(d_k)

        # Generate Q, K, V all at once
        self.qkv = nn.Linear(embed_dim, embed_dim * 3)
        self.proj = nn.Linear(embed_dim, embed_dim)
        self.attn_dropout = nn.Dropout(dropout)
        self.proj_dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape

        # QKV generation: (B, N, 3*D) -> (B, N, 3, num_heads, head_dim)
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)  # (3, B, num_heads, N, head_dim)
        q, k, v = qkv.unbind(0)  # each (B, num_heads, N, head_dim)

        # Scaled Dot-Product Attention
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, num_heads, N, N)
        attn = attn.softmax(dim=-1)
        attn = self.attn_dropout(attn)

        # Weighted sum with Values
        x = (attn @ v).transpose(1, 2).reshape(B, N, D)  # (B, N, D)
        x = self.proj(x)
        x = self.proj_dropout(x)
        return x
```
10.3 Transformer Encoder Block
```python
class TransformerBlock(nn.Module):
    """ViT Transformer Encoder Block (Pre-norm)."""

    def __init__(
        self,
        embed_dim: int = 768,
        num_heads: int = 12,
        mlp_ratio: float = 4.0,
        dropout: float = 0.0,
    ):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = MultiHeadSelfAttention(embed_dim, num_heads, dropout)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, int(embed_dim * mlp_ratio)),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(int(embed_dim * mlp_ratio), embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-norm + MSA + Residual
        x = x + self.attn(self.norm1(x))
        # Pre-norm + MLP + Residual
        x = x + self.mlp(self.norm2(x))
        return x
```
10.4 Full ViT Model
```python
class VisionTransformer(nn.Module):
    """Vision Transformer (ViT) full model."""

    def __init__(
        self,
        img_size: int = 224,
        patch_size: int = 16,
        in_channels: int = 3,
        num_classes: int = 1000,
        embed_dim: int = 768,
        depth: int = 12,
        num_heads: int = 12,
        mlp_ratio: float = 4.0,
        dropout: float = 0.0,
    ):
        super().__init__()
        self.patch_embed = PatchEmbedding(img_size, patch_size, in_channels, embed_dim)
        num_patches = self.patch_embed.num_patches

        # Learnable [CLS] Token
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learnable Position Embedding ([CLS] + patches)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        self.pos_dropout = nn.Dropout(dropout)

        # Transformer Encoder
        self.blocks = nn.Sequential(
            *[
                TransformerBlock(embed_dim, num_heads, mlp_ratio, dropout)
                for _ in range(depth)
            ]
        )
        self.norm = nn.LayerNorm(embed_dim)

        # Classification Head
        self.head = nn.Linear(embed_dim, num_classes)

        # Initialization
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        nn.init.trunc_normal_(self.cls_token, std=0.02)
        self.apply(self._init_weights)

    def _init_weights(self, m: nn.Module):
        if isinstance(m, nn.Linear):
            nn.init.trunc_normal_(m.weight, std=0.02)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, nn.LayerNorm):
            nn.init.ones_(m.weight)
            nn.init.zeros_(m.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B = x.shape[0]

        # Step 1: Patch Embedding
        x = self.patch_embed(x)  # (B, num_patches, embed_dim)

        # Step 2: [CLS] Token Prepend
        cls_tokens = self.cls_token.expand(B, -1, -1)  # (B, 1, embed_dim)
        x = torch.cat([cls_tokens, x], dim=1)  # (B, num_patches + 1, embed_dim)

        # Step 3: Position Embedding
        x = x + self.pos_embed
        x = self.pos_dropout(x)

        # Step 4: Transformer Encoder
        x = self.blocks(x)
        x = self.norm(x)

        # Step 5: Classification via [CLS] Token
        cls_output = x[:, 0]  # (B, embed_dim)
        logits = self.head(cls_output)  # (B, num_classes)
        return logits


# Model variant creation functions
def vit_base_patch16_224(**kwargs):
    return VisionTransformer(
        img_size=224, patch_size=16, embed_dim=768,
        depth=12, num_heads=12, **kwargs,
    )


def vit_large_patch16_224(**kwargs):
    return VisionTransformer(
        img_size=224, patch_size=16, embed_dim=1024,
        depth=24, num_heads=16, **kwargs,
    )


def vit_huge_patch14_224(**kwargs):
    return VisionTransformer(
        img_size=224, patch_size=14, embed_dim=1280,
        depth=32, num_heads=16, **kwargs,
    )
```
10.5 Usage Example
```python
# Create ViT-Base/16 model
model = vit_base_patch16_224(num_classes=1000)

# Check parameter count
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")
# Output: Total parameters: 86,567,656

# Inference
dummy_input = torch.randn(1, 3, 224, 224)
output = model(dummy_input)
print(f"Output shape: {output.shape}")
# Output: Output shape: torch.Size([1, 1000])
```
11. Comprehensive Survey of Follow-up Research
ViT became the starting point for the massive research wave of Vision Transformers. The major follow-up works are summarized chronologically below.
11.1 DeiT: Data-efficient Image Transformers (2020.12)
Paper: "Training data-efficient image transformers & distillation through attention" (Touvron et al., Facebook AI)
Key Contribution: Overcame ViT's greatest limitation -- dependence on large-scale data. Presented methods to effectively train ViT using only ImageNet-1K.
Key Techniques:
- Knowledge Distillation Token: Added a separate Distillation Token alongside the [CLS] token to effectively transfer knowledge from a teacher model (CNN)
- Hard Distillation: Trained the student using the teacher's hard labels (argmax predictions)
- Interestingly, CNN teachers were more effective than Transformer teachers -- CNN's inductive biases were transferred to the Transformer through distillation
Results:
- DeiT-B: 83.1% top-1 accuracy on ImageNet-1K (without external data)
- DeiT-B distilled: 85.2% (using RegNetY-16GF teacher)
- Surpassed ViT-B/16's JFT-300M pre-trained 84.15% without external data
11.2 Swin Transformer (2021.03)
Paper: "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" (Liu et al., Microsoft Research)
Key Contribution: Simultaneously resolved two fundamental limitations of ViT -- quadratic O(N^2) attention complexity and single-scale features.
Key Techniques:
- Window-based Self-Attention: Splits the image into fixed-size windows (7x7 patches) and performs Self-Attention only within windows. Complexity drops from quadratic O(N^2) to linear O(N) in the number of patches (linear with respect to image size)
- Shifted Window: Shifts windows in consecutive layers to enable information exchange between windows
- Hierarchical Feature Map: Merges patches (Patch Merging) as layers deepen, halving resolution and doubling channels. Provides multi-scale features similar to CNN's Feature Pyramid
Results:
- Swin-L: 87.3% top-1 accuracy on ImageNet-1K (ImageNet-22K pre-training)
- COCO Object Detection: 58.7 box AP, 51.1 mask AP
- ADE20K Semantic Segmentation: 53.5 mIoU
Swin Transformer became a more versatile backbone than ViT for dense prediction tasks such as Object Detection and Semantic Segmentation.
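The saving from windowing is easy to quantify. For N patches and an M x M window (Swin uses M = 7), global attention computes N^2 token pairs while window attention computes only N · M^2 (an illustrative back-of-envelope sketch, not Swin's actual code):

```python
def global_pairs(n: int) -> int:
    return n * n                  # every token attends to every token

def window_pairs(n: int, m: int = 7) -> int:
    return n * m * m              # each token attends within its m x m window

n = 56 * 56                       # Swin's first stage: 56x56 = 3,136 patches
print(global_pairs(n) // window_pairs(n))     # 64x fewer attention pairs
```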
11.3 BEiT: BERT Pre-Training of Image Transformers (2021.06)
Paper: "BEiT: BERT Pre-Training of Image Transformers" (Bao et al., Microsoft Research)
Key Contribution: Successfully applied NLP's BERT-style Masked Language Modeling to Vision for the first time.
Key Techniques:
- Masked Image Modeling (MIM): Masks a portion (approximately 40%) of image patches and predicts the Visual Tokens of masked patches
- Visual Tokenizer: Uses a dVAE (discrete Variational Autoencoder) to convert images into discrete Visual Tokens -- corresponding to NLP's vocabulary
- Two views: Original image patches (input) and Visual Tokens (prediction target)
Results:
- BEiT-B: 83.2% on ImageNet-1K (self-supervised pre-training + fine-tuning)
- BEiT-L: 86.3% on ImageNet-1K (using only ImageNet-1K data)
- +1.4% improvement over DeiT-B's 81.8% (same Base model size)
11.4 MAE: Masked Autoencoders (2021.11)
Paper: "Masked Autoencoders Are Scalable Vision Learners" (He et al., Facebook AI Research)
Key Contribution: Maximized the efficiency and scalability of self-supervised vision pre-training. Presented a simple yet powerful pre-training framework.
Key Techniques:
- High masking ratio: Masks 75% of input patches -- much higher than NLP (15%). This is due to the high information redundancy in images
- Asymmetric Encoder-Decoder: Encoder processes only unmasked patches (25% of total), maximizing efficiency. A lightweight decoder reconstructs masked pixels
- Pixel-level reconstruction: Unlike BEiT, directly reconstructs raw pixel values of masked patches without a Visual Tokenizer
Results:
- MAE (ViT-H): 87.8% on ImageNet-1K (using only ImageNet-1K data)
- Training efficiency: Encoder processes only 25% of total patches, reducing training time by more than 3x
- Far superior to training ViT from scratch without pre-training
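The random-masking step at the heart of this asymmetric design can be sketched in a few lines (illustrative only; the tensor is a random stand-in for patch embeddings):

```python
import torch

tokens = torch.randn(1, 196, 768)             # stand-in patch embeddings
mask_ratio = 0.75
num_keep = int(196 * (1 - mask_ratio))        # 49 visible patches

# Random permutation per sample; the first num_keep indices stay visible
noise = torch.rand(1, 196)
shuffle = noise.argsort(dim=1)
keep_idx = shuffle[:, :num_keep]

# Gather only the visible tokens -- the encoder never sees the other 75%
visible = torch.gather(
    tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, 768))
print(visible.shape)                          # torch.Size([1, 49, 768])
```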
11.5 DINO / DINOv2 (2021.04 / 2023.04)
DINO Paper: "Emerging Properties in Self-Supervised Vision Transformers" (Caron et al., Facebook AI Research)
DINOv2 Paper: "DINOv2: Learning Robust Visual Features without Supervision" (Oquab et al., Meta AI)
DINO Key Techniques:
- Self-Distillation: Teacher and Student share the same network architecture, with the Teacher being an Exponential Moving Average (EMA) of the Student
- Multi-crop strategy: Cross-feeds Global Views (full image) and Local Views (small crops)
- Discovery: Self-supervised ViT's Self-Attention Maps acquire object segmentation capability without explicit training
DINOv2 Key Techniques:
- Automated pipeline for building a large-scale curated dataset (LVD-142M)
- Trains a 1B parameter teacher model, then distills to smaller student models via Knowledge Distillation
- Learns universal visual features without text, without labels
Results:
- DINO (ViT-B): 80.1% on ImageNet Linear Probing
- DINOv2: Universal Visual Features surpassing OpenCLIP on most benchmarks
11.6 EVA / EVA-02 (2022 / 2023)
EVA Paper: "EVA: Exploring the Limits of Masked Visual Representation Learning at Scale" (Fang et al., BAAI)
EVA-02 Paper: "EVA-02: A Visual Representation for Neon Genesis" (Fang et al., BAAI)
Key Techniques:
- Uses Masked Image Modeling + CLIP Features as reconstruction targets
- Learns language-aligned visual features, strong in vision-language tasks
- Presents techniques for efficiently training large-scale ViTs
Results:
- EVA (ViT-g): 89.6% on ImageNet (336x336)
- EVA-02: Achieved 90.0% on ImageNet with 304M parameters (using only public data)
- EVA-02-CLIP: Zero-shot ImageNet 80.4% (1/6 the parameters of the previous best CLIP)
11.7 ConvNeXt: The CNN Strikes Back (2022.01)
Paper: "A ConvNet for the 2020s" (Liu et al., Facebook AI Research / UC Berkeley)
Key Contribution: Proved that systematically applying Transformer design principles to a pure CNN enables CNNs to achieve performance on par with Transformers. A compelling counterargument to "are CNNs truly inferior?"
Transformer design elements applied to ResNet:
- Macro Design: Adjusted Stage Ratio (ResNet's 3:4:6:3 -> Swin-T's 1:1:3:1)
- Replaced Stem with Patchify (4x4 Conv, stride 4)
- ResNeXt-style Grouped Convolution -> Depthwise Convolution
- Inverted Bottleneck (MobileNetV2 style)
- 7x7 Large Kernel (corresponding to Swin Transformer's 7x7 window)
- BN -> LayerNorm, ReLU -> GELU and other activation/normalization changes
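These design elements combine into a single residual block. The following is a simplified sketch of one ConvNeXt block (LayerScale and stochastic depth from the paper are omitted for brevity):

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Simplified ConvNeXt block: 7x7 depthwise conv -> LayerNorm ->
    inverted-bottleneck MLP (pointwise convs as Linear) with GELU, plus residual."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)                  # applied over channels (NHWC)
        self.pwconv1 = nn.Linear(dim, expansion * dim) # expand (inverted bottleneck)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(expansion * dim, dim) # project back

    def forward(self, x):                              # x: (B, C, H, W)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                      # NCHW -> NHWC for LayerNorm
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                      # back to NCHW
        return residual + x
```

Note how each Transformer idea maps onto a convolutional counterpart: the 7x7 depthwise conv plays the role of (windowed) self-attention's spatial mixing, while the two Linear layers mirror the Transformer's MLP block.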
Results:
- ConvNeXt-B: 85.1% on ImageNet-1K (+0.6% over Swin-B's 84.5%, 12.5% faster inference)
- ConvNeXt-L: 87.8% with ImageNet-22K pre-training
- Matched or surpassed Swin Transformer not only in performance but also in throughput
12. ViT vs CNN vs Hybrid Comparison
12.1 Comprehensive Comparison Table
| Characteristic | CNN (ResNet family) | ViT (Pure Transformer) | Hybrid (CNN + Transformer) |
|---|---|---|---|
| Inductive Bias | Strong (Locality, Translation Equivariance) | Nearly none | Moderate (some from CNN) |
| Small-scale Data Performance | Excellent | Inferior | Excellent |
| Large-scale Data Performance | Good | Best | Very good |
| Complexity (vs. Resolution) | Linear in pixel count | Quadratic in token count (global self-attention) | Moderate (windowed/local attention) |
| Multi-scale Features | Natural (Feature Pyramid) | Absent (Single-scale) | Varies |
| Global Receptive Field | Requires stacking layers | Available from first layer | Available after CNN |
| Dense Prediction Suitability | High | Low (post-processing needed) | Medium to high |
| Efficiency (Performance/FLOPs) | Good | Best at large scale | Good |
| Implementation Maturity | Very high | Rapidly maturing | Medium |
| Representative Models | ResNet, EfficientNet, ConvNeXt | ViT, DeiT, BEiT | Swin Transformer, CoAtNet |
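The complexity difference is easy to quantify: with a fixed 16x16 patch size, the token count N grows quadratically with image side length, and global self-attention costs O(N^2) per layer, so doubling the resolution multiplies the attention cost by roughly 16x. A quick back-of-the-envelope check:

```python
# With a fixed 16x16 patch size, token count N grows quadratically with the
# image side length, and global self-attention scales as O(N^2) per layer.
def num_patches(side, patch=16):
    return (side // patch) ** 2

for side in (224, 448, 896):
    n = num_patches(side)
    print(f"{side}px -> {n} tokens, {n * n:,} attention pairs")
# 224px -> 196 tokens, 38,416 attention pairs
# 448px -> 784 tokens, 614,656 attention pairs
# 896px -> 3,136 tokens, 9,834,496 attention pairs
```

This quadratic blow-up is exactly what hierarchical designs like Swin Transformer sidestep with windowed attention, and what motivates the token pruning and linear-attention work surveyed below.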
12.2 Recommended Architecture by Task (As of 2026)
| Task | Recommended Architecture | Rationale |
|---|---|---|
| Image Classification (Large-scale) | ViT + MAE/DINO pre-training | Best performance at scale |
| Image Classification (Small-scale) | DeiT (Distillation) or ConvNeXt | Data efficiency |
| Object Detection | Swin Transformer + FPN family | Multi-scale features required |
| Semantic Segmentation | Swin / SegFormer / DINOv2 | Suited for dense prediction |
| Vision-Language | ViT + CLIP-style pre-training | Language-aligned features |
| Edge/Mobile Deployment | EfficientNet / MobileViT | Lightweight required |
| Self-supervised Pre-training | MAE / DINOv2 | No labels needed, scalable |
13. The Future of Computer Vision: Foundation Models
13.1 The Emergence of Vision Foundation Models
The ultimate outcome of the paradigm shift triggered by ViT is the emergence of Vision Foundation Models. Just as foundation models like GPT-3 and GPT-4 in NLP handle diverse tasks with a single model, the same trend is underway in Vision.
Major Vision Foundation Models:
- SAM (Segment Anything Model): Based on ViT-H, handles all types of segmentation with a single model
- DINOv2: Self-supervised ViT, a universal Visual Feature Extractor
- CLIP/SigLIP: Vision-Language alignment, Zero-shot Classification and Retrieval
- Florence/Intern: Large-scale multi-task Vision-Language models
13.2 Future Research Directions
Efficiency Improvements:
- Overcoming self-attention's quadratic bottleneck with FlashAttention, Linear Attention, etc.
- Removing unnecessary patches with Token Pruning/Merging
- Creating lightweight models through Knowledge Distillation
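Token pruning can be sketched as keeping only the patches the [CLS] token attends to most strongly; the scoring rule and the `prune_tokens` helper below are illustrative, not any specific published method:

```python
import torch

def prune_tokens(tokens, cls_attn, keep=0.5):
    """Keep the top-k patch tokens ranked by [CLS] attention score.

    tokens:   (B, N, D) patch tokens (CLS token excluded)
    cls_attn: (B, N) attention weight from [CLS] to each patch
    keep:     fraction of tokens to retain
    """
    B, N, D = tokens.shape
    k = max(1, int(N * keep))
    idx = cls_attn.topk(k, dim=1).indices                       # (B, k)
    return tokens.gather(1, idx.unsqueeze(-1).expand(B, k, D))  # (B, k, D)
```

Applied between Transformer layers, halving the token count roughly quarters the cost of each subsequent self-attention layer, which is where the speedups in this line of work come from.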
Learning Paradigms:
- Proliferation of Self-supervised Pre-training (MAE, DINO families)
- Vision-Language Alignment (CLIP family)
- Reinforcement Learning-based visual decision-making (VLM + RL)
Architectural Innovation:
- Applying Mamba / State Space Models to Vision (Vision Mamba, VMamba)
- Efficient scaling using Mixture of Experts (MoE)
- Continued research on hybrid architectures combining CNN and Transformer strengths
13.3 Historical Significance of ViT
The most important lesson ViT left behind is the universality of architecture. The fact that a single architecture (Transformer) can be applied to text, images, audio, video, code, and every modality is a truly remarkable event in AI history.
Before ViT, NLP and Vision had completely different architectural ecosystems. After ViT, the Transformer established itself as a true Universal Architecture, which is the technical foundation enabling today's Multimodal Foundation Models (GPT-4V, Gemini, Claude, etc.).
"An Image is Worth 16x16 Words" -- this title was not merely a metaphor, but a profound declaration that Vision and Language can be unified within the same framework.
14. References
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021. arXiv:2010.11929
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017. arXiv:1706.03762
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jegou, H. (2021). Training data-efficient image transformers & distillation through attention (DeiT). ICML 2021. arXiv:2012.12877
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., ... & Guo, B. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV 2021. arXiv:2103.14030
Bao, H., Dong, L., Piao, S., & Wei, F. (2021). BEiT: BERT Pre-Training of Image Transformers. ICLR 2022. arXiv:2106.08254
He, K., Chen, X., Xie, S., Li, Y., Dollar, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners (MAE). CVPR 2022. arXiv:2111.06377
Caron, M., Touvron, H., Misra, I., Jegou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging Properties in Self-Supervised Vision Transformers (DINO). ICCV 2021. arXiv:2104.14294
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., ... & Bojanowski, P. (2023). DINOv2: Learning Robust Visual Features without Supervision. arXiv:2304.07193
Fang, Y., Wang, W., Xie, B., Sun, Q., Wu, L., Wang, X., ... & Cao, Y. (2023). EVA: Exploring the Limits of Masked Visual Representation Learning at Scale. CVPR 2023. arXiv:2211.07636
Fang, Y., Sun, Q., Wang, X., Huang, T., Wang, X., & Cao, Y. (2023). EVA-02: A Visual Representation for Neon Genesis. arXiv:2303.11331
Liu, Z., Mao, H., Wu, C. Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A ConvNet for the 2020s (ConvNeXt). CVPR 2022. arXiv:2201.03545
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019. arXiv:1810.04805
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition (ResNet). CVPR 2016. arXiv:1512.03385
Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., & Houlsby, N. (2020). Big Transfer (BiT): General Visual Representation Learning. ECCV 2020. arXiv:1912.11370
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). ICML 2021. arXiv:2103.00020
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., ... & Girshick, R. (2023). Segment Anything (SAM). ICCV 2023. arXiv:2304.02643