- 1. Overview: The Evolution of the Segment Anything Model
- 2. SAM 1: Segment Anything (2023)
- 3. SAM 2: Segment Anything in Images and Videos (2024)
- 4. SAM 3: Segment Anything with Concepts (2025)
- 5. Comprehensive Comparison of All Three Models
- 6. Practical Use Cases
- 7. Study Roadmap
- 8. References
1. Overview: The Evolution of the Segment Anything Model
Segment Anything Model (SAM) is a series of foundation models for image and video segmentation published by Meta AI Research. Just as GPT established the "prompt" paradigm in NLP, SAM introduced the paradigm of Promptable Segmentation to computer vision.
| Version | Released | Core Capability | Paper |
|---|---|---|---|
| SAM 1 | 2023.04 (ICCV 2023) | Image promptable segmentation | Segment Anything |
| SAM 2 | 2024.08 | Image + real-time video segmentation | SAM 2: Segment Anything in Images and Videos |
| SAM 3 | 2025.11 (ICLR 2026) | Concept-aware segmentation | SAM 3: Segment Anything with Concepts |
The core evolutionary direction of the three models can be summarized in one line:
SAM 1: "Where to segment?" → SAM 2: "Where — even in video?" → SAM 3: "What to segment?"
2. SAM 1: Segment Anything (2023)
2.1 Paper Information
- Title: Segment Anything
- Authors: Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao et al. (Meta AI Research)
- Published: April 2023 (arXiv), accepted at ICCV 2023
- Paper: arxiv.org/abs/2304.02643
- GitHub: github.com/facebookresearch/segment-anything
- License: Apache 2.0
2.2 Three Key Contributions
SAM 1 simultaneously introduced three things:
- A new task — Promptable Segmentation: Given any prompt (point, box, mask, or text), the model returns a valid segmentation mask
- A new model — SAM: A foundation model for prompt-based segmentation
- A new dataset — SA-1B: The largest segmentation dataset ever, containing 1.1 billion masks from 11 million images
2.3 Architecture Details
SAM's architecture is decomposed into three components. The core design principle is to run the heavy image encoder only once and repeatedly call the lightweight prompt encoder and mask decoder in real time.
┌─────────────────────────────────────────────────────────┐
│ SAM Architecture │
│ │
│ ┌──────────────┐ ┌────────────────┐ │
│ │ Image Encoder │ │ Prompt Encoder │ │
│ │ (ViT-H/L/B) │ │ (Point/Box/ │ │
│ │ MAE Pretrain │ │ Mask/Text) │ │
│ └──────┬───────┘ └───────┬────────┘ │
│ │ │ │
│ └─────────┬─────────┘ │
│ ▼ │
│ ┌────────────────┐ │
│ │ Mask Decoder │ │
│ │ (Transformer │ │
│ │ 2-layer) │──→ 3 masks + IoU scores │
│ └────────────────┘ │
└─────────────────────────────────────────────────────────┘
Image Encoder
| Variant | Parameters | Checkpoint Size |
|---|---|---|
| ViT-B | 91M | ~375 MB |
| ViT-L | 308M | ~1.25 GB |
| ViT-H (default) | 636M | ~2.56 GB |
- Vision Transformer pretrained with MAE (Masked Autoencoder)
- Input resolution: 1024×1024
- Output: 64×64 image embedding (256-dim)
- Runs only once per image — reused for each prompt afterwards
Prompt Encoder
Sparse prompts (points, boxes, text):
- Points/boxes → Positional Encoding + learned type embeddings (foreground point vs. background point)
- Text → processed via CLIP text encoder
Dense prompts (masks):
- Embedded via convolution layers and element-wise summed with the image embedding
Mask Decoder
- Modified Transformer Decoder (2 layers)
- Embedding dimension: 256, MLP inner dimension: 2048
- Ambiguity-aware output: predicts 3 candidate masks simultaneously for a single prompt
- Whole, Part, and Sub-part levels
- Each mask is assigned an IoU confidence score
- ~50 ms inference per prompt (light enough to run on CPU in a web browser) — enabling real-time interaction
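The ambiguity-aware design above can be illustrated with a small sketch: the decoder returns three candidate masks plus predicted IoU scores, and a common pattern is simply to keep the highest-scoring candidate. The arrays below are toy stand-ins for the predictor's actual output, not real model results.

```python
import numpy as np

# Toy stand-ins for the decoder's three candidates (whole / part / sub-part)
# and their predicted IoU confidence scores.
masks = np.zeros((3, 256, 256), dtype=bool)
masks[0, 50:200, 50:200] = True   # "whole" candidate
masks[1, 50:120, 50:120] = True   # "part" candidate
masks[2, 50:80, 50:80] = True     # "sub-part" candidate
scores = np.array([0.91, 0.72, 0.55])

# Keep the candidate with the highest predicted IoU.
best_idx = int(np.argmax(scores))
best_mask = masks[best_idx]
print(best_idx, best_mask.sum())  # 0 22500
```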
2.4 SA-1B Dataset
| Item | Value |
|---|---|
| Images | 11 million |
| Masks | 1.1 billion (~100 per image on average) |
| Auto-generated | 99.1% |
| Original resolution | ~3300×4950 |
| Dataset size | ~5 TB (images) + ~20 GB (annotations) |
Data Engine — 3-Phase Construction Process
| Phase | Method | Output |
|---|---|---|
| 1. Assisted-Manual | SAM assists human annotators via a browser-based tool | 120K images / 4.3M masks |
| 2. Semi-Automatic | SAM auto-suggests masks; humans add missed objects | 180K images / 5.9M masks |
| 3. Fully Automatic | 32×32 grid of auto-generated point prompts; NMS + confidence filtering | Majority of the 1.1B masks |
Quality verification: Human audit of 500 images (~50,000 masks) showed 94% achieving IoU > 0.90
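The fully automatic phase prompts the model with a regular grid of points. A minimal sketch of generating such a grid (cell-center spacing is my assumption; the real pipeline also does cropping and filtering):

```python
import numpy as np

def point_grid(n_per_side: int, height: int, width: int) -> np.ndarray:
    """Evenly spaced prompt points, one per grid-cell center,
    returned as pixel (x, y) coordinates."""
    offset = 1.0 / (2 * n_per_side)
    coords = np.linspace(offset, 1 - offset, n_per_side)
    xs, ys = np.meshgrid(coords, coords)
    grid = np.stack([xs.ravel(), ys.ravel()], axis=-1)  # normalized (x, y)
    return grid * np.array([width, height])             # scale to pixels

pts = point_grid(32, 1024, 1024)
print(pts.shape)  # (1024, 2)
```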
2.5 Key Innovations
Zero-Shot Transfer: Instantly applicable to new domains without fine-tuning
- Edge detection (BSDS500 ODS: 0.768)
- Object proposal generation (LVIS AR@1000: 59.3)
- Medical imaging, satellite imagery, underwater photography, microscopy images, etc.
Benchmark performance:
- Zero-shot evaluation across 23 datasets: outperformed existing SOTA (RITM) on 16
- With oracle selection: outperformed RITM on all 23
- No performance disparity across demographic groups such as race/gender (fairness validation)
2.6 Installation and Usage
# Installation
pip install git+https://github.com/facebookresearch/segment-anything.git
# Optional dependencies
pip install opencv-python pycocotools matplotlib onnxruntime onnx
Checkpoint download:
| Model | File |
|---|---|
| ViT-H (default) | sam_vit_h_4b8939.pth |
| ViT-L | sam_vit_l_0b3195.pth |
| ViT-B | sam_vit_b_01ec64.pth |
Prompt-based segmentation:
from segment_anything import SamPredictor, sam_model_registry
import cv2
import numpy as np

# Load model
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")

# Create predictor
predictor = SamPredictor(sam)

# Load and set image (the image embedding is computed once here)
image = cv2.cvtColor(cv2.imread("path/to/image.jpg"), cv2.COLOR_BGR2RGB)  # (H, W, 3) RGB
predictor.set_image(image)

# Predict with a point prompt
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),  # (N, 2) coordinates
    point_labels=np.array([1]),           # 1=foreground, 0=background
    multimask_output=True,                # return 3 candidate masks
)
Automatic mask generation (Segment Everything):
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")
mask_generator = SamAutomaticMaskGenerator(sam)
masks = mask_generator.generate(image)
# Each mask: segmentation, area, bbox, predicted_iou, stability_score
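Since each generated record carries `area`, `predicted_iou`, and `stability_score`, a common post-processing step is to filter and rank the masks. A sketch with toy records shaped like the generator's output (the thresholds are arbitrary assumptions):

```python
# Toy records standing in for SamAutomaticMaskGenerator output.
masks = [
    {"area": 12000, "predicted_iou": 0.95, "stability_score": 0.97},
    {"area": 150,   "predicted_iou": 0.98, "stability_score": 0.99},
    {"area": 8000,  "predicted_iou": 0.70, "stability_score": 0.90},
]

# Keep large, confident, stable masks, largest first.
kept = sorted(
    (m for m in masks
     if m["area"] >= 500
     and m["predicted_iou"] >= 0.88
     and m["stability_score"] >= 0.92),
    key=lambda m: m["area"],
    reverse=True,
)
print(len(kept))  # 1
```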
3. SAM 2: Segment Anything in Images and Videos (2024)
3.1 Paper Information
- Title: SAM 2: Segment Anything in Images and Videos
- Authors: Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu et al. (15 authors, Meta AI Research)
- Published: August 2024 (arXiv), revised October 2024
- Paper: arxiv.org/abs/2408.00714
- GitHub: github.com/facebookresearch/sam2
- Project page: ai.meta.com/sam2
- License: Apache 2.0
3.2 What Changed from SAM 1
The core question behind SAM 2: "Can we extend image-only segmentation to video?"
| Comparison | SAM 1 | SAM 2 |
|---|---|---|
| Input | Single image | Image + video |
| Image encoder | ViT (MAE pretrained) | Hiera (hierarchical, MAE pretrained) |
| Temporal modeling | None | Memory Attention mechanism |
| Inference speed | ~50ms per prompt | 6x faster on images |
| Occlusion handling | Not supported | Occlusion Head |
| Interaction | Prompts on images | Prompts on any frame in video |
3.3 Architecture Details
┌────────────────────────────────────────────────────────────────────┐
│ SAM 2 Streaming Architecture │
│ │
│ Frame t-2 Frame t-1 Frame t (current) │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────┐ ┌──────┐ ┌──────────────┐ │
│ │Hiera │ │Hiera │ │ Hiera Image │ │
│ │Enc. │ │Enc. │ │ Encoder │ │
│ └──┬───┘ └──┬───┘ └──────┬───────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ Memory Bank │ │
│ │ ┌──────────────┐ ┌───────────────────┐ │ │
│ │ │ FIFO Queue │ │ Prompted Frames │ │ │
│ │ │ (N recent │ │ (M prompted │ │ │
│ │ │ frames) │ │ frames) │ │ │
│ │ └──────────────┘ └───────────────────┘ │ │
│ │ ┌──────────────────────────────────────┐│ │
│ │ │ Object Pointers (256-dim) ││ │
│ │ └──────────────────────────────────────┘│ │
│ └──────────────────┬───────────────────────┘ │
│ ▼ │
│ ┌──────────────────────────────┐ ┌───────────────┐ │
│ │ Memory Attention Module │←──│Prompt Encoder │ │
│ │ (L Transformer Blocks: │ │(Point/Box/ │ │
│ │ Self-Attn + Cross-Attn) │ │ Mask) │ │
│ └──────────────┬───────────────┘ └───────────────┘ │
│ ▼ │
│ ┌──────────────────────────┐ │
│ │ Mask Decoder │──→ Mask + IoU + Occlusion Score │
│ │ (+ Skip Connections) │ │
│ └──────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
Hiera Image Encoder
SAM 1's ViT was replaced with Hiera (hierarchical ViT).
- Stages 3+4 (stride 16, 32) → fused via FPN → image embedding (input to Memory Attention)
- Stages 1+2 (stride 4, 8) → skip connections to the mask decoder's upsampling layers
- Uses windowed absolute positional embedding
Memory Mechanism (Key Innovation)
Memory Encoder: After processing each frame, frame features and predicted masks are stored in the Memory Bank (projected to 64-dim)
Memory Bank structure:
- FIFO Queue: The most recent N non-prompted frames — maintains temporal context
- Prompted Frames: Up to M prompted frames — preserves user guidance
- Object Pointers: 256-dim vectors — high-level semantic information about tracked objects
Memory Attention Module: L Transformer blocks
- Self-Attention: within the current frame features
- Cross-Attention: interaction with past frames + object pointers from the Memory Bank
Occlusion Prediction Head
Handles situations in video where objects become occluded or leave the frame:
- Outputs a visibility score per frame
- Suppresses noisy predictions for occluded objects
- On reappearance → immediate re-identification via the Memory Bank
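The memory structures described above can be sketched as a tiny data structure: a bounded FIFO for recent frames, a capped store for prompted frames, and a running list of per-frame object pointers. This is a conceptual illustration only, not SAM 2's real API; the class and field names are invented.

```python
from collections import deque

class MemoryBank:
    """Minimal sketch of SAM 2's memory bank layout (hypothetical names):
    N most recent non-prompted frames, up to M prompted frames,
    and a 256-d object pointer per processed frame."""

    def __init__(self, n_recent: int = 6, m_prompted: int = 2):
        self.recent = deque(maxlen=n_recent)  # FIFO: evicts oldest automatically
        self.prompted = {}                    # frame_idx -> memory features
        self.m_prompted = m_prompted
        self.object_pointers = []             # one 256-d vector per frame

    def add(self, frame_idx, features, pointer, is_prompted=False):
        if is_prompted:
            if len(self.prompted) < self.m_prompted:
                self.prompted[frame_idx] = features
        else:
            self.recent.append((frame_idx, features))
        self.object_pointers.append(pointer)

bank = MemoryBank(n_recent=3)
for t in range(10):
    bank.add(t, features=f"feat{t}", pointer=[0.0] * 256, is_prompted=(t == 0))
print(len(bank.recent), [i for i, _ in bank.recent])  # 3 [7, 8, 9]
```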
3.4 SA-V Dataset
| Item | Value |
|---|---|
| Videos | 50,900 |
| Total duration | ~196 hours |
| Total frames | ~4.2 million |
| Spatio-temporal masklets | 642,600 |
| Total masks | 35.5 million |
| Resolution | 240p to 4K (average 1,401×1,037) |
| Average length | ~14 seconds |
| Scene distribution | Indoor 54%, Outdoor 46% |
| Geographic diversity | 47 countries |
| Scale vs. existing | 4.5x more video, 53x more annotations |
3-Phase Data Engine:
| Phase | Method | Speed | Output |
|---|---|---|---|
| Phase 1 | SAM per Frame | 37.8 sec/frame | 16K masklets |
| Phase 2 | SAM + SAM 2 Mask | 7.4 sec/frame (5.1x faster) | 63.5K masklets |
| Phase 3 | Full SAM 2 | Fastest | All remaining |
3.5 Model Variants and Performance
SAM 2.1 (released September 2024, latest recommended version):
| Model | Parameters | FPS (A100) | SA-V Test (J&F) | MOSE Val (J&F) |
|---|---|---|---|---|
| sam2.1_hiera_tiny | 38.9M | 91.2 | 76.5 | 71.8 |
| sam2.1_hiera_small | 46.0M | 84.8 | 76.6 | 73.5 |
| sam2.1_hiera_base_plus | 80.8M | 64.1 | 78.2 | 73.7 |
| sam2.1_hiera_large | 224.4M | 39.5 | 79.5 | 74.6 |
Key benchmarks:
| Benchmark | SAM 2 (J&F) | vs. Previous SOTA |
|---|---|---|
| MOSE val | 77.9% | +6.2% |
| DAVIS 2017 val | 90.7% | +2.6% |
- Image segmentation: 6x faster than SAM 1 while also improving accuracy
- Interaction efficiency: 3x fewer user interactions to achieve the same quality
3.6 Installation and Usage
# Installation
git clone https://github.com/facebookresearch/sam2.git && cd sam2
pip install -e .
# Download checkpoints
cd checkpoints && ./download_ckpts.sh && cd ..
Image segmentation:
import numpy as np
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)  # numpy array (H, W, 3) RGB
    masks, _, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
    )
Video segmentation:
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(video_path)

    # Add a prompt to a specific frame
    frame_idx, object_ids, masks = predictor.add_new_points_or_box(
        state, frame_idx=0, obj_id=1,
        points=np.array([[500, 375]]),
        labels=np.array([1]),
    )

    # Propagate through the entire video
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        process_masks(frame_idx, masks)  # process masks for each frame
Hugging Face integration:
from sam2.sam2_image_predictor import SAM2ImagePredictor
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")
from sam2.sam2_video_predictor import SAM2VideoPredictor
predictor = SAM2VideoPredictor.from_pretrained("facebook/sam2-hiera-large")
4. SAM 3: Segment Anything with Concepts (2025)
4.1 Paper Information
- Title: SAM 3: Segment Anything with Concepts
- Authors: Nicolas Carion, Laura Gustafson, Yuan-Ting Hu et al. (35 authors, Meta AI Research)
- Published: November 20, 2025 (arXiv), accepted at ICLR 2026
- Paper: arxiv.org/abs/2511.16719
- GitHub: github.com/facebookresearch/sam3
- Project page: ai.meta.com/sam3
- License: Apache 2.0
4.2 Paradigm Shift: From Where to What
SAM 1 and SAM 2 relied on geometric prompts (points, boxes) that specify "where" to segment. SAM 3 asks a fundamentally different question: "What" to segment?
SAM 1: Point/Box → "Isolate the object at this location"
SAM 2: Point/Box + Video Tracking → "Track this object through the video"
SAM 3: Text/Image Exemplar → "Find everything matching this concept"
Promptable Concept Segmentation (PCS) — the new task proposed by SAM 3:
- Text prompts: Natural language noun phrases (e.g., "yellow school bus")
- Image exemplars: Visual example images
- Combined prompts: Text + exemplar used simultaneously
- Legacy prompts: Points, boxes, masks (backward compatible with SAM 2)
4.3 Architecture Details
Total parameters: 848M (~3.4 GB)
┌───────────────────────────────────────────────────────────────────────┐
│ SAM 3 Architecture │
│ │
│ ┌────────────┐ ┌────────────┐ ┌──────────────────┐ │
│ │ Text │ │ Exemplar │ │ Visual Prompts │ │
│ │ Encoder │ │ Encoder │ │ (Point/Box/Mask) │ │
│ │ "school │ │ [image] │ │ │ │
│ │ bus" │ │ │ │ │ │
│ └─────┬──────┘ └─────┬──────┘ └────────┬─────────┘ │
│ └────────────┬───┘ │ │
│ ▼ │ │
│ ┌──────────────────────────────┐ │ │
│ │ Meta Perception Encoder │ │ │
│ │ (Vision-Language joint │ │ │
│ │ embedding space) │ │ │
│ └──────────────┬───────────────┘ │ │
│ │ │ │
│ ┌────────┴────────┐ │ │
│ ▼ ▼ │ │
│ ┌───────────┐ ┌─────────────────┐ │ │
│ │ Presence │ │ Fusion Encoder │ │ │
│ │ Head │ │ (Conditional │ │ │
│ │ "yes/no" │ │ features) │ │ │
│ └───────────┘ └────────┬────────┘ │ │
│ │ │ │
│ ▼ │ │
│ ┌──────────────────┐ │ │
│ │ DETR Detector │←────────┘ │
│ │ (Transformer- │ │
│ │ based detect.) │ │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ SAM 2-inspired Tracker │ │
│ │ (Memory Bank + temporal │ │
│ │ consistency) │ │
│ └──────────────────┬───────────────────┘ │
│ ▼ │
│ Masks + Boxes + Scores │
└───────────────────────────────────────────────────────────────────────┘
Key Innovation: Presence Head
The Presence Head is SAM 3's most important architectural innovation. It decouples recognition from localization.
Previous approach: Detector simultaneously "finds" and "judges existence" → optimization conflict
SAM 3: Presence Head first judges "does it exist?" → Detector focuses solely on "finding"
Impact:
| Configuration | CGF1 |
|---|---|
| Without Presence Head | 57.6 |
| With Presence Head | 63.3 (+5.7, +9.9%) |
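The decoupling can be made concrete with a toy scoring sketch: a single global presence probability gates the per-query localization scores, so the detector never has to suppress its own confident localizations when the concept is absent. The multiplicative combination below is an illustrative assumption, not the paper's exact formulation.

```python
def combined_scores(presence_prob: float, localization_scores: list) -> list:
    """Gate per-detection localization scores by a global
    'is the concept present at all?' probability (toy formulation)."""
    return [presence_prob * s for s in localization_scores]

# Concept absent: presence gates down even confident-looking detections.
print(combined_scores(0.05, [0.9, 0.8]))
# Concept present: localization scores pass through mostly intact.
print(combined_scores(0.98, [0.9, 0.8]))
```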
Decoupled Detector-Tracker Design
Unlike SAM 2's monolithic architecture, SAM 3 separates detection and tracking into independent modules:
- Shared: backbone (Perception Encoder)
- Separate: decoder heads (detection vs. tracking)
- Advantage: reduced inter-task interference, independent scaling
Training Pipeline (4 Stages)
| Stage | Description |
|---|---|
| 1 | Perception Encoder pretraining |
| 2 | Detector pretraining |
| 3 | Detector fine-tuning with SA-Co data |
| 4 | Backbone frozen, then Tracker training |
4.4 SA-Co Dataset
SAM 3 was trained on SA-Co (Segment Anything with Concepts), the largest and most diverse segmentation dataset ever.
Training Data
| Subset | Contents |
|---|---|
| SA-Co/HQ | 5.2M high-quality images, 4M unique noun phrases, 52M masks |
| SA-Co/SYN | 38M synthetic phrases, 1.4 billion masks (large-scale pretraining) |
| SA-Co/VIDEO | 52,500 videos, 24,800 unique concepts, 467,000+ masklets |
Total ontology: 22 million entities (17 top-level categories, 72 subcategories)
Evaluation Benchmarks
| Benchmark | Description |
|---|---|
| SA-Co/Gold | 7 multi-review image subsets (highest quality) |
| SA-Co/Silver | 10 domain-specific image subsets |
| SA-Co/VEval | 3 video evaluation subsets |
- 214,000 unique phrases, 126,000 images/videos — over 50x more diverse concepts than existing COCO/LVIS
Additional Public Datasets
- SA-FARI: 10,000+ wildlife camera trap videos (100+ species annotated)
- FathomNet: Underwater segmentation benchmark for marine imagery
4.5 Performance Benchmarks
Image Segmentation (Zero-Shot)
| Benchmark | Metric | SAM 3 | Previous Best | Improvement |
|---|---|---|---|---|
| LVIS | Mask AP | 47.0 | 38.5 (T-Rex2) | +22.1% |
| COCO | Box AP | 53.5 | 52.2 | +2.5% |
| SA-Co/Gold | CGF1 | 65.0 | 34.3 | +89.5% (~2x) |
| ODinW | AP | - | - | +20.1 AP over T-Rex2 |
Video Segmentation
| Benchmark | SAM 3 (J&F) | SAM 2.1 Large | Improvement |
|---|---|---|---|
| MOSEv2 | 60.1 | 47.9 | +25.5% |
| DAVIS 2017 | 92.0 | 90.7 | +1.4% |
| LVOSv2 | 88.2 | 79.6 | +10.8% |
Exemplar Utilization Effect
| Prompt Configuration | CGF1 | Improvement over Text Only |
|---|---|---|
| Text only | 46.4 | - |
| Text + 1 exemplar | 57.6 | +11.2 |
| Text + 3 exemplars | 65.0 | +18.6 |
Speed and Efficiency
- Inference speed: ~30ms per image on H200 GPU (including 100+ object detection)
- Model size: 848M parameters (~3.4 GB)
- Runnable on 16 GB GPUs
- vs. human performance: achieves 88% of the human lower bound on SA-Co/Gold
4.6 Installation and Usage
Official Repository
# Create environment
conda create -n sam3 python=3.12
conda activate sam3
# Install PyTorch
pip install torch==2.7.0 torchvision torchaudio \
--index-url https://download.pytorch.org/whl/cu126
# Install SAM 3
git clone https://github.com/facebookresearch/sam3.git
cd sam3
pip install -e .
# Optional: for notebooks or development
pip install -e ".[notebooks]"
pip install -e ".[train,dev]"
Image segmentation (text prompt):
from PIL import Image
from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor
model = build_sam3_image_model()
processor = Sam3Processor(model)
image = Image.open("path/to/image.jpg")
# Concept segmentation with text prompt
inference_state = processor.set_image(image)
output = processor.set_text_prompt(
    state=inference_state,
    prompt="yellow school bus",
)
masks, boxes, scores = output["masks"], output["boxes"], output["scores"]
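Because concept prompts return every matching instance, a typical follow-up is to threshold by confidence. A sketch with toy arrays standing in for `output["scores"]` and `output["boxes"]` (the 0.5 threshold is an arbitrary assumption):

```python
import numpy as np

# Toy stand-ins for per-instance outputs of a concept query.
scores = np.array([0.92, 0.41, 0.77, 0.12])
boxes = np.arange(16, dtype=float).reshape(4, 4)  # one xyxy box per instance

# Keep only instances above a confidence threshold.
keep = scores >= 0.5
print(keep.sum(), boxes[keep].shape)  # 2 (2, 4)
```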
Video tracking:
from sam3.model_builder import build_sam3_video_predictor
video_predictor = build_sam3_video_predictor()
response = video_predictor.handle_request(
    request=dict(
        type="start_session",
        resource_path="path/to/video.mp4",
    )
)
Ultralytics Integration
pip install -U ultralytics
from ultralytics.models.sam import SAM3SemanticPredictor
overrides = dict(
    conf=0.25, task="segment", mode="predict",
    model="sam3.pt", half=True,
)
predictor = SAM3SemanticPredictor(overrides=overrides)
predictor.set_image("path/to/image.jpg")
# Segment multiple concepts simultaneously
results = predictor(text=["person", "bus", "glasses"])
5. Comprehensive Comparison of All Three Models
5.1 Architecture Comparison
| Item | SAM 1 | SAM 2 | SAM 3 |
|---|---|---|---|
| Image encoder | ViT (MAE) | Hiera (MAE) | Meta Perception Encoder |
| Parameters | 636M (ViT-H) | 224.4M (Hiera-L) | 848M |
| Prompts | Point, box, mask, text | Point, box, mask | Text, exemplar, point, box, mask |
| Temporal model | None | Memory Attention | Memory + Tracker |
| Detector | None | None | DETR-based |
| Occlusion | Not supported | Occlusion Head | Occlusion + Presence Head |
| Output | Mask + IoU | Mask + IoU + Occlusion | Mask + box + score + concept |
5.2 Dataset Comparison
| Item | SA-1B (SAM 1) | SA-V (SAM 2) | SA-Co (SAM 3) |
|---|---|---|---|
| Images | 11M | - | 5.2M (HQ) |
| Videos | - | 50,900 | 52,500 |
| Masks | 1.1B | 35.5M | 1.4B+ (incl. SYN) |
| Concepts/Labels | None (class-agnostic) | None | 22M entities |
| Unique phrases | - | - | 4M |
5.3 Performance Evolution
| Benchmark | SAM 1 | SAM 2/2.1 | SAM 3 |
|---|---|---|---|
| DAVIS 2017 (J&F) | N/A | 90.7 | 92.0 |
| MOSE (J&F) | N/A | 74.6 (MOSE val) | 60.1 (MOSEv2, a harder successor benchmark; not directly comparable) |
| Image speed | Baseline | 6x faster | ~30ms/image |
| Zero-shot ability | Geometric | Geometric + temporal | Concept-level |
6. Practical Use Cases
6.1 When SAM 1 Is the Right Choice
- When fast segmentation on single images is needed
- Building interactive annotation tools
- When simple integration into existing pipelines is needed (most mature ecosystem)
6.2 When SAM 2 Is the Right Choice
- When video object tracking is needed
- Real-time video stream analysis
- Scenarios where occlusion handling is critical
- When faster speeds than SAM 1 are needed for images as well
6.3 When SAM 3 Is the Right Choice
- When segmenting via natural language search ("find all the cars")
- Open-vocabulary object detection + segmentation
- Visual exemplar-based few-shot segmentation
- Large-scale automatic labeling of images/videos
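The selection criteria in 6.1 to 6.3 can be condensed into a rule-of-thumb picker. This is a sketch distilled from the guidance above, not an official recommendation:

```python
def pick_sam(needs_video: bool, needs_text_prompts: bool) -> str:
    """Rule-of-thumb model choice from the use cases above."""
    if needs_text_prompts:
        return "SAM 3"  # open-vocabulary / concept prompts
    if needs_video:
        return "SAM 2"  # tracking, occlusion handling, streaming memory
    return "SAM 1"      # mature single-image interactive segmentation

print(pick_sam(needs_video=False, needs_text_prompts=False))  # SAM 1
```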
6.4 Real-World Deployment Examples (SAM 3)
- Instagram Edits app: Applying dynamic effects to specific people/objects in videos
- Facebook Marketplace "View in Room": AR furniture placement (SAM 3 + SAM 3D)
- Wildlife conservation: Camera trap monitoring based on the SA-FARI dataset
7. Study Roadmap
A step-by-step learning guide for deeply understanding and utilizing the SAM series.
7.1 Prerequisites
| Order | Topic | Recommended Reading |
|---|---|---|
| 1 | Vision Transformer (ViT) | An Image is Worth 16x16 Words |
| 2 | MAE (Masked Autoencoder) | Masked Autoencoders Are Scalable Vision Learners |
| 3 | DETR | End-to-End Object Detection with Transformers |
| 4 | CLIP | Learning Transferable Visual Models |
| 5 | Hiera | Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles |
7.2 Recommended Paper Reading Order
1. SAM 1 paper (2023) → Establish foundational concepts
↓
2. SAM 2 paper (2024) → Understand video extension
↓
3. SAM 3 paper (2025) → Understand concept awareness
↓
4. Hands-on notebooks from each GitHub repository
7.3 Recommended Hands-On Exercises
| Step | Exercise | Difficulty |
|---|---|---|
| 1 | SAM 1 automatic mask generation (Colab notebook) | Easy |
| 2 | SAM 2 video tracking walkthrough | Intermediate |
| 3 | SAM 3 text-prompt segmentation | Intermediate |
| 4 | SAM + Grounding DINO pipeline construction | Hard |
| 5 | Custom dataset fine-tuning | Hard |
| 6 | ONNX/TensorRT conversion and edge deployment | Advanced |
7.4 Related Follow-Up Research
| Model | Characteristics | Paper |
|---|---|---|
| EfficientSAM | Lightweight SAM (SAMI distill.) | arxiv.org/abs/2312.00863 |
| FastSAM | CNN-based real-time SAM | arxiv.org/abs/2306.12156 |
| MobileSAM | Mobile-optimized SAM | arxiv.org/abs/2306.14289 |
| Grounded SAM | Grounding DINO + SAM | github.com/IDEA-Research/Grounded-Segment-Anything |
| SAM-HQ | High-quality mask SAM | arxiv.org/abs/2306.01567 |
| SAM 3D | 3D object/body reconstruction (Meta) | ai.meta.com/sam3 |
8. References
Core Papers
- Kirillov, A., Mintun, E., Ravi, N., et al. (2023). "Segment Anything". ICCV 2023. arxiv.org/abs/2304.02643
- Ravi, N., Gabeur, V., Hu, Y.-T., et al. (2024). "SAM 2: Segment Anything in Images and Videos". arxiv.org/abs/2408.00714
- Carion, N., Gustafson, L., Hu, Y.-T., et al. (2025). "SAM 3: Segment Anything with Concepts". ICLR 2026. arxiv.org/abs/2511.16719
Prerequisite Papers
- Dosovitskiy, A., et al. (2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". ICLR 2021. arxiv.org/abs/2010.11929
- He, K., et al. (2022). "Masked Autoencoders Are Scalable Vision Learners". CVPR 2022. arxiv.org/abs/2111.06377
- Carion, N., et al. (2020). "End-to-End Object Detection with Transformers (DETR)". ECCV 2020. arxiv.org/abs/2005.12872
- Radford, A., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision (CLIP)". ICML 2021. arxiv.org/abs/2103.00020
- Ryali, C., et al. (2023). "Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles". ICML 2023. arxiv.org/abs/2306.00989
GitHub Repositories
- github.com/facebookresearch/segment-anything — SAM 1
- github.com/facebookresearch/sam2 — SAM 2
- github.com/facebookresearch/sam3 — SAM 3
Tutorials and Guides
- Ultralytics SAM Documentation — docs.ultralytics.com/models/sam
- Ultralytics SAM 2 Documentation — docs.ultralytics.com/models/sam-2
- Ultralytics SAM 3 Documentation — docs.ultralytics.com/models/sam-3
- Encord Blog: Segment Anything Model Explained — encord.com/blog/segment-anything-model-explained
- Roboflow: What is SAM 3 — blog.roboflow.com/what-is-sam3
- Meta AI Blog: Introducing SAM 3 — ai.meta.com/blog/segment-anything-model-3
Project Pages and Datasets
- Meta AI SAM 2 Project — ai.meta.com/sam2
- Meta AI SAM 3 Project — ai.meta.com/sam3
- SA-1B Dataset — ai.meta.com/datasets/segment-anything
- SA-V Dataset — ai.meta.com/datasets/segment-anything-video