- 1. Overview: The Evolution of the Segment Anything Model
- 2. SAM 1: Segment Anything (2023)
- 3. SAM 2: Segment Anything in Images and Videos (2024)
- 4. SAM 3: Segment Anything with Concepts (2025)
- 5. Comprehensive Comparison of All Three Models
- 6. Practical Use Cases
- 7. Study Roadmap
- 8. References
1. Overview: The Evolution of the Segment Anything Model
Segment Anything Model (SAM) is a series of foundation models for image and video segmentation published by Meta AI Research. Just as GPT established the "prompt" paradigm in NLP, SAM introduced the paradigm of Promptable Segmentation to computer vision.
| Version | Released | Core Capability | Paper |
|---|---|---|---|
| SAM 1 | 2023.04 (ICCV 2023) | Image promptable segmentation | Segment Anything |
| SAM 2 | 2024.08 | Image + real-time video segmentation | SAM 2: Segment Anything in Images and Videos |
| SAM 3 | 2025.11 (ICLR 2026) | Concept-aware segmentation | SAM 3: Segment Anything with Concepts |
The core evolutionary direction of the three models can be summarized in one line:
SAM 1: "Where to segment?" → SAM 2: "Where — even in video?" → SAM 3: "What to segment?"
2. SAM 1: Segment Anything (2023)
2.1 Paper Information
- Title: Segment Anything
- Authors: Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao et al. (Meta AI Research)
- Published: April 2023 (arXiv), accepted at ICCV 2023
- Paper: arxiv.org/abs/2304.02643
- GitHub: github.com/facebookresearch/segment-anything
- License: Apache 2.0
2.2 Three Key Contributions
SAM 1 simultaneously introduced three things:
- A new task — Promptable Segmentation: Given any prompt (point, box, mask, or text), the model returns a valid segmentation mask
- A new model — SAM: A foundation model for prompt-based segmentation
- A new dataset — SA-1B: The largest segmentation dataset ever, containing 1.1 billion masks from 11 million images
2.3 Architecture Details
SAM's architecture is decomposed into three components. The core design principle is to run the heavy image encoder only once and repeatedly call the lightweight prompt encoder and mask decoder in real time.
┌─────────────────────────────────────────────────────────┐
│ SAM Architecture │
│ │
│ ┌──────────────┐ ┌────────────────┐ │
│ │ Image Encoder │ │ Prompt Encoder │ │
│ │ (ViT-H/L/B) │ │ (Point/Box/ │ │
│ │ MAE Pretrain │ │ Mask/Text) │ │
│ └──────┬───────┘ └───────┬────────┘ │
│ │ │ │
│ └─────────┬─────────┘ │
│ ▼ │
│ ┌────────────────┐ │
│ │ Mask Decoder │ │
│ │ (Transformer │ │
│ │ 2-layer) │──→ 3 masks + IoU scores │
│ └────────────────┘ │
└─────────────────────────────────────────────────────────┘
Image Encoder
| Variant | Parameters | Checkpoint Size |
|---|---|---|
| ViT-B | 91M | ~375 MB |
| ViT-L | 308M | ~1.25 GB |
| ViT-H (default) | 636M | ~2.56 GB |
- Vision Transformer pretrained with MAE (Masked Autoencoder)
- Input resolution: 1024×1024
- Output: 64×64 image embedding (256-dim)
- Runs only once per image — reused for each prompt afterwards
Prompt Encoder
Sparse prompts (points, boxes, text):
- Points/boxes → Positional Encoding + learned type embeddings (foreground point vs. background point)
- Text → processed via CLIP text encoder
Dense prompts (masks):
- Embedded via convolution layers and element-wise summed with the image embedding
Mask Decoder
- Modified Transformer Decoder (2 layers)
- Embedding dimension: 256, MLP inner dimension: 2048
- Ambiguity-aware output: predicts 3 candidate masks simultaneously for a single prompt
- Whole, Part, and Sub-part levels
- Each mask is assigned an IoU confidence score
- ~50 ms inference per prompt (light enough to run on CPU in a web browser) — enabling real-time interaction
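The ambiguity-aware design above can be illustrated with a small sketch: the decoder returns three candidate masks plus predicted IoU scores, and a common pattern is simply to keep the highest-scoring candidate. The arrays below are toy stand-ins for the predictor's actual output, not real model results.

```python
import numpy as np

# Toy stand-ins for the decoder's three candidates (whole / part / sub-part)
# and their predicted IoU confidence scores.
masks = np.zeros((3, 256, 256), dtype=bool)
masks[0, 50:200, 50:200] = True   # "whole" candidate
masks[1, 50:120, 50:120] = True   # "part" candidate
masks[2, 50:80, 50:80] = True     # "sub-part" candidate
scores = np.array([0.91, 0.72, 0.55])

# Keep the candidate with the highest predicted IoU.
best_idx = int(np.argmax(scores))
best_mask = masks[best_idx]
print(best_idx, best_mask.sum())  # 0 22500
```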
2.4 SA-1B Dataset
| Item | Value |
|---|---|
| Images | 11 million |
| Masks | 1.1 billion (~100 per image on average) |
| Auto-generated | 99.1% |
| Original resolution | ~3300×4950 |
| Dataset size | ~5 TB (images) + ~20 GB (annotations) |
Data Engine — 3-Phase Construction Process
| Phase | Method | Output |
|---|---|---|
| 1. Assisted-Manual | SAM assists human annotators via a browser-based tool | 120K images / 4.3M masks |
| 2. Semi-Automatic | SAM auto-suggests masks; humans add missed objects | 180K images / 5.9M masks |
| 3. Fully Automatic | 32×32 grid of auto-generated point prompts; NMS + confidence filtering | Majority of the 1.1B masks |
Quality verification: Human audit of 500 images (~50,000 masks) showed 94% achieving IoU > 0.90
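The fully automatic phase prompts the model with a regular grid of points. A minimal sketch of generating such a grid (cell-center spacing is my assumption; the real pipeline also does cropping and filtering):

```python
import numpy as np

def point_grid(n_per_side: int, height: int, width: int) -> np.ndarray:
    """Evenly spaced prompt points, one per grid-cell center,
    returned as pixel (x, y) coordinates."""
    offset = 1.0 / (2 * n_per_side)
    coords = np.linspace(offset, 1 - offset, n_per_side)
    xs, ys = np.meshgrid(coords, coords)
    grid = np.stack([xs.ravel(), ys.ravel()], axis=-1)  # normalized (x, y)
    return grid * np.array([width, height])             # scale to pixels

pts = point_grid(32, 1024, 1024)
print(pts.shape)  # (1024, 2)
```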
2.5 Key Innovations
Zero-Shot Transfer: Instantly applicable to new domains without fine-tuning
- Edge detection (BSDS500 ODS: 0.768)
- Object proposal generation (LVIS AR@1000: 59.3)
- Medical imaging, satellite imagery, underwater photography, microscopy images, etc.
Benchmark performance:
- Zero-shot evaluation across 23 datasets: outperformed existing SOTA (RITM) on 16
- With oracle selection: outperformed RITM on all 23
- No performance disparity across demographic groups such as race/gender (fairness validation)
2.6 Installation and Usage
# Installation
pip install git+https://github.com/facebookresearch/segment-anything.git
# Optional dependencies
pip install opencv-python pycocotools matplotlib onnxruntime onnx
Checkpoint download:
| Model | File |
|---|---|
| ViT-H (default) | sam_vit_h_4b8939.pth |
| ViT-L | sam_vit_l_0b3195.pth |
| ViT-B | sam_vit_b_01ec64.pth |
Prompt-based segmentation:
from segment_anything import SamPredictor, sam_model_registry
import cv2
import numpy as np

# Load model
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")

# Create predictor
predictor = SamPredictor(sam)

# Load and set image (the image embedding is computed once here)
image = cv2.cvtColor(cv2.imread("path/to/image.jpg"), cv2.COLOR_BGR2RGB)  # (H, W, 3) RGB
predictor.set_image(image)

# Predict with a point prompt
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),  # (N, 2) coordinates
    point_labels=np.array([1]),           # 1=foreground, 0=background
    multimask_output=True,                # return 3 candidate masks
)
Automatic mask generation (Segment Everything):
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")
mask_generator = SamAutomaticMaskGenerator(sam)
masks = mask_generator.generate(image)
# Each mask: segmentation, area, bbox, predicted_iou, stability_score
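Since each generated record carries `area`, `predicted_iou`, and `stability_score`, a common post-processing step is to filter and rank the masks. A sketch with toy records shaped like the generator's output (the thresholds are arbitrary assumptions):

```python
# Toy records standing in for SamAutomaticMaskGenerator output.
masks = [
    {"area": 12000, "predicted_iou": 0.95, "stability_score": 0.97},
    {"area": 150,   "predicted_iou": 0.98, "stability_score": 0.99},
    {"area": 8000,  "predicted_iou": 0.70, "stability_score": 0.90},
]

# Keep large, confident, stable masks, largest first.
kept = sorted(
    (m for m in masks
     if m["area"] >= 500
     and m["predicted_iou"] >= 0.88
     and m["stability_score"] >= 0.92),
    key=lambda m: m["area"],
    reverse=True,
)
print(len(kept))  # 1
```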
3. SAM 2: Segment Anything in Images and Videos (2024)
3.1 Paper Information
- Title: SAM 2: Segment Anything in Images and Videos
- Authors: Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu et al. (15 authors, Meta AI Research)
- Published: August 2024 (arXiv), revised October 2024
- Paper: arxiv.org/abs/2408.00714
- GitHub: github.com/facebookresearch/sam2
- Project page: ai.meta.com/sam2
- License: Apache 2.0
3.2 What Changed from SAM 1
The core question behind SAM 2: "Can we extend image-only segmentation to video?"
| Comparison | SAM 1 | SAM 2 |
|---|---|---|
| Input | Single image | Image + video |
| Image encoder | ViT (MAE pretrained) | Hiera (hierarchical, MAE pretrained) |
| Temporal modeling | None | Memory Attention mechanism |
| Inference speed | ~50ms per prompt | 6x faster on images |
| Occlusion handling | Not supported | Occlusion Head |
| Interaction | Prompts on images | Prompts on any frame in video |
3.3 Architecture Details
┌────────────────────────────────────────────────────────────────────┐
│ SAM 2 Streaming Architecture │
│ │
│ Frame t-2 Frame t-1 Frame t (current) │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────┐ ┌──────┐ ┌──────────────┐ │
│ │Hiera │ │Hiera │ │ Hiera Image │ │
│ │Enc. │ │Enc. │ │ Encoder │ │
│ └──┬───┘ └──┬───┘ └──────┬───────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ Memory Bank │ │
│ │ ┌──────────────┐ ┌───────────────────┐ │ │
│ │ │ FIFO Queue │ │ Prompted Frames │ │ │
│ │ │ (N recent │ │ (M prompted │ │ │
│ │ │ frames) │ │ frames) │ │ │
│ │ └──────────────┘ └───────────────────┘ │ │
│ │ ┌──────────────────────────────────────┐│ │
│ │ │ Object Pointers (256-dim) ││ │
│ │ └──────────────────────────────────────┘│ │
│ └──────────────────┬───────────────────────┘ │
│ ▼ │
│ ┌──────────────────────────────┐ ┌───────────────┐ │
│ │ Memory Attention Module │←──│Prompt Encoder │ │
│ │ (L Transformer Blocks: │ │(Point/Box/ │ │
│ │ Self-Attn + Cross-Attn) │ │ Mask) │ │
│ └──────────────┬───────────────┘ └───────────────┘ │
│ ▼ │
│ ┌──────────────────────────┐ │
│ │ Mask Decoder │──→ Mask + IoU + Occlusion Score │
│ │ (+ Skip Connections) │ │
│ └──────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
Hiera Image Encoder
SAM 1's ViT was replaced with Hiera (hierarchical ViT).
- Stages 3+4 (stride 16, 32) → fused via FPN → image embedding (input to Memory Attention)
- Stages 1+2 (stride 4, 8) → skip connections to the mask decoder's upsampling layers
- Uses windowed absolute positional embedding
Memory Mechanism (Key Innovation)
Memory Encoder: After processing each frame, frame features and predicted masks are stored in the Memory Bank (projected to 64-dim)
Memory Bank structure:
- FIFO Queue: The most recent N non-prompted frames — maintains temporal context
- Prompted Frames: Up to M prompted frames — preserves user guidance
- Object Pointers: 256-dim vectors — high-level semantic information about tracked objects
Memory Attention Module: L Transformer blocks
- Self-Attention: within the current frame features
- Cross-Attention: interaction with past frames + object pointers from the Memory Bank
Occlusion Prediction Head
Handles situations in video where objects become occluded or leave the frame:
- Outputs a visibility score per frame
- Suppresses noisy predictions for occluded objects
- On reappearance → immediate re-identification via the Memory Bank
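The memory structures described above can be sketched as a tiny data structure: a bounded FIFO for recent frames, a capped store for prompted frames, and a running list of per-frame object pointers. This is a conceptual illustration only, not SAM 2's real API; the class and field names are invented.

```python
from collections import deque

class MemoryBank:
    """Minimal sketch of SAM 2's memory bank layout (hypothetical names):
    N most recent non-prompted frames, up to M prompted frames,
    and a 256-d object pointer per processed frame."""

    def __init__(self, n_recent: int = 6, m_prompted: int = 2):
        self.recent = deque(maxlen=n_recent)  # FIFO: evicts oldest automatically
        self.prompted = {}                    # frame_idx -> memory features
        self.m_prompted = m_prompted
        self.object_pointers = []             # one 256-d vector per frame

    def add(self, frame_idx, features, pointer, is_prompted=False):
        if is_prompted:
            if len(self.prompted) < self.m_prompted:
                self.prompted[frame_idx] = features
        else:
            self.recent.append((frame_idx, features))
        self.object_pointers.append(pointer)

bank = MemoryBank(n_recent=3)
for t in range(10):
    bank.add(t, features=f"feat{t}", pointer=[0.0] * 256, is_prompted=(t == 0))
print(len(bank.recent), [i for i, _ in bank.recent])  # 3 [7, 8, 9]
```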
3.4 SA-V Dataset
| Item | Value |
|---|---|
| Videos | 50,900 |
| Total duration | ~196 hours |
| Total frames | ~4.2 million |
| Spatio-temporal masklets | 642,600 |
| Total masks | 35.5 million |
| Resolution | 240p to 4K (average 1,401×1,037) |
| Average length | ~14 seconds |
| Scene distribution | Indoor 54%, Outdoor 46% |
| Geographic diversity | 47 countries |
| Scale vs. existing | 4.5x more video, 53x more annotations |
3-Phase Data Engine:
| Phase | Method | Speed | Output |
|---|---|---|---|
| Phase 1 | SAM per Frame | 37.8 sec/frame | 16K masklets |
| Phase 2 | SAM + SAM 2 Mask | 7.4 sec/frame (5.1x faster) | 63.5K masklets |
| Phase 3 | Full SAM 2 | Fastest | All remaining |
3.5 Model Variants and Performance
SAM 2.1 (released September 2024, latest recommended version):
| Model | Parameters | FPS (A100) | SA-V Test (J&F) | MOSE Val (J&F) |
|---|---|---|---|---|
| sam2.1_hiera_tiny | 38.9M | 91.2 | 76.5 | 71.8 |
| sam2.1_hiera_small | 46.0M | 84.8 | 76.6 | 73.5 |
| sam2.1_hiera_base_plus | 80.8M | 64.1 | 78.2 | 73.7 |
| sam2.1_hiera_large | 224.4M | 39.5 | 79.5 | 74.6 |
Key benchmarks:
| Benchmark | SAM 2 (J&F) | vs. Previous SOTA |
|---|---|---|
| MOSE val | 77.9% | +6.2% |
| DAVIS 2017 val | 90.7% | +2.6% |
- Image segmentation: 6x faster than SAM 1 while also improving accuracy
- Interaction efficiency: 3x fewer user interactions to achieve the same quality
3.6 Installation and Usage
# Installation
git clone https://github.com/facebookresearch/sam2.git && cd sam2
pip install -e .
# Download checkpoints
cd checkpoints && ./download_ckpts.sh && cd ..
Image segmentation:
import numpy as np
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)  # numpy array (H, W, 3) RGB
    masks, _, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
    )
Video segmentation:
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(video_path)

    # Add a prompt to a specific frame
    frame_idx, object_ids, masks = predictor.add_new_points_or_box(
        state, frame_idx=0, obj_id=1,
        points=np.array([[500, 375]]),
        labels=np.array([1]),
    )

    # Propagate through the entire video
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        process_masks(frame_idx, masks)  # process masks for each frame
Hugging Face integration:
from sam2.sam2_image_predictor import SAM2ImagePredictor
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")
from sam2.sam2_video_predictor import SAM2VideoPredictor
predictor = SAM2VideoPredictor.from_pretrained("facebook/sam2-hiera-large")
4. SAM 3: Segment Anything with Concepts (2025)
4.1 Paper Information
- Title: SAM 3: Segment Anything with Concepts
- Authors: Nicolas Carion, Laura Gustafson, Yuan-Ting Hu et al. (35 authors, Meta AI Research)
- Published: November 20, 2025 (arXiv), accepted at ICLR 2026
- Paper: arxiv.org/abs/2511.16719
- GitHub: github.com/facebookresearch/sam3
- Project page: ai.meta.com/sam3
- License: Apache 2.0
4.2 Paradigm Shift: From Where to What
SAM 1 and SAM 2 relied on geometric prompts (points, boxes) that specify "where" to segment. SAM 3 asks a fundamentally different question: "What" to segment?
SAM 1: Point/Box → "Isolate the object at this location"
SAM 2: Point/Box + Video Tracking → "Track this object through the video"
SAM 3: Text/Image Exemplar → "Find everything matching this concept"
Promptable Concept Segmentation (PCS) — the new task proposed by SAM 3:
- Text prompts: Natural language noun phrases (e.g., "yellow school bus")
- Image exemplars: Visual example images
- Combined prompts: Text + exemplar used simultaneously
- Legacy prompts: Points, boxes, masks (backward compatible with SAM 2)
4.3 Architecture Details
Total parameters: 848M (~3.4 GB)
┌───────────────────────────────────────────────────────────────────────┐
│ SAM 3 Architecture │
│ │
│ ┌────────────┐ ┌────────────┐ ┌──────────────────┐ │
│ │ Text │ │ Exemplar │ │ Visual Prompts │ │
│ │ Encoder │ │ Encoder │ │ (Point/Box/Mask) │ │
│ │ "school │ │ [image] │ │ │ │
│ │ bus" │ │ │ │ │ │
│ └─────┬──────┘ └─────┬──────┘ └────────┬─────────┘ │
│ └────────────┬───┘ │ │
│ ▼ │ │
│ ┌──────────────────────────────┐ │ │
│ │ Meta Perception Encoder │ │ │
│ │ (Vision-Language joint │ │ │
│ │ embedding space) │ │ │
│ └──────────────┬───────────────┘ │ │
│ │ │ │
│ ┌────────┴────────┐ │ │
│ ▼ ▼ │ │
│ ┌───────────┐ ┌─────────────────┐ │ │
│ │ Presence │ │ Fusion Encoder │ │ │
│ │ Head │ │ (Conditional │ │ │
│ │ "yes/no" │ │ features) │ │ │
│ └───────────┘ └────────┬────────┘ │ │
│ │ │ │
│ ▼ │ │
│ ┌──────────────────┐ │ │
│ │ DETR Detector │←────────┘ │
│ │ (Transformer- │ │
│ │ based detect.) │ │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ SAM 2-inspired Tracker │ │
│ │ (Memory Bank + temporal │ │
│ │ consistency) │ │
│ └──────────────────┬───────────────────┘ │
│ ▼ │
│ Masks + Boxes + Scores │
└───────────────────────────────────────────────────────────────────────┘
Key Innovation: Presence Head
The Presence Head is SAM 3's most important architectural innovation. It decouples recognition from localization.
Previous approach: Detector simultaneously "finds" and "judges existence" → optimization conflict
SAM 3: Presence Head first judges "does it exist?" → Detector focuses solely on "finding"
Impact:
| Configuration | CGF1 |
|---|---|
| Without Presence Head | 57.6 |
| With Presence Head | 63.3 (+5.7, +9.9%) |
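The decoupling can be made concrete with a toy scoring sketch: a single global presence probability gates the per-query localization scores, so the detector never has to suppress its own confident localizations when the concept is absent. The multiplicative combination below is an illustrative assumption, not the paper's exact formulation.

```python
def combined_scores(presence_prob: float, localization_scores: list) -> list:
    """Gate per-detection localization scores by a global
    'is the concept present at all?' probability (toy formulation)."""
    return [presence_prob * s for s in localization_scores]

# Concept absent: presence gates down even confident-looking detections.
print(combined_scores(0.05, [0.9, 0.8]))
# Concept present: localization scores pass through mostly intact.
print(combined_scores(0.98, [0.9, 0.8]))
```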
Decoupled Detector-Tracker Design
Unlike SAM 2's monolithic architecture, SAM 3 separates detection and tracking into independent modules:
- Shared: backbone (Perception Encoder)
- Separate: decoder heads (detection vs. tracking)
- Advantage: reduced inter-task interference, independent scaling
Training Pipeline (4 Stages)
| Stage | Description |
|---|---|
| 1 | Perception Encoder pretraining |
| 2 | Detector pretraining |
| 3 | Detector fine-tuning with SA-Co data |
| 4 | Backbone frozen, then Tracker training |
4.4 SA-Co Dataset
SAM 3 was trained on SA-Co (Segment Anything with Concepts), the largest and most diverse segmentation dataset ever.
Training Data
| Subset | Contents |
|---|---|
| SA-Co/HQ | 5.2M high-quality images, 4M unique noun phrases, 52M masks |
| SA-Co/SYN | 38M synthetic phrases, 1.4 billion masks (large-scale pretraining) |
| SA-Co/VIDEO | 52,500 videos, 24,800 unique concepts, 467,000+ masklets |
Total ontology: 22 million entities (17 top-level categories, 72 subcategories)
Evaluation Benchmarks
| Benchmark | Description |
|---|---|
| SA-Co/Gold | 7 multi-review image subsets (highest quality) |
| SA-Co/Silver | 10 domain-specific image subsets |
| SA-Co/VEval | 3 video evaluation subsets |
- 214,000 unique phrases, 126,000 images/videos — over 50x more diverse concepts than existing COCO/LVIS
Additional Public Datasets
- SA-FARI: 10,000+ wildlife camera trap videos (100+ species annotated)
- FathomNet: Underwater segmentation benchmark for marine imagery
4.5 Performance Benchmarks
Image Segmentation (Zero-Shot)
| Benchmark | Metric | SAM 3 | Previous Best | Improvement |
|---|---|---|---|---|
| LVIS | Mask AP | 47.0 | 38.5 (T-Rex2) | +22.1% |
| COCO | Box AP | 53.5 | 52.2 | +2.5% |
| SA-Co/Gold | CGF1 | 65.0 | 34.3 | +89.5% (~2x) |
| ODinW | AP | - | - | +20.1 AP over T-Rex2 |
Video Segmentation
| Benchmark | SAM 3 (J&F) | SAM 2.1 Large | Improvement |
|---|---|---|---|
| MOSEv2 | 60.1 | 47.9 | +25.5% |
| DAVIS 2017 | 92.0 | 90.7 | +1.4% |
| LVOSv2 | 88.2 | 79.6 | +10.8% |
Exemplar Utilization Effect
| Prompt Configuration | CGF1 | Improvement over Text Only |
|---|---|---|
| Text only | 46.4 | - |
| Text + 1 exemplar | 57.6 | +11.2 |
| Text + 3 exemplars | 65.0 | +18.6 |
Speed and Efficiency
- Inference speed: ~30ms per image on H200 GPU (including 100+ object detection)
- Model size: 848M parameters (~3.4 GB)
- Runnable on 16 GB GPUs
- vs. human performance: achieves 88% of the human lower bound on SA-Co/Gold
4.6 Installation and Usage
Official Repository
# Create environment
conda create -n sam3 python=3.12
conda activate sam3
# Install PyTorch
pip install torch==2.7.0 torchvision torchaudio \
--index-url https://download.pytorch.org/whl/cu126
# Install SAM 3
git clone https://github.com/facebookresearch/sam3.git
cd sam3
pip install -e .
# Optional: for notebooks or development
pip install -e ".[notebooks]"
pip install -e ".[train,dev]"
Image segmentation (text prompt):
from PIL import Image
from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor
model = build_sam3_image_model()
processor = Sam3Processor(model)
image = Image.open("path/to/image.jpg")
# Concept segmentation with text prompt
inference_state = processor.set_image(image)
output = processor.set_text_prompt(
    state=inference_state,
    prompt="yellow school bus",
)
masks, boxes, scores = output["masks"], output["boxes"], output["scores"]
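Because concept prompts return every matching instance, a typical follow-up is to threshold by confidence. A sketch with toy arrays standing in for `output["scores"]` and `output["boxes"]` (the 0.5 threshold is an arbitrary assumption):

```python
import numpy as np

# Toy stand-ins for per-instance outputs of a concept query.
scores = np.array([0.92, 0.41, 0.77, 0.12])
boxes = np.arange(16, dtype=float).reshape(4, 4)  # one xyxy box per instance

# Keep only instances above a confidence threshold.
keep = scores >= 0.5
print(keep.sum(), boxes[keep].shape)  # 2 (2, 4)
```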
Video tracking:
from sam3.model_builder import build_sam3_video_predictor
video_predictor = build_sam3_video_predictor()
response = video_predictor.handle_request(
    request=dict(
        type="start_session",
        resource_path="path/to/video.mp4",
    )
)
Ultralytics Integration
pip install -U ultralytics
from ultralytics.models.sam import SAM3SemanticPredictor
overrides = dict(
    conf=0.25, task="segment", mode="predict",
    model="sam3.pt", half=True,
)
predictor = SAM3SemanticPredictor(overrides=overrides)
predictor.set_image("path/to/image.jpg")
# Segment multiple concepts simultaneously
results = predictor(text=["person", "bus", "glasses"])
5. Comprehensive Comparison of All Three Models
5.1 Architecture Comparison
| Item | SAM 1 | SAM 2 | SAM 3 |
|---|---|---|---|
| Image encoder | ViT (MAE) | Hiera (MAE) | Meta Perception Encoder |
| Parameters | 636M (ViT-H) | 224.4M (Hiera-L) | 848M |
| Prompts | Point, box, mask, text | Point, box, mask | Text, exemplar, point, box, mask |
| Temporal model | None | Memory Attention | Memory + Tracker |
| Detector | None | None | DETR-based |
| Occlusion | Not supported | Occlusion Head | Occlusion + Presence Head |
| Output | Mask + IoU | Mask + IoU + Occlusion | Mask + box + score + concept |
5.2 Dataset Comparison
| Item | SA-1B (SAM 1) | SA-V (SAM 2) | SA-Co (SAM 3) |
|---|---|---|---|
| Images | 11M | - | 5.2M (HQ) |
| Videos | - | 50,900 | 52,500 |
| Masks | 1.1B | 35.5M | 1.4B+ (incl. SYN) |
| Concepts/Labels | None (class-agnostic) | None | 22M entities |
| Unique phrases | - | - | 4M |
5.3 Performance Evolution
| Benchmark | SAM 1 | SAM 2/2.1 | SAM 3 |
|---|---|---|---|
| DAVIS 2017 (J&F) | N/A | 90.7 | 92.0 |
| MOSE (J&F) | N/A | 74.6 (MOSE val) | 60.1 (MOSEv2, a harder successor benchmark; not directly comparable) |
| Image speed | Baseline | 6x faster | ~30ms/image |
| Zero-shot ability | Geometric | Geometric + temporal | Concept-level |
6. Practical Use Cases
6.1 When SAM 1 Is the Right Choice
- When fast segmentation on single images is needed
- Building interactive annotation tools
- When simple integration into existing pipelines is needed (most mature ecosystem)
6.2 When SAM 2 Is the Right Choice
- When video object tracking is needed
- Real-time video stream analysis
- Scenarios where occlusion handling is critical
- When faster speeds than SAM 1 are needed for images as well
6.3 When SAM 3 Is the Right Choice
- When segmenting via natural language search ("find all the cars")
- Open-vocabulary object detection + segmentation
- Visual exemplar-based few-shot segmentation
- Large-scale automatic labeling of images/videos
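The selection criteria in 6.1 to 6.3 can be condensed into a rule-of-thumb picker. This is a sketch distilled from the guidance above, not an official recommendation:

```python
def pick_sam(needs_video: bool, needs_text_prompts: bool) -> str:
    """Rule-of-thumb model choice from the use cases above."""
    if needs_text_prompts:
        return "SAM 3"  # open-vocabulary / concept prompts
    if needs_video:
        return "SAM 2"  # tracking, occlusion handling, streaming memory
    return "SAM 1"      # mature single-image interactive segmentation

print(pick_sam(needs_video=False, needs_text_prompts=False))  # SAM 1
```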
6.4 Real-World Deployment Examples (SAM 3)
- Instagram Edits app: Applying dynamic effects to specific people/objects in videos
- Facebook Marketplace "View in Room": AR furniture placement (SAM 3 + SAM 3D)
- Wildlife conservation: Camera trap monitoring based on the SA-FARI dataset
7. Study Roadmap
A step-by-step learning guide for deeply understanding and utilizing the SAM series.
7.1 Prerequisites
| Order | Topic | Recommended Reading |
|---|---|---|
| 1 | Vision Transformer (ViT) | An Image is Worth 16x16 Words |
| 2 | MAE (Masked Autoencoder) | Masked Autoencoders Are Scalable Vision Learners |
| 3 | DETR | End-to-End Object Detection with Transformers |
| 4 | CLIP | Learning Transferable Visual Models |
| 5 | Hiera | Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles |
7.2 Recommended Paper Reading Order
1. SAM 1 paper (2023) → Establish foundational concepts
↓
2. SAM 2 paper (2024) → Understand video extension
↓
3. SAM 3 paper (2025) → Understand concept awareness
↓
4. Hands-on notebooks from each GitHub repository
7.3 Recommended Hands-On Exercises
| Step | Exercise | Difficulty |
|---|---|---|
| 1 | SAM 1 automatic mask generation (Colab notebook) | Easy |
| 2 | SAM 2 video tracking walkthrough | Intermediate |
| 3 | SAM 3 text-prompt segmentation | Intermediate |
| 4 | SAM + Grounding DINO pipeline construction | Hard |
| 5 | Custom dataset fine-tuning | Hard |
| 6 | ONNX/TensorRT conversion and edge deployment | Advanced |
7.4 Related Follow-Up Research
| Model | Characteristics | Paper |
|---|---|---|
| EfficientSAM | Lightweight SAM (SAMI distill.) | arxiv.org/abs/2312.00863 |
| FastSAM | CNN-based real-time SAM | arxiv.org/abs/2306.12156 |
| MobileSAM | Mobile-optimized SAM | arxiv.org/abs/2306.14289 |
| Grounded SAM | Grounding DINO + SAM | github.com/IDEA-Research/Grounded-Segment-Anything |
| SAM-HQ | High-quality mask SAM | arxiv.org/abs/2306.01567 |
| SAM 3D | 3D object/body reconstruction (Meta) | ai.meta.com/sam3 |
8. References
Core Papers
- Kirillov, A., Mintun, E., Ravi, N., et al. (2023). "Segment Anything". ICCV 2023. arxiv.org/abs/2304.02643
- Ravi, N., Gabeur, V., Hu, Y.-T., et al. (2024). "SAM 2: Segment Anything in Images and Videos". arxiv.org/abs/2408.00714
- Carion, N., Gustafson, L., Hu, Y.-T., et al. (2025). "SAM 3: Segment Anything with Concepts". ICLR 2026. arxiv.org/abs/2511.16719
Prerequisite Papers
- Dosovitskiy, A., et al. (2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". ICLR 2021. arxiv.org/abs/2010.11929
- He, K., et al. (2022). "Masked Autoencoders Are Scalable Vision Learners". CVPR 2022. arxiv.org/abs/2111.06377
- Carion, N., et al. (2020). "End-to-End Object Detection with Transformers (DETR)". ECCV 2020. arxiv.org/abs/2005.12872
- Radford, A., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision (CLIP)". ICML 2021. arxiv.org/abs/2103.00020
- Ryali, C., et al. (2023). "Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles". ICML 2023. arxiv.org/abs/2306.00989
GitHub Repositories
- github.com/facebookresearch/segment-anything — SAM 1
- github.com/facebookresearch/sam2 — SAM 2
- github.com/facebookresearch/sam3 — SAM 3
Tutorials and Guides
- Ultralytics SAM Documentation — docs.ultralytics.com/models/sam
- Ultralytics SAM 2 Documentation — docs.ultralytics.com/models/sam-2
- Ultralytics SAM 3 Documentation — docs.ultralytics.com/models/sam-3
- Encord Blog: Segment Anything Model Explained — encord.com/blog/segment-anything-model-explained
- Roboflow: What is SAM 3 — blog.roboflow.com/what-is-sam3
- Meta AI Blog: Introducing SAM 3 — ai.meta.com/blog/segment-anything-model-3
Project Pages and Datasets
- Meta AI SAM 2 Project — ai.meta.com/sam2
- Meta AI SAM 3 Project — ai.meta.com/sam3
- SA-1B Dataset — ai.meta.com/datasets/segment-anything
- SA-V Dataset — ai.meta.com/datasets/segment-anything-video