Analyzing SOTA Segmentation and Detection Models — The Lineage of SAM, DETR, and YOLO
- Author: Youngju Kim (@fjvbn20031)
- Introduction
- The Lineage of Object Detection
- Three Tasks of Image Segmentation
- Segment Anything: Promptable Segmentation
- Open-Vocabulary Detection and Segmentation
- Backbone Evolution: What Extracts the Features
- Evaluation Metrics: mAP and IoU
- The Loss Function Angle
- Practical Pipeline and Common Mistakes
- Combining Detection and Segmentation
- The Real-Time Detection View
- Strengths and Limits Summary
- The Lineage at a Glance
- What to Choose in Which Situation
- Conclusion
- References
- Quiz
Introduction
In computer vision, two axes answer the question "what is where": object detection and image segmentation. Detection wraps objects in rectangular boxes to report location and class; segmentation traces object boundaries at the pixel level. This article organizes how the representative models in both areas evolved along a lineage of ideas, centered on architectural principles.
Let me state one premise up front. Vision SOTA moves very quickly, and rankings or numbers on any given benchmark shift with version and configuration. So rather than exact leaderboard positions, this article focuses on the core concept each model introduced and the influence that concept left on later work. Where the latest specs are uncertain, I avoid firm claims and describe things at the family level.
The Lineage of Object Detection
Two-Stage Detection: The R-CNN Family
Early deep learning approaches to detection used a two-stage structure: "first propose regions where objects might be, then classify each candidate." R-CNN opened this line, and it evolved with improvements to speed and accuracy.
- R-CNN: Extract candidate regions with selective search, run each region through a CNN to get features, then classify. Accurate but very slow, since the CNN runs repeatedly per candidate.
- Fast R-CNN: Pass the whole image through the CNN once and, with RoI Pooling, crop each candidate's features from the shared feature map for reuse. This drastically cut redundant computation.
- Faster R-CNN: Integrated candidate generation itself into a network (Region Proposal Network, RPN). Now the entire detector became a single trainable pipeline.
The two-stage family achieves high accuracy, but splitting proposal and classification into separate steps makes it relatively heavy for real-time inference.
One-Stage Detection: YOLO and SSD
One-stage detection removes the region proposal step: it divides the image into a grid and predicts boxes and classes in one shot at each location. True to the name "You Only Look Once," a single forward pass completes detection.
- YOLO: Split the image into a grid and let each cell directly regress box coordinates and class probabilities. Fast and suited to real-time applications.
- SSD (Single Shot MultiBox Detector): Predict boxes of various sizes from feature maps at multiple resolutions, handling small and large objects together.
The one-stage family balances speed and accuracy and has been widely used in practice. YOLO in particular has gone through many versions, continuously improving the backbone, feature fusion (neck), loss functions, and training tricks. Since the details and performance differ by version, I only summarize the family's trajectory: "starting from one-stage, anchor-based designs, it absorbed anchor-free and end-to-end elements to become the representative family for real-time detection."
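To make "each cell directly regresses box coordinates" concrete, here is a minimal sketch of turning one grid cell's output into an absolute box. The prediction layout `(dx, dy, w, h)`, the function name `decode_cell`, and the 7x7 grid on a 448x448 image are assumptions for illustration; actual YOLO versions differ in how they parameterize boxes.

```python
def decode_cell(row, col, pred, grid_size, img_w, img_h):
    """Convert one grid cell's prediction into an absolute (x1, y1, x2, y2) box.

    pred = (dx, dy, w, h): dx, dy are the box-centre offsets inside the cell
    (0..1), w and h are fractions of the full image. This layout is an
    assumption for illustration, not the format of any specific YOLO version.
    """
    dx, dy, w, h = pred
    # The cell index plus the in-cell offset locates the box centre.
    cx = (col + dx) / grid_size * img_w
    cy = (row + dy) / grid_size * img_h
    bw, bh = w * img_w, h * img_h
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)


# Centre cell of a 7x7 grid, box centred in the cell.
box = decode_cell(row=3, col=3, pred=(0.5, 0.5, 0.25, 0.5),
                  grid_size=7, img_w=448, img_h=448)
print(box)  # (168.0, 112.0, 280.0, 336.0)
```

The point of the sketch is that no proposal step exists: the cell position itself supplies the coarse location, and the network only predicts the refinement.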
Anchors and NMS: Two Manual Knobs
Traditional detectors relied on two human-designed elements.
- Anchors: Predefined reference boxes of various sizes and aspect ratios. The model adjusts (regresses) object boxes relative to anchors.
- NMS (Non-Maximum Suppression): When multiple overlapping boxes appear for one object, a post-processing step that removes lower-scoring duplicates.
These two worked well, but they required hyperparameter tuning and prevented the pipeline from being "fully end-to-end." DETR, which comes next, targets exactly this point.
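As a reference point for what NMS actually does, the following is a minimal sketch of the greedy variant in plain Python. The box format `(x1, y1, x2, y2)`, the helper names, and the default threshold of 0.5 are my assumptions; production code would normally use a library implementation.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]: the second box duplicates the first and is dropped
```

The `iou_threshold` argument is exactly the kind of hand-tuned knob the text refers to: set it too low and nearby distinct objects get merged, too high and duplicates survive.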
Transformer Detection: DETR
DETR (DEtection TRansformer, arXiv 2005.12872) reframed detection as a set prediction problem. Instead of grids, anchors, and NMS, its core idea is to have a transformer directly output a set of objects.
The basic idea is as follows.
- A CNN backbone extracts image features, and a transformer encoder organizes global relationships.
- The decoder takes a set of learned object queries, and each query predicts one object (or "no object").
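- During training, Hungarian matching pairs predictions and ground truth one-to-one. Each ground-truth object is assigned to exactly one query, and every unmatched query is trained to output "no object," so the model learns not to emit duplicates in the first place. As a result, NMS post-processing becomes unnecessary.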
DETR detection pipeline (concept)
image
|
[CNN backbone] --> feature map
|
[transformer encoder] (global self-attention)
|
[transformer decoder] <-- set of object queries (e.g. N)
|
each query -> (box coords, class or "no object")
|
Hungarian matching pairs 1:1 with ground truth for training
|
result: directly outputs a set of objects, no NMS
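The one-to-one matching step can be sketched as follows. To stay dependency-free, this sketch brute-forces the optimal assignment over permutations instead of running the Hungarian algorithm; for real sizes you would use a proper solver (for example SciPy's `linear_sum_assignment`). The L1 box cost is also a simplification I chose: DETR's matching cost additionally includes class and overlap terms.

```python
from itertools import permutations


def l1_cost(pred_box, gt_box):
    """Sum of absolute coordinate differences (simplified matching cost)."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))


def match(pred_boxes, gt_boxes):
    """Return {gt_index: pred_index} minimising the total cost, one-to-one.

    Brute force over permutations stands in for the Hungarian algorithm here;
    it assumes len(pred_boxes) >= len(gt_boxes) and only scales to tiny inputs.
    """
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(l1_cost(pred_boxes[p], gt_boxes[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return dict(enumerate(best_perm))


# Four query outputs, two ground-truth objects. Query 2 nearly duplicates query 1.
preds = [(0.1, 0.1, 0.3, 0.3), (0.6, 0.6, 0.9, 0.9),
         (0.58, 0.6, 0.9, 0.88), (0.0, 0.5, 0.2, 0.7)]
gts = [(0.6, 0.6, 0.9, 0.9), (0.1, 0.1, 0.3, 0.3)]

assignment = match(preds, gts)
unmatched = sorted(set(range(len(preds))) - set(assignment.values()))
print(assignment)  # {0: 1, 1: 0}
print(unmatched)   # [2, 3] -> these queries are trained toward "no object"
```

Note that the near-duplicate query 2 ends up unmatched, so training pushes it toward "no object." That pressure is what replaces NMS.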
DETR is conceptually elegant, but early versions converged slowly and were weak on small objects. Later work targeted these issues.
- Deformable DETR: Attend only around a few reference points rather than the whole feature map, speeding up convergence and improving small-object performance.
- DINO and other improvements: Follow-up families improved accuracy and convergence speed with better query initialization, denoising training, and more.
To summarize, detection's broad arc runs "two-stage (accurate but heavy) -> one-stage (fast, real-time) -> transformer set prediction (end-to-end, post-processing removed)," and in practice all three families still coexist depending on the goal.
Detection Family Comparison
| Family | Representative | Proposals | Post-processing | Character |
|---|---|---|---|---|
| Two-stage | Faster R-CNN | RPN | NMS | High accuracy, relatively heavy |
| One-stage | YOLO, SSD | none (direct) | NMS | Fast, real-time suited |
| Transformer | DETR family | object queries | not needed (set prediction) | End-to-end, conceptually simple |
Treat the accuracy and speed in the table as general tendencies of each family, not absolute rankings. Actual values shift substantially with backbone, input resolution, training data, and version.
Three Tasks of Image Segmentation
Segmentation answers "what is where" at the pixel level. It splits into three tasks by goal.
- Semantic segmentation: Assign a class to each pixel. Different instances of the same class are not distinguished. Example: paint road, sky, and person regions, but treat three people as one blob.
- Instance segmentation: Distinguish individual objects (instances) with pixel masks. Three people are separated into three different masks.
- Panoptic segmentation: Unifies semantic and instance. Uncountable regions like background (stuff) are handled semantically, and countable objects (things) as instances, so the whole scene is described completely.
same scene, three segmentation views
semantic: [person][person][person] -> all one "person" color
instance: [person1][person2][person3] -> per-object mask
panoptic: things(person1,2,3) + stuff(road, sky) unified
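The difference between the three views is easiest to see on a toy scene. The sketch below stores masks as sets of `(row, col)` pixels, which is a representation I picked for readability, and derives the semantic and panoptic views from the same instance masks; the scene and function names are made up.

```python
# Toy scene: two people (things) standing on a road (stuff).
instances = [
    ("person", {(0, 0), (1, 0)}),
    ("person", {(0, 2), (1, 2)}),
]
stuff = {"road": {(2, 0), (2, 1), (2, 2)}}


def semantic_view(instances, stuff):
    """pixel -> class name. Instances of the same class merge into one label."""
    out = {}
    for cls, pixels in stuff.items():
        out.update({p: cls for p in pixels})
    for cls, pixels in instances:
        out.update({p: cls for p in pixels})
    return out


def panoptic_view(instances, stuff):
    """pixel -> (class name, instance id). Stuff regions get id None."""
    out = {}
    for cls, pixels in stuff.items():
        out.update({p: (cls, None) for p in pixels})
    for inst_id, (cls, pixels) in enumerate(instances, start=1):
        out.update({p: (cls, inst_id) for p in pixels})
    return out


semantic = semantic_view(instances, stuff)
panoptic = panoptic_view(instances, stuff)
print(semantic[(0, 0)], semantic[(0, 2)])  # person person
print(panoptic[(0, 0)], panoptic[(0, 2)])  # ('person', 1) ('person', 2)
print(panoptic[(2, 1)])                    # ('road', None)
```

The instance view is simply the `instances` list itself: per-object masks with no labels for the road.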
The Arc of Segmentation Architectures
- FCN (Fully Convolutional Network): Replaced fully connected layers with convolutions, opening the way to per-pixel classification for arbitrary input sizes.
- U-Net: Shrink with an encoder to gain context, restore with a decoder to recover detail, and pass low-level location information through skip connections. Widely used in medical imaging.
- Mask R-CNN: Added a mask prediction branch to Faster R-CNN, becoming a standard approach to instance segmentation. A representative case of binding detection and segmentation in one framework.
- Mask-based transformer approaches: Recently, transformer families actively try to unify semantic, instance, and panoptic under a single "mask set prediction" framework. This can be seen as DETR's set-prediction mindset extended to segmentation.
Segment Anything: Promptable Segmentation
A Shift in Perspective
Segment Anything (SAM, arXiv 2304.02643) changed how we view segmentation. Prior models mostly trained on a specific dataset to assign a "fixed set of classes" to pixels. SAM instead proposes the task of promptable segmentation. Given a prompt such as a point, box, or rough mask, it segments the target that prompt indicates, even without a class name.
Structure
SAM has three main parts.
SAM structure (concept)
image --> [image encoder (heavy, computed once)] --> image embedding
|
prompt (point/box/mask) --> [prompt encoder] --> prompt embedding
|
v
[lightweight mask decoder] --> mask(s)
|
when ambiguous: several candidate masks + scores
- Image encoder: Usually a heavy vision transformer that encodes the image once. Because this embedding is reused, multiple prompts on the same image can be processed quickly.
- Prompt encoder: Converts point, box, and mask prompts into embeddings.
- Mask decoder: Combines the image embedding and prompt embedding to output masks. When a prompt is ambiguous (e.g., clicking a point on clothing: is it "clothing" or "person"?), it outputs several candidate masks with confidence scores.
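The "encode once, decode many times" pattern from the three parts above can be sketched with a small wrapper. Everything here is a toy stand-in: `PromptableSegmenter`, `toy_encoder`, and `toy_decoder` are hypothetical names, not the API of the Segment Anything code, and the "embedding" is just the image size.

```python
class PromptableSegmenter:
    """Toy illustration of caching a heavy image embedding across prompts."""

    def __init__(self, image_encoder, mask_decoder):
        self.image_encoder = image_encoder
        self.mask_decoder = mask_decoder
        self._embedding = None

    def set_image(self, image):
        # Heavy step: runs once per image.
        self._embedding = self.image_encoder(image)

    def predict(self, prompt):
        # Light step: runs once per prompt and reuses the cached embedding.
        if self._embedding is None:
            raise RuntimeError("call set_image() before predict()")
        return self.mask_decoder(self._embedding, prompt)


encoder_calls = []


def toy_encoder(image):
    encoder_calls.append(1)  # count how often the heavy step runs
    return {"size": (len(image), len(image[0]))}


def toy_decoder(embedding, prompt):
    # Placeholder: a real decoder would return mask(s) plus confidence scores.
    return {"prompt": prompt, "embedding_size": embedding["size"]}


segmenter = PromptableSegmenter(toy_encoder, toy_decoder)
segmenter.set_image([[0] * 4 for _ in range(3)])
results = [segmenter.predict(p) for p in [("point", 1, 2), ("point", 0, 3), ("box", 0, 0, 2, 2)]]
print(len(encoder_calls), len(results))  # 1 3
```

Three prompts cost one encoder pass, which is why interactive use on a single image stays responsive even with a heavy encoder.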
The Data Engine
Another axis often emphasized in SAM is how the large-scale mask data was built. Through an iterative process where model and humans took turns generating and reviewing masks, it constructed mask data at a scale hard to draw by hand alone. Much of the generalization ability to "separate many kinds of objects by prompt" comes from this large, diverse data.
Meaning and Limits of SAM
- Meaning: SAM showed the possibility of general-purpose segmentation not tied to specific classes. Combined with other detectors and recognizers, it becomes a powerful building block for downstream tasks like "cut the object in this box into a mask."
- Limits: SAM itself does not directly assign a semantic label like "this is a cat." Class recognition needs a separate model or prompt design. Also, the heavy image encoder means lightweight variants or follow-up work are needed for real-time and lightweight settings.
Open-Vocabulary Detection and Segmentation
Traditional detectors recognize only the fixed list of classes seen in training. The open-vocabulary approach tries to loosen this constraint. Using representations trained jointly on text and images (e.g., contrastive families that align images and text in the same space), one can indicate even classes not explicitly labeled during training via text descriptions.
- Idea: Treat classes as text embeddings rather than fixed integer labels. You can specify targets with free expressions like "red umbrella."
- Effect: Flexible, since you can point at new categories without retraining each time.
- Caution: The accuracy of free-text instructions varies widely by target, phrasing, and domain. In practice, validation on the target domain is required.
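The idea of "classes as text embeddings" reduces to a similarity lookup, which the sketch below illustrates. The three-dimensional vectors are invented for the example; in a real system both the region embedding and the text embeddings would come from a jointly trained image-text model, and the function names here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def classify_region(region_embedding, text_embeddings):
    """Score a region against every text prompt and return the best label."""
    scores = {label: cosine(region_embedding, emb)
              for label, emb in text_embeddings.items()}
    return max(scores, key=scores.get), scores


# Made-up embeddings. The label set is just a dict, so adding a new
# category means adding an entry, not retraining a classifier head.
text_embeddings = {
    "red umbrella": (0.9, 0.1, 0.0),
    "dog": (0.0, 0.2, 0.9),
}
region = (0.8, 0.2, 0.1)

label, scores = classify_region(region, text_embeddings)
print(label)  # red umbrella
```

Because the scores depend entirely on how well the embedding space separates your phrasing, the caution about validating on the target domain applies directly to this step.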
Combining SAM-style promptable segmentation with open-vocabulary detection lets you build a pipeline that "specifies what to find via text and separates that target into a pixel mask." A hallmark of recent trends is that detection, segmentation, and language-aligned representations combine like building blocks.
Backbone Evolution: What Extracts the Features
Detection and segmentation models can usually be split into three parts: "backbone + neck + head." Among these, the backbone is the core part that extracts features from the image, and backbone progress has driven much of detection and segmentation performance.
- CNN backbones: Families like ResNet stack convolutions deeply to learn local patterns hierarchically. They were the standard detection and segmentation backbone for a long time.
- Feature pyramid (FPN): Connects feature maps of different resolutions top-down to handle small objects (high resolution) and large objects (low resolution) together. A core part of multi-scale detection.
- Vision transformer backbones: Split the image into patches and learn global relationships with attention. Hierarchical transformer backbones preserve both locality and globality and are widely used as detection and segmentation backbones too.
3-part structure of detection/segmentation models (concept)
image
|
[backbone] --> feature maps at various resolutions
|
[neck (FPN, etc.)] --> cross-scale feature fusion
|
[head] --> box/class or mask prediction
Backbone choice is a big knob for accuracy and speed. A heavy backbone is accurate but slow; a light backbone is fast but loses out on hard scenes. In practice you pick a backbone for target latency and accuracy, and start from pretrained weights if needed.
Evaluation Metrics: mAP and IoU
Understanding how detection and segmentation results are judged "good or bad" lets you read benchmark numbers far more accurately.
- IoU (Intersection over Union): The degree of overlap between predicted and ground-truth regions. Intersection divided by union; closer to 1 is a better match. Whether box or mask, it is the basic measure of "how much overlaps."
- Precision and recall: Precision is "the fraction of predictions that were correct," recall is "the fraction of ground truth that was found." They tend to trade off, so look at both.
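- AP and mAP (mean Average Precision): AP is the area under the precision-recall curve traced out by sweeping the confidence threshold, computed per class at a given IoU threshold. mAP is the mean of AP over classes (some benchmarks, such as COCO, also average over several IoU thresholds). The representative metric for detection and instance segmentation.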
IoU concept (box example)
IoU = intersection area / union area
good match: prediction and ground truth overlap greatly --> high IoU
miss: small overlap --> low IoU
a prediction at or above an IoU threshold (e.g. 0.5) usually counts as a "correct detection"
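The IoU definition above translates almost directly into code. This is a minimal sketch assuming boxes are `(x1, y1, x2, y2)` tuples and masks are sets of `(row, col)` pixels; both representations are my choice for illustration.

```python
def box_iou(pred, gt):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    return inter / union if union > 0 else 0.0


def mask_iou(pred_pixels, gt_pixels):
    """IoU of two masks given as sets of (row, col) pixels."""
    union = len(pred_pixels | gt_pixels)
    return len(pred_pixels & gt_pixels) / union if union else 0.0


half_overlap = box_iou((0, 0, 10, 10), (5, 0, 15, 10))   # 50 / 150
no_overlap = box_iou((0, 0, 10, 10), (20, 20, 30, 30))   # 0 / 200
is_correct = half_overlap >= 0.5                          # False at threshold 0.5
```

Note how a box that covers half of the ground truth only reaches an IoU of about 0.33: the union in the denominator penalizes the non-overlapping area on both sides.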
Note that one mAP number is not enough to rank models. Even at the same mAP, strengths on small objects, large objects, and specific classes can differ. Rankings also flip depending on the benchmark dataset's characteristics (urban roads, indoors, aerial, etc.). A number only means something when you also know "on which dataset, at which IoU criterion" it was measured.
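To show how AP and mAP come out of ranked detections, here is a small sketch. It computes the plain (non-interpolated) area under the precision-recall curve; benchmark toolkits typically use interpolated variants, so treat the numbers as illustrative rather than comparable to published scores. The input format, a list of `(confidence, is_true_positive)` pairs already judged against an IoU threshold, is an assumption.

```python
def average_precision(detections, num_gt):
    """AP for one class from (confidence, is_true_positive) pairs.

    Walks detections in descending confidence and accumulates
    precision * (increase in recall) at every true positive.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp, ap, prev_recall = 0, 0.0, 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            tp += 1
            precision, recall = tp / rank, tp / num_gt
            ap += precision * (recall - prev_recall)
            prev_recall = recall
    return ap


def mean_average_precision(per_class):
    """per_class: {class_name: (detections, num_gt)} -> mean of per-class AP."""
    aps = [average_precision(dets, n) for dets, n in per_class.values()]
    return sum(aps) / len(aps)


per_class = {
    # 2 ground-truth cars: hit, false alarm, hit.
    "car": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    # 1 ground-truth person, found by the top-ranked detection.
    "person": ([(0.95, True), (0.4, False)], 1),
}
car_ap = average_precision(*per_class["car"])   # 1.0 * 0.5 + (2/3) * 0.5
m_ap = mean_average_precision(per_class)        # mean of car AP and 1.0
```

The two classes score quite differently even though they feed one summary number, which is the practical reason to look at per-class AP alongside mAP.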
The Loss Function Angle
Detection and segmentation models train by combining several losses. Knowing what each loss teaches makes the model's behavior easier to understand.
- Classification loss: Teaches which class (or background) each location or candidate is. Losses that weight hard examples are sometimes used to handle the imbalance where background is far more common.
- Box regression loss: Adjusts predicted boxes closer to ground-truth boxes. It either measures coordinate differences directly or uses IoU itself as the loss.
- Mask loss (segmentation): Compares masks with ground truth per pixel. A per-pixel classification loss and a region-overlap loss are often used together.
- Matching loss (DETR family): Pairs the predicted set and ground-truth set one-to-one, then computes the box losses on the matched pairs. Unmatched predictions are trained only through the classification loss, toward "no object."
Loss design often governs accuracy and stability. If background imbalance is not handled well, for example, the model can drift toward predicting "everything is background."
One thing to remember is that the weight balance among losses shifts the model's tendencies. Weighting box regression heavily makes locations accurate but can destabilize classification, and vice versa. A loss that emphasizes hard examples helps catch rare objects, but when labels are noisy it risks fitting the noise instead. Think of loss as the language that tells the model "what to consider more important."
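One well-known instance of a "loss that weights hard examples" is focal loss; I use it here only as an example, since the discussion above does not name a specific loss. The sketch compares it with plain binary cross-entropy on a single prediction; the defaults `gamma=2.0` and `alpha=0.25` are commonly cited values and should be treated as tunable assumptions.

```python
import math

EPS = 1e-7  # keeps log() away from zero


def binary_cross_entropy(p, y):
    """Standard BCE for one predicted probability p and label y in {0, 1}."""
    p = min(max(p, EPS), 1.0 - EPS)
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """BCE scaled by (1 - p_t)^gamma, so confident correct predictions
    contribute almost nothing and hard examples dominate the gradient."""
    p = min(max(p, EPS), 1.0 - EPS)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Easy background (model says 1% object, label is background)
# versus a hard positive (model says 10% object, label is object).
easy_ratio = focal_loss(0.01, 0) / binary_cross_entropy(0.01, 0)
hard_ratio = focal_loss(0.1, 1) / binary_cross_entropy(0.1, 1)
print(easy_ratio < hard_ratio)  # True: easy background is down-weighted far more
```

With thousands of easy background locations per image, that down-weighting is what keeps the sum of their losses from drowning out the few object locations.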
Practical Pipeline and Common Mistakes
Here are points you meet repeatedly when building a real detection and segmentation system.
- Data is 80 percent: The quantity and quality of domain-appropriate labeled data governs performance. Data with varied lighting, angle, and occlusion helps generalization.
- Class imbalance: When the ratio between common and rare classes is large, rare classes get ignored. Ease it with sampling, weighting, and augmentation.
- Small-object problem: Small objects have few pixels in feature maps and are easy to miss. High-resolution input, multi-scale features (FPN), and tiled inference help.
- Domain gap: When training data differs from the actual deployment environment, performance drops. Fine-tuning on deployment data or domain adaptation is needed.
- Post-processing tuning: Values like NMS threshold and confidence threshold must be tuned to the goal (recall-first vs. precision-first).
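Of the points above, tiled inference is the one that benefits most from a concrete sketch. The code below only computes overlapping tile windows and maps a box from tile coordinates back to the full image; the tile size, overlap, and function names are assumptions, and running the detector per tile and merging duplicates (for example with NMS) is left out.

```python
def tile_windows(img_w, img_h, tile, overlap):
    """Return (x1, y1, x2, y2) windows that cover the image.

    Neighbouring tiles share `overlap` pixels so an object cut by one tile
    border is fully visible in another. The last tile in each direction is
    aligned flush with the image border.
    """
    stride = tile - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than tile")

    def starts(length):
        if length <= tile:
            return [0]
        return list(range(0, length - tile, stride)) + [length - tile]

    return [(x, y, min(x + tile, img_w), min(y + tile, img_h))
            for y in starts(img_h) for x in starts(img_w)]


def to_full_image(box, window):
    """Shift a box predicted in tile coordinates back to image coordinates."""
    ox, oy = window[0], window[1]
    return (box[0] + ox, box[1] + oy, box[2] + ox, box[3] + oy)


windows = tile_windows(img_w=1000, img_h=600, tile=512, overlap=64)
print(len(windows))   # 6
print(windows[-1])    # (488, 88, 1000, 600)
print(to_full_image((10, 20, 30, 40), windows[-1]))  # (498, 108, 518, 128)
```

The trade-off is cost: six tiles means roughly six detector passes, so tiling is usually reserved for cases where small objects matter more than throughput.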
a typical detection pipeline (concept)
data collection / labeling
|
preprocess/augment (crop, color jitter, mosaic, etc.)
|
train with pretrained backbone
|
measure mAP on validation + tune thresholds
|
fine-tune on deployment-environment data
|
lighten (quantize, etc.) then deploy
|
collect false positives/negatives in operation -> feed back to data
The most common mistake is assuming "the benchmark numbers are good, so our problem will be solved too." Always validate the gap between benchmark and real domain, and keep the habit of re-measuring on your own data.
Combining Detection and Segmentation
One big feature of recent trends is that detection, segmentation, and language combine like parts. Representative combinations are as follows.
- Detection + SAM: Get object boxes with a detector, feed those boxes as SAM prompts to obtain precise masks. Detection handles "what is where," SAM handles "the pixel boundary."
- Open-vocabulary detection + SAM: Instruct targets via text to get boxes, cut masks with SAM, then follow across frames with a tracker if needed.
- Segmentation + depth/3D: Combining segmentation masks with depth information tells you "where and how far this object is," useful in robotics and AR.
Such combinations are possible because each part evolves independently yet connects through standard interfaces (boxes, masks, text embeddings). To understand the lineage is to know how these parts can be assembled.
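The "standard interface" point can be sketched as a function that only assumes two callables: a detector that returns boxes and a segmenter that accepts a box prompt. Both `fake_detector` and `fake_segmenter` below are hypothetical stubs standing in for real models, and the box format `(x1, y1, x2, y2)` with exclusive upper bounds is my assumption.

```python
def detect_then_segment(image, detector, segmenter):
    """detector(image) -> [(box, label, score)]; segmenter(image, box) -> mask.

    The detector answers "what is where"; each box is then passed on as a
    prompt so the segmenter can supply the pixel boundary.
    """
    results = []
    for box, label, score in detector(image):
        results.append({"label": label, "score": score, "box": box,
                        "mask": segmenter(image, box)})
    return results


def fake_detector(image):
    # Stub: pretends to have found one object.
    return [((1, 1, 3, 3), "cat", 0.9)]


def fake_segmenter(image, box):
    # Stub: foreground = non-zero pixels inside the box.
    x1, y1, x2, y2 = box
    return {(r, c) for r in range(y1, y2) for c in range(x1, x2) if image[r][c]}


image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
results = detect_then_segment(image, fake_detector, fake_segmenter)
print(results[0]["label"], sorted(results[0]["mask"]))
# cat [(1, 1), (1, 2), (2, 1)]
```

Because the function depends only on the box interface, either side can be swapped (a different detector, an open-vocabulary detector, a lighter segmenter) without touching the other.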
The Real-Time Detection View
In practice, detection is often bound by latency. The factors that govern real-time detection are roughly as follows.
- Backbone lightening: Raise throughput with reduced-compute backbones and efficient feature fusion.
- Anchor-free and end-to-end elements: Simplify the pipeline by reducing post-processing and tuning burden.
- Input resolution tuning: Lowering resolution is faster but hurts small-object performance, a trade-off.
- Hardware and quantization: Lower precision (quantization) or optimize for a specific accelerator to raise throughput.
The reason the YOLO family has long been cited as the representative of real-time detection is that it balances these factors without heavily sacrificing accuracy. That said, which version is "best" depends on target latency, accuracy requirements, and deployment hardware, so it is safer to avoid absolute conclusions.
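For the quantization factor, a minimal sketch of symmetric per-tensor int8 quantization shows where the accuracy cost comes from. This is one simple scheme chosen for illustration; real toolchains offer several (per-channel scales, zero points, calibration), and the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [x * scale for x in q]


weights = [0.5, -1.0, 0.25, 0.0, 0.731]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q[1])                        # -127: the largest magnitude maps to the range edge
print(max_error <= scale / 2 + 1e-12)  # True: rounding error is at most half a step
```

Each weight now fits in one byte instead of four, and the error per weight is bounded by half the step size. Whether that error is acceptable for a given detector is something to measure on validation data rather than assume.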
Strengths and Limits Summary
| Approach | Strength | Limit |
|---|---|---|
| Faster R-CNN | Stable high accuracy | Relatively heavy for real-time |
| YOLO/SSD | Fast real-time inference | May struggle with tiny objects or dense scenes |
| DETR family | End-to-end, no NMS | Slow convergence in early versions (mitigated by follow-up work) |
| Mask R-CNN | Unified detection + instance segmentation | Heavy pipeline |
| SAM | Class-agnostic general segmentation | No semantic labels, heavy encoder |
| Open-vocabulary | Instruct via free text | Accuracy varies by domain and phrasing |
The Lineage at a Glance
Organizing the arc so far as a chronological lineage gives the following. The big picture emerges once you note which hand-designed element each stage absorbed into learning.
lineage of detection/segmentation progress (concept)
[detection]
R-CNN --> Fast R-CNN --> Faster R-CNN (absorb proposals into RPN)
\--> YOLO / SSD (remove proposals, one-stage)
\--> DETR (remove anchors/NMS, set prediction)
\--> Deformable/DINO etc (better convergence, small objects)
[segmentation]
FCN --> U-Net (encoder-decoder, skips)
\--> Mask R-CNN (detection + instance segmentation)
\--> mask transformer family (unified mask set prediction)
[generalization]
SAM (promptable segmentation)
+ open-vocabulary (instruct via text)
= assembly-style vision system
The recurring pattern in this lineage is "replacing elements humans used to hand-tune with learnable parts." With this view, you can quickly place a newly appearing model at its spot in the lineage.
One more thing: this lineage is not a story of "the new completely replacing the old." Two-stage detectors like Faster R-CNN are still used where accuracy matters, the YOLO family remains the real-time standard, and the DETR family keeps gaining ground thanks to its post-processing-free design. Each family sits on different trade-offs, and in practice you choose by where among those trade-offs you want to stand.
What to Choose in Which Situation
- Accuracy first, latency to spare: Two-stage like Faster R-CNN or a detector with a strong backbone. For precise instance segmentation, the Mask R-CNN family.
- Real-time essential (surveillance, robots, mobile): A one-stage detector like the YOLO family with a lightweight backbone and quantization.
- Want to reduce post-processing tuning: The end-to-end approach of the DETR family. Consider training cost and convergence, though.
- Need precise masks, flexible classes: The combination of finding targets with a detector and cutting masks with SAM.
- Must handle new classes often: Text instructions via open-vocabulary detection. Domain validation is essential.
There is no single answer; usually you combine several parts to fit the goal. This is the practical reason to understand the lineage.
Conclusion
The evolution of detection and segmentation can be summed up as "a process of absorbing human-designed components into learning, one at a time." Hand-designed region proposals were replaced by the RPN, anchors and NMS post-processing by set prediction, and fixed classes by text-aligned representations. SAM added the axis of "segment anything by prompt."
The key is not any model's momentary ranking, but that the concepts each model introduced recombine like parts, keeping the whole vision system flexible. Even when a new SOTA appears, understanding the mindset of this lineage lets you follow the change far faster.
Finally, I want to stress one attitude for practitioners. Rather than always grabbing the top model on a benchmark leaderboard, it is better to first clarify the trade-offs your problem demands (latency vs. accuracy, box vs. mask, fixed class vs. free instruction). Once that trade-off is set, choosing and assembling the right parts on the lineage organized here becomes far easier. Models keep changing, but the mindset of handling trade-offs endures.
References
- DETR paper "End-to-End Object Detection with Transformers": arxiv.org/abs/2005.12872
- Segment Anything paper: arxiv.org/abs/2304.02643
- Segment Anything official page: segment-anything.com
- Faster R-CNN paper: arxiv.org/abs/1506.01497
- Mask R-CNN paper: arxiv.org/abs/1703.06870
- U-Net paper: arxiv.org/abs/1505.04597
- Deformable DETR paper: arxiv.org/abs/2010.04159
- SSD paper: arxiv.org/abs/1512.02325
- Segment Anything code: github.com/facebookresearch/segment-anything
Quiz
Q1: What is the fundamental difference between two-stage and one-stage detection?
Two-stage first generates candidate regions, then classifies each candidate (e.g., Faster R-CNN). One-stage predicts boxes and classes directly from the image with no proposal step (e.g., YOLO, SSD). One-stage is usually faster.
Q2: Why can DETR eliminate NMS?
Because during training Hungarian matching pairs predictions and ground truth one-to-one, and unmatched queries are trained to output "no object," the model learns not to emit duplicate predictions. As a set-prediction method, it has no need to remove overlapping boxes afterward.
Q3: What distinguishes semantic, instance, and panoptic segmentation?
Semantic assigns only a class per pixel and does not separate instances. Instance separates individual objects into distinct masks. Panoptic unifies both: background (stuff) semantically and objects (things) as instances.
Q4: What is "promptable segmentation" as SAM describes it?
Given prompts like a point, box, or rough mask, it segments the target even without a class name. It aims at general-purpose segmentation not bound to a fixed set of classes.
Q5: Why do SAM's image encoder and mask decoder have different compute costs?
The image encoder is heavy but runs once per image, reusing the embedding. The mask decoder is lightweight, so multiple prompts on the same image can be processed quickly.
Q6: What is the core idea of open-vocabulary detection?
Treating classes as text embeddings rather than fixed integer labels, so targets not explicitly labeled during training can be indicated via free text. It leverages text-image aligned representations.
Q7: Name one limitation of SAM.
SAM only produces masks; it does not directly attach a semantic label like "this is X." Class recognition needs a separate model, and its heavy image encoder requires lightening for real-time and lightweight settings.
Q8: Why is it hard to assert absolute rankings when choosing a detection model?
Because accuracy and speed vary greatly with backbone, input resolution, training data, version, and deployment hardware. It is realistic to choose a family based on target latency and accuracy needs.
Q9: What do IoU and mAP each measure?
IoU measures the overlap (intersection/union) between predicted and ground-truth regions, a basic metric. AP is the area under the precision-recall curve obtained by sweeping the confidence threshold for one class, and mAP is the mean of AP over classes, the representative metric for detection and instance segmentation.
Q10: In the "backbone + neck + head" structure, what is the role of the neck (FPN, etc.)?
It fuses feature maps of different resolutions to handle small objects (high resolution) and large objects (low resolution) together. A core part of multi-scale detection.
Q11: Why does "high benchmark mAP" not guarantee solving the real problem?
Because the benchmark dataset's characteristics can differ from the actual deployment domain. Different lighting, angle, and class distribution hurt performance, so re-validating on your own data matters.
Q12: What is a typical way to combine a detector with SAM?
Get object boxes with a detector, then feed those boxes as SAM prompts to obtain precise pixel masks. The detector handles "what is where," SAM handles "the pixel boundary," a division of roles.