Analyzing SOTA Segmentation and Detection Models — The Lineage of SAM, DETR, and YOLO

Introduction
The Lineage of Object Detection
Three Tasks of Image Segmentation
- The Arc of Segmentation Architectures
Segment Anything: Promptable Segmentation
Open-Vocabulary Detection and Segmentation
Backbone Evolution: What Extracts the Features
Evaluation Metrics: mAP and IoU
The Loss Function Angle
Practical Pipeline and Common Mistakes
Combining Detection and Segmentation
The Real-Time Detection View
Strengths and Limits Summary
The Lineage at a Glance
What to Choose in Which Situation
Conclusion
References
Quiz

Introduction

In computer vision, two axes answer the question "what is where": object detection and image segmentation. Detection wraps objects in rectangular boxes to report location and class; segmentation traces object boundaries at the pixel level. This article organizes how the representative models in both areas evolved along a lineage of ideas, centered on architectural principles.

Let me state one premise up front. Vision SOTA moves very quickly, and rankings or numbers on any given benchmark shift with version and configuration. So rather than exact leaderboard positions, this article focuses on the core concept each model introduced and the influence that concept left on later work. Where the latest specs are uncertain, I avoid firm claims and describe things at the family level.

The Lineage of Object Detection

Two-Stage Detection: The R-CNN Family

Early deep learning approaches to detection used a two-stage structure: "first propose regions where objects might be, then classify each candidate." R-CNN opened this line, and it evolved with improvements to speed and accuracy.

R-CNN: Extract candidate regions with selective search, run each region through a CNN to get features, then classify. Accurate but very slow, since the CNN runs repeatedly per candidate.
Fast R-CNN: Pass the whole image through the CNN once and, with RoI Pooling, crop each candidate's features from the shared feature map for reuse. This drastically cut redundant computation.
Faster R-CNN: Integrated candidate generation itself into a network (Region Proposal Network, RPN). Now the entire detector became a single trainable pipeline.

The two-stage family achieves high accuracy, but splitting the work into a proposal stage and a classification stage makes it relatively heavy for real-time inference.

One-Stage Detection: YOLO and SSD

One-stage detection removes the region proposal step: it divides the image into a grid and predicts boxes and classes in one shot at each location. True to the name "You Only Look Once," a single forward pass completes detection.

YOLO: Split the image into a grid and let each cell directly regress box coordinates and class probabilities. Fast and suited to real-time applications.
SSD (Single Shot Detector): Predict boxes of various sizes from feature maps at multiple resolutions, handling small and large objects together.
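
As a rough illustration of the grid idea above, the sketch below finds which grid cell "owns" an object and what offsets that cell would regress. It assumes the convention of the original YOLO design, where the cell containing the box center is responsible for the object; later versions differ in detail. The function name and arguments are made up for this example.

```python
def responsible_cell(cx, cy, img_w, img_h, grid_size):
    """Return (row, col, x_offset, y_offset) for a box center given in pixels.

    Assumption: the cell that contains the box center is responsible
    for predicting the object.
    """
    # Box center expressed in grid units, e.g. 3.5 = middle of cell 3.
    gx = cx / img_w * grid_size
    gy = cy / img_h * grid_size
    # Clamp so a center lying exactly on the right/bottom border stays in range.
    col = min(int(gx), grid_size - 1)
    row = min(int(gy), grid_size - 1)
    # Position of the center inside its cell: the value the cell regresses.
    return row, col, gx - col, gy - row


row, col, x_off, y_off = responsible_cell(320, 240, 640, 480, grid_size=7)
print(row, col, x_off, y_off)
```

Each cell then only has to predict small offsets relative to its own position, which is part of what lets a single forward pass cover the whole image.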

The one-stage family balances speed and accuracy and has been widely used in practice. YOLO in particular has gone through many versions, continuously improving the backbone, feature fusion (neck), loss functions, and training tricks. Since the details and performance differ by version, I only summarize the family's trajectory: "starting from one-stage, anchor-based designs, it absorbed anchor-free and end-to-end elements to become the representative family for real-time detection."

Anchors and NMS: Two Manual Knobs

Traditional detectors relied on two human-designed elements.

Anchors: Predefined reference boxes of various sizes and aspect ratios. The model adjusts (regresses) object boxes relative to anchors.
NMS (Non-Maximum Suppression): A post-processing step that, when multiple overlapping boxes appear for one object, keeps the highest-scoring box and removes the lower-scoring duplicates.
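
Both knobs are easy to see in code. The sketch below is a minimal pure-Python version, not any particular library's implementation. It assumes anchors are given as (center x, center y, width, height), boxes as (x1, y1, x2, y2), and the offset parameterization commonly used in the R-CNN family; the names `decode_anchor`, `iou`, and `nms` are chosen for this example.

```python
import math


def decode_anchor(anchor, deltas):
    """Turn an anchor plus predicted offsets into an (x1, y1, x2, y2) box."""
    ax, ay, aw, ah = anchor          # anchor center and size
    tx, ty, tw, th = deltas          # predicted offsets
    cx = ax + tx * aw                # shift center by a fraction of anchor size
    cy = ay + ty * ah
    w = aw * math.exp(tw)            # scale width/height multiplicatively
    h = ah * math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of the boxes to keep, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
print(nms(boxes, scores=[0.9, 0.8, 0.7]))
```

The anchor shapes and `iou_threshold` are exactly the kind of hand-set values this section calls manual knobs.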

These two worked well, but they required hyperparameter tuning and prevented the pipeline from being "fully end-to-end." DETR, which comes next, targets exactly this point.

Transformer Detection: DETR

DETR (DEtection TRansformer, arXiv 2005.12872) reframed detection as a set prediction problem. Instead of grids, anchors, and NMS, its core idea is to have a transformer directly output a set of objects.

The basic idea is as follows.

A CNN backbone extracts image features, and a transformer encoder organizes global relationships.
The decoder takes a set of learned object queries, and each query predicts one object (or "no object").
During training, Hungarian matching pairs predictions and ground truth one-to-one, and queries left unmatched are trained to predict "no object." The model therefore learns not to emit duplicate predictions for the same object, and NMS post-processing becomes unnecessary.

        DETR detection pipeline (concept)

  image
    |
  [CNN backbone]  --> feature map
    |
  [transformer encoder]  (global self-attention)
    |
  [transformer decoder] <-- set of object queries (e.g. N)
    |
  each query -> (box coords, class or "no object")
    |
  Hungarian matching pairs 1:1 with ground truth for training
    |
  result: directly outputs a set of objects, no NMS
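
The one-to-one matching step can be sketched in a few lines. This is a toy stand-in that tries every assignment by brute force, which is only feasible for a handful of objects; real implementations use the Hungarian algorithm (for example via `scipy.optimize.linear_sum_assignment`). The cost values and the function name `match_predictions` are assumptions for illustration.

```python
from itertools import permutations


def match_predictions(cost):
    """Find the minimum-cost one-to-one assignment of predictions to targets.

    cost[i][j] is the cost of assigning prediction i to ground truth j.
    Assumes at least as many predictions as ground-truth objects.
    Returns (matched, unmatched): a dict prediction -> ground truth, and
    the predictions left over, which would be trained toward "no object".
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best_total, best_perm = float("inf"), ()
    # perm[j] is the prediction index assigned to ground truth j.
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[perm[j]][j] for j in range(n_gt))
        if total < best_total:
            best_total, best_perm = total, perm
    matched = {pred: gt for gt, pred in enumerate(best_perm)}
    unmatched = [i for i in range(n_pred) if i not in matched]
    return matched, unmatched


# 3 object queries, 2 ground-truth objects (lower cost = better fit).
cost = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
]
print(match_predictions(cost))
```

Because each ground-truth object is claimed by exactly one query, a second query predicting the same object is penalized rather than rewarded.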

DETR is conceptually elegant, but early versions converged slowly and were weak on small objects. Later work targeted these issues.

Deformable DETR: Attend only around a few reference points rather than the whole feature map, speeding up convergence and improving small-object performance.
DINO and other improvements: Follow-up families raised accuracy and convergence with better query initialization, denoising training, and more.

To summarize, detection's broad arc runs "two-stage (accurate but heavy) -> one-stage (fast, real-time) -> transformer set prediction (end-to-end, post-processing removed)," and in practice all three families still coexist depending on the goal.

Detection Family Comparison

Family	Representative	Proposals	Post-processing	Character
Two-stage	Faster R-CNN	RPN	NMS	High accuracy, relatively heavy
One-stage	YOLO, SSD	none (direct)	NMS	Fast, real-time suited
Transformer	DETR family	object queries	not needed (set prediction)	End-to-end, conceptually simple

Treat the accuracy and speed in the table as general tendencies of each family, not absolute rankings. Actual values shift substantially with backbone, input resolution, training data, and version.

Three Tasks of Image Segmentation

Segmentation answers "what is where" at the pixel level. It splits into three tasks by goal.

Semantic segmentation: Assign a class to each pixel. Different instances of the same class are not distinguished. Example: paint road, sky, and person regions, but treat three people as one blob.
Instance segmentation: Distinguish individual objects (instances) with pixel masks. Three people are separated into three different masks.
Panoptic segmentation: Unifies semantic and instance. Uncountable regions like background (stuff) are handled semantically, and countable objects (things) as instances, so the whole scene is described completely.

   same scene, three segmentation views

  semantic:  [person][person][person]  -> all one "person" color
  instance:  [person1][person2][person3] -> per-object mask
  panoptic:  things(person1,2,3) + stuff(road, sky) unified
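
The same distinction can be shown on a toy "image" of five pixels. In this sketch each pixel carries a (class, instance id) pair, which is one possible way to hold a panoptic labeling; the encoding, the `THINGS` set, and the function names are assumptions for illustration.

```python
# Classes treated as countable objects; everything else is "stuff".
THINGS = {"person"}

# One row of pixels with panoptic labels: (class, instance id).
# Stuff pixels use instance id 0 because they are not counted.
panoptic = [("sky", 0), ("person", 1), ("person", 2), ("person", 2), ("road", 0)]


def semantic_view(labels):
    """Semantic segmentation: keep the class, drop the instance identity."""
    return [cls for cls, _ in labels]


def instance_view(labels):
    """Instance segmentation: one pixel list per countable object, no stuff."""
    masks = {}
    for idx, (cls, inst) in enumerate(labels):
        if cls in THINGS:
            masks.setdefault((cls, inst), []).append(idx)
    return masks


print(semantic_view(panoptic))   # both people collapse into "person"
print(instance_view(panoptic))   # the people are separated; sky and road are absent
```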

The Arc of Segmentation Architectures

FCN (Fully Convolutional Network): Replaced fully connected layers with convolutions, opening the way to per-pixel classification for arbitrary input sizes.
U-Net: Shrink with an encoder to gain context, restore with a decoder to recover detail, and pass low-level location information through skip connections. Widely used in medical imaging.
Mask R-CNN: Added a mask prediction branch to Faster R-CNN, becoming a standard approach to instance segmentation. A representative case of binding detection and segmentation in one framework.
Mask-based transformer approaches: Recently, transformer families actively try to unify semantic, instance, and panoptic under a single "mask set prediction" framework. This can be seen as DETR's set-prediction mindset extended to segmentation.

Segment Anything: Promptable Segmentation

A Shift in Perspective

Segment Anything (SAM, arXiv 2304.02643) changed how we view segmentation. Prior models mostly trained on a specific dataset to assign a "fixed set of classes" to pixels. SAM instead proposes the task of promptable segmentation. Given a prompt such as a point, box, or rough mask, it segments the target that prompt indicates, even without a class name.

Structure

SAM has three main parts.

        SAM structure (concept)

  image --> [image encoder (heavy, computed once)] --> image embedding
                                                        |
  prompt (point/box/mask) --> [prompt encoder] --> prompt embedding
                                                        |
                                                        v
                          [lightweight mask decoder] --> mask(s)
                                                        |
                          when ambiguous: several candidate masks + scores

Image encoder: Usually a heavy vision transformer that encodes the image once. Because this embedding is reused, multiple prompts on the same image can be processed quickly.
Prompt encoder: Converts point, box, and mask prompts into embeddings.
Mask decoder: Combines the image embedding and prompt embedding to output masks. When a prompt is ambiguous (e.g., clicking a point on clothing: is it "clothing" or "person"?), it outputs several candidate masks with confidence scores.
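
The "encode once, prompt many times" design can be illustrated with a toy class. This is not the SAM API or model: the "encoder" merely copies the image, the "decoder" groups pixels by value, and the confidence scores are invented. It only shows the call pattern and the idea of returning several candidates for an ambiguous click.

```python
class ToyPromptableSegmenter:
    """Hypothetical stand-in that mimics the encode-once / prompt-many pattern."""

    def __init__(self):
        self.encoder_calls = 0
        self._embedding = None

    def set_image(self, image):
        # Stand-in for the heavy image encoder: run once and cache the result.
        self.encoder_calls += 1
        self._embedding = [row[:] for row in image]

    def predict(self, point):
        # Stand-in for the light mask decoder: reuses the cached embedding.
        if self._embedding is None:
            raise RuntimeError("call set_image() first")
        r, c = point
        value = self._embedding[r][c]
        cells = [(i, j, v) for i, row in enumerate(self._embedding)
                 for j, v in enumerate(row)]
        part = {(i, j) for i, j, v in cells if v == value}   # e.g. "clothing"
        whole = {(i, j) for i, j, v in cells if v != 0}      # e.g. "person"
        # Several candidates with (made-up) scores; the caller chooses.
        return [(part, 0.9), (whole, 0.6)]


image = [[0, 1, 1],
         [0, 2, 2]]
seg = ToyPromptableSegmenter()
seg.set_image(image)                 # expensive step, done once
first = seg.predict((0, 1))          # cheap step
second = seg.predict((1, 1))         # cheap step, same cached embedding
print(seg.encoder_calls, max(first, key=lambda m: m[1])[0])
```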

The Data Engine

Another axis often emphasized in SAM is how the large-scale mask data was built. Through an iterative process in which the model and human annotators took turns generating and reviewing masks, the authors constructed a mask dataset at a scale that would be hard to reach by hand annotation alone. Much of the generalization ability to "separate many kinds of objects by prompt" comes from this large, diverse data.

Meaning and Limits of SAM

Meaning: SAM showed the possibility of general-purpose segmentation not tied to specific classes. Combined with other detectors and recognizers, it becomes a powerful building block for downstream tasks like "cut the object in this box into a mask."
Limits: SAM itself does not directly assign a semantic label like "this is a cat." Class recognition needs a separate model or prompt design. Also, because the image encoder is heavy, real-time or resource-constrained settings call for lightweight variants or follow-up work.

Open-Vocabulary Detection and Segmentation

Traditional detectors recognize only the fixed list of classes seen in training. The open-vocabulary approach tries to loosen this constraint. Using representations trained jointly on text and images (e.g., contrastive families that align images and text in the same space), one can indicate even classes not explicitly labeled during training via text descriptions.

Idea: Treat classes as text embeddings rather than fixed integer labels. You can specify targets with free expressions like "red umbrella."
Effect: Flexible, since you can point at new categories without retraining each time.
Caution: The accuracy of free-text instructions varies widely by target, phrasing, and domain. In practice, validation on the target domain is required.
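
The "classes as text embeddings" idea boils down to a similarity lookup. In the sketch below the three-dimensional vectors are invented by hand; in a real system they would come from a jointly trained image and text encoder. The function names are assumptions for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify_region(region_embedding, text_embeddings):
    """Pick the text prompt whose embedding is closest to the region's."""
    scores = {label: cosine(region_embedding, emb)
              for label, emb in text_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hand-made toy embeddings standing in for encoder outputs.
text_embeddings = {
    "red umbrella": [1.0, 0.0, 0.2],
    "dog": [0.0, 1.0, 0.0],
}
region = [0.9, 0.1, 0.1]
print(classify_region(region, text_embeddings)[0])
```

Adding a new category means adding one more text embedding to the dictionary, with no retraining of the detector in this picture; how well that works on a given domain still has to be measured.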

Combining SAM-style promptable segmentation with open-vocabulary detection lets you build a pipeline that "specifies what to find via text and separates that target into a pixel mask." A hallmark of recent trends is that detection, segmentation, and language-aligned representations combine like building blocks.

Backbone Evolution: What Extracts the Features

Detection and segmentation models can usually be split into three parts: "backbone + neck + head." Among these, the backbone is the core part that extracts features from the image, and backbone progress has driven much of detection and segmentation performance.

CNN backbones: Families like ResNet stack convolutions deeply to learn local patterns hierarchically. They were the standard detection and segmentation backbone for a long time.
Feature pyramid (FPN): Connects feature maps of different resolutions top-down to handle small objects (high resolution) and large objects (low resolution) together. A core part of multi-scale detection.
Vision transformer backbones: Split the image into patches and learn global relationships with attention. Hierarchical transformer backbones preserve both locality and globality and are widely used as detection and segmentation backbones too.

   3-part structure of detection/segmentation models (concept)

  image
    |
  [backbone]  --> feature maps at various resolutions
    |
  [neck (FPN, etc.)]  --> cross-scale feature fusion
    |
  [head]  --> box/class or mask prediction
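
The top-down path of a feature pyramid can be sketched on plain nested lists. This is a simplification under stated assumptions: single-channel maps, nearest-neighbor upsampling by a factor of 2, and element-wise addition. The learned lateral and smoothing convolutions that real FPN implementations apply are left out, and the function names are made up for this example.

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]   # repeat each column
        out.append(wide)
        out.append(list(wide))                      # repeat each row
    return out


def top_down_merge(fine, coarse):
    """Add the upsampled coarse map onto the finer, higher-resolution map."""
    up = upsample2x(coarse)
    return [[f + u for f, u in zip(fine_row, up_row)]
            for fine_row, up_row in zip(fine, up)]


coarse = [[1, 2]]                         # low resolution, strong semantics
fine = [[10, 10, 10, 10],
        [10, 10, 10, 10]]                 # high resolution, fine detail
print(top_down_merge(fine, coarse))
```

The merged map keeps the fine map's resolution while carrying context from the coarse level, which is what helps small objects.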

Backbone choice is a big knob for accuracy and speed. A heavy backbone is accurate but slow; a light backbone is fast but loses out on hard scenes. In practice you pick a backbone for target latency and accuracy, and start from pretrained weights if needed.

Evaluation Metrics: mAP and IoU

Understanding how detection and segmentation results are judged "good or bad" lets you read benchmark numbers far more accurately.

IoU (Intersection over Union): The degree of overlap between predicted and ground-truth regions. Intersection divided by union; closer to 1 is a better match. For boxes and masks alike, it is the basic measure of how much two regions overlap.
Precision and recall: Precision is "the fraction of predictions that were correct," recall is "the fraction of ground truth that was found." They tend to trade off, so look at both.
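AP and mAP (mean Average Precision): AP is the area under one class's precision-recall curve, traced by sweeping the confidence threshold; mAP is the mean of AP over classes. Some benchmarks, COCO among them, additionally average over several IoU thresholds. The representative metric for detection and instance segmentation.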
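
To make the AP definition concrete, here is a minimal sketch that sums the area under the precision-recall curve as detections are added in order of confidence. It assumes each detection has already been marked true or false positive at some IoU threshold, and it uses the plain, non-interpolated sum; benchmark toolkits typically interpolate precision, so their numbers will differ. The function names and the input layout are assumptions.

```python
def average_precision(detections, num_gt):
    """AP for one class.

    detections: list of (confidence, is_true_positive) pairs.
    num_gt: number of ground-truth objects of this class.
    """
    if num_gt == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for _, is_tp in ranked:          # lowering the threshold one detection at a time
        if is_tp:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt
        ap += precision * (recall - prev_recall)   # area of one strip
        prev_recall = recall
    return ap


def mean_average_precision(per_class):
    """per_class: dict class name -> (detections, num_gt)."""
    aps = [average_precision(d, n) for d, n in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


per_class = {
    "person": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "car": ([(0.6, True)], 1),
}
print(mean_average_precision(per_class))
```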

   IoU concept (box example)

  IoU = intersection area / union area

  good match: prediction and ground truth overlap greatly --> high IoU
  miss:       small overlap --> low IoU

  a prediction at or above an IoU threshold (e.g. 0.5) usually counts as a "correct detection"
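
A worked version of the formula above, for both boxes and masks. The sketch assumes boxes as (x1, y1, x2, y2) and masks as sets of (row, column) pixels; the function names are chosen for this example, and the correctness check looks at overlap only, not at whether the class matches.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mask_iou(a, b):
    """IoU of two masks given as sets of (row, col) pixels."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_correct_detection(pred_box, gt_box, threshold=0.5):
    """Overlap test only; a full evaluation also requires the class to match."""
    return box_iou(pred_box, gt_box) >= threshold


# Two 10x10 boxes shifted by half a width: intersection 50, union 150.
print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))
print(is_correct_detection((0, 0, 10, 10), (5, 0, 15, 10)))
print(mask_iou({(0, 0), (0, 1), (1, 0)}, {(0, 1), (1, 0), (1, 1)}))
```

The first pair shows why a box can look roughly right and still fail: a shift of half the box width already drops IoU to one third, below the 0.5 threshold.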

Note that one mAP number is not enough to rank models. Even at the same mAP, strengths on small objects, large objects, and specific classes can differ. Rankings also flip depending on the benchmark dataset's characteristics (urban roads, indoors, aerial, etc.). A number only means something when you also know "on which dataset, at which IoU criterion" it was measured.

The Loss Function Angle

Detection and segmentation models train by combining several losses. Knowing what each loss teaches makes the model's behavior easier to understand.

Classification loss: Teaches which class (or background) each location or candidate is. Losses that weight hard examples are sometimes used to handle the imbalance where background is far more common.
Box regression loss: Adjusts predicted boxes closer to ground-truth boxes. It either measures coordinate differences directly or uses IoU itself as the loss.
Mask loss (segmentation): Compares masks with ground truth per pixel. A per-pixel classification loss and a region-overlap loss (such as Dice) are often used together.
Matching loss (DETR family): Pairs the predicted set and ground-truth set one-to-one, then computes the box and class losses on the matched pairs; unmatched predictions are trained toward "no object."
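
Two of these losses are short enough to write out. The sketch below shows a binary focal loss as one example of "a loss that weights hard examples" and a Dice loss as one example of a region-overlap loss. They are named here as illustrations; the article does not say which specific losses any given model uses, and the default `gamma` and `alpha` values are common choices rather than requirements.

```python
import math


def focal_loss(p, target, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class; target: 0 or 1.
    The (1 - p_t) ** gamma factor shrinks the loss of easy examples.
    """
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)            # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def dice_loss(pred, gt, eps=1e-6):
    """Dice loss between a predicted mask and ground truth (flat lists in [0, 1])."""
    inter = sum(p * g for p, g in zip(pred, gt))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(gt) + eps)


# An easy background pixel contributes far less than a missed object.
print(focal_loss(0.1, 0), focal_loss(0.1, 1))
# Perfect overlap gives a loss near 0, no overlap a loss near 1.
print(dice_loss([1, 1, 0, 0], [1, 1, 0, 0]), dice_loss([1, 0], [0, 1]))
```

With `gamma=0` the focal term disappears and the expression reduces to a class-weighted cross-entropy.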

Loss design often governs accuracy and stability. If background imbalance is not handled well, for example, the model can drift toward predicting "everything is background."

One thing to remember is that the weight balance among losses shifts the model's tendencies. Weighting box regression heavily makes locations accurate but can shake classification, and vice versa. A loss that emphasizes hard examples helps catch rare objects, but with much label noise it risks learning the noise instead. Think of loss as the language that tells the model "what to consider more important."

Practical Pipeline and Common Mistakes

Here are points you meet repeatedly when building a real detection and segmentation system.

Data is 80 percent: The quantity and quality of domain-appropriate labeled data governs performance. Data with varied lighting, angle, and occlusion helps generalization.
Class imbalance: When the ratio between common and rare classes is large, rare classes get ignored. Ease it with sampling, weighting, and augmentation.
Small-object problem: Small objects have few pixels in feature maps and are easy to miss. High-resolution input, multi-scale features (FPN), and tiled inference help.
Domain gap: When training data differs from the actual deployment environment, performance drops. Fine-tuning on deployment data or domain adaptation is needed.
Post-processing tuning: Values like NMS threshold and confidence threshold must be tuned to the goal (recall-first vs. precision-first).
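
Tiled inference, mentioned under the small-object problem, starts with cutting the image into overlapping windows. The sketch below only computes the window coordinates; running the detector on each tile, shifting the boxes back to full-image coordinates, and merging duplicates along the overlaps are left out. The tile size and overlap are arbitrary example values, and `tile_windows` is a name chosen for this example.

```python
def tile_windows(img_w, img_h, tile, overlap):
    """Return (x1, y1, x2, y2) windows that cover the image with overlap."""
    stride = tile - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than tile")

    def starts(size):
        if size <= tile:
            return [0]                        # one tile covers this axis
        s = list(range(0, size - tile, stride))
        s.append(size - tile)                 # last tile sits flush with the border
        return s

    return [(x, y, min(x + tile, img_w), min(y + tile, img_h))
            for y in starts(img_h)
            for x in starts(img_w)]


for window in tile_windows(1000, 640, tile=640, overlap=128):
    print(window)
```

The overlap is there so that an object cut by one tile border is fully visible in a neighboring tile.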

   a typical detection pipeline (concept)

  data collection / labeling
    |
  preprocess/augment (crop, color jitter, mosaic, etc.)
    |
  train with pretrained backbone
    |
  measure mAP on validation + tune thresholds
    |
  fine-tune on deployment-environment data
    |
  lighten (quantize, etc.) then deploy
    |
  collect false positives/negatives in operation -> feed back to data

The most common mistake is assuming "the benchmark numbers are good, so our problem will be solved too." Always validate the gap between benchmark and real domain, and keep the habit of re-measuring on your own data.

Combining Detection and Segmentation

One big feature of recent trends is that detection, segmentation, and language combine like parts. Representative combinations are as follows.

Detection + SAM: Get object boxes with a detector, feed those boxes as SAM prompts to obtain precise masks. Detection handles "what is where," SAM handles "the pixel boundary."
Open-vocabulary detection + SAM: Instruct targets via text to get boxes, cut masks with SAM, then follow across frames with a tracker if needed.
Segmentation + depth/3D: Combining segmentation masks with depth information tells you "where and how far this object is," useful in robotics and AR.
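
The first combination above (detector boxes used as segmentation prompts) is mostly glue code. The sketch below shows that glue with two stand-in functions. `toy_detector` and `toy_segmenter` are hypothetical stubs, not real models or any library's API; in practice they would wrap an actual detector and a promptable segmenter.

```python
from typing import Callable, Dict, List, Set, Tuple

Box = Tuple[int, int, int, int]          # (x1, y1, x2, y2), end-exclusive
Mask = Set[Tuple[int, int]]              # set of (row, col) pixels
Image = List[List[int]]


def detect_then_segment(
    image: Image,
    detector: Callable[[Image], List[Tuple[Box, str]]],
    segmenter: Callable[[Image, Box], Mask],
) -> List[Dict]:
    """Detector answers "what is where"; segmenter refines each box to a mask."""
    results = []
    for box, label in detector(image):
        results.append({"label": label, "box": box, "mask": segmenter(image, box)})
    return results


def toy_detector(image: Image) -> List[Tuple[Box, str]]:
    # Stub: pretend one object was found in the top-left 2x2 region.
    return [((0, 0, 2, 2), "object")]


def toy_segmenter(image: Image, box: Box) -> Mask:
    # Stub: inside the box, treat non-zero pixels as foreground.
    x1, y1, x2, y2 = box
    return {(r, c) for r in range(y1, y2) for c in range(x1, x2) if image[r][c] > 0}


image = [[1, 0, 0],
         [1, 1, 0],
         [0, 0, 0]]
print(detect_then_segment(image, toy_detector, toy_segmenter))
```

Because the two parts only exchange a box, either one can be swapped out without touching the other.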

Such combinations are possible because each part evolves independently yet connects through standard interfaces (boxes, masks, text embeddings). To understand the lineage is to know how these parts can be assembled.

The Real-Time Detection View

In practice, detection is often bound by latency. The factors that govern real-time detection are roughly as follows.

Backbone lightening: Raise throughput with reduced-compute backbones and efficient feature fusion.
Anchor-free and end-to-end elements: Simplify the pipeline by reducing post-processing and tuning burden.
Input resolution tuning: Lowering resolution is faster but hurts small-object performance, a trade-off.
Hardware and quantization: Lower precision (quantization) or optimize for a specific accelerator to raise throughput.
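
What "lower precision" means can be shown on a handful of weights. The sketch below uses symmetric per-tensor int8 quantization, one simple scheme among several; real toolchains add calibration, per-channel scales, and other refinements. The function names are assumptions for this example.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] with one shared scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers."""
    return [x * scale for x in q]


weights = [-1.0, 0.0, 0.5, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# The round trip is lossy: each value is off by at most about half a step.
print(q, max(abs(a - b) for a, b in zip(weights, restored)))
```

Storing one small integer per weight instead of a float is where the memory and throughput gain comes from; the rounding error is the price, which is why accuracy should be re-measured after quantization.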

The reason the YOLO family has long been cited as the representative of real-time detection is that it balances these factors without heavily sacrificing accuracy. That said, which version is "best" depends on target latency, accuracy requirements, and deployment hardware, so it is safer to avoid absolute conclusions.

Strengths and Limits Summary

Approach	Strength	Limit
Faster R-CNN	Stable high accuracy	Relatively heavy for real-time
YOLO/SSD	Fast real-time inference	May struggle with tiny or dense scenes
DETR family	End-to-end, no NMS	Slow convergence in early versions (improved by later work)
Mask R-CNN	Unified detection + instance segmentation	Heavy pipeline
SAM	Class-agnostic general segmentation	No semantic labels, heavy encoder
Open-vocabulary	Instruct via free text	Accuracy varies by domain and phrasing

The Lineage at a Glance

Organizing the arc so far as a rough chronological lineage gives the following. The big picture emerges once you note which hand-designed element each stage absorbed into learning.

   lineage of detection/segmentation progress (concept)

  [detection]
  R-CNN --> Fast R-CNN --> Faster R-CNN   (absorb proposals into RPN)
       \--> YOLO / SSD                    (remove proposals, one-stage)
            \--> DETR                     (remove anchors/NMS, set prediction)
                 \--> Deformable/DINO etc (better convergence, small objects)

  [segmentation]
  FCN --> U-Net                           (encoder-decoder, skips)
       \--> Mask R-CNN                    (detection + instance segmentation)
            \--> mask transformer family  (unified mask set prediction)

  [generalization]
  SAM (promptable segmentation)
  + open-vocabulary (instruct via text)
  = assembly-style vision system

The recurring pattern in this lineage is "replacing elements humans used to hand-tune with learnable parts." With this view, you can quickly place a newly appearing model at its spot in the lineage.

One more thing: this lineage is not a story of "the new completely replacing the old." Two-stage detectors like Faster R-CNN are still used where accuracy matters, the YOLO family remains the real-time standard, and the DETR family keeps gaining ground with its post-processing-free design. Each family sits on different trade-offs, and practitioners choose according to where among those trade-offs they want to stand.

What to Choose in Which Situation

Accuracy first, latency to spare: Two-stage like Faster R-CNN or a detector with a strong backbone. For precise instance segmentation, the Mask R-CNN family.
Real-time essential (surveillance, robots, mobile): A one-stage detector like the YOLO family with a lightweight backbone and quantization.
Want to reduce post-processing tuning: The end-to-end approach of the DETR family. Consider training cost and convergence, though.
Need precise masks, flexible classes: The combination of finding targets with a detector and cutting masks with SAM.
Must handle new classes often: Text instructions via open-vocabulary detection. Domain validation is essential.

There is no single answer; usually you combine several parts to fit the goal. This is the practical reason to understand the lineage.

Conclusion

The evolution of detection and segmentation can be summed up as "a process of absorbing human-designed components into learning, one at a time." Hand-crafted region proposals were replaced by the RPN, anchors and NMS post-processing by set prediction, and fixed classes by text-aligned representations. SAM added the axis of "segment anything by prompt."

The key is not any model's momentary ranking, but that the concepts each model introduced recombine like parts, keeping the whole vision system flexible. Even when a new SOTA appears, understanding the mindset of this lineage lets you follow the change far faster.

Finally, I want to stress one attitude for practitioners. Rather than always grabbing the top model on a benchmark leaderboard, it is better to first clarify the trade-offs your problem demands (latency vs. accuracy, box vs. mask, fixed class vs. free instruction). Once that trade-off is set, choosing and assembling the right parts on the lineage organized here becomes far easier. Models keep changing, but the mindset of handling trade-offs endures.

References

DETR paper "End-to-End Object Detection with Transformers": arxiv.org/abs/2005.12872
Segment Anything paper: arxiv.org/abs/2304.02643
Segment Anything official page: segment-anything.com
Faster R-CNN paper: arxiv.org/abs/1506.01497
Mask R-CNN paper: arxiv.org/abs/1703.06870
U-Net paper: arxiv.org/abs/1505.04597
Deformable DETR paper: arxiv.org/abs/2010.04159
SSD paper: arxiv.org/abs/1512.02325
Segment Anything code: github.com/facebookresearch/segment-anything

Quiz

Q1: What is the fundamental difference between two-stage and one-stage detection?

Two-stage first generates candidate regions, then classifies each candidate (e.g., Faster R-CNN). One-stage predicts boxes and classes directly from the image with no proposal step (e.g., YOLO, SSD). One-stage is usually faster.

Q2: Why can DETR eliminate NMS?

Because during training Hungarian matching pairs predictions and ground truth one-to-one and trains unmatched queries to predict "no object," the model learns not to emit duplicate predictions. As a set-prediction method, it has no need to remove overlapping boxes afterward.

Q3: What distinguishes semantic, instance, and panoptic segmentation?

Semantic assigns only a class per pixel and does not separate instances. Instance separates individual objects into distinct masks. Panoptic unifies both: background (stuff) semantically and objects (things) as instances.

Q4: What is "promptable segmentation" as SAM describes it?

Given prompts like a point, box, or rough mask, it segments the target even without a class name. It aims at general-purpose segmentation not bound to a fixed set of classes.

Q5: Why do SAM's image encoder and mask decoder have different compute costs?

The image encoder is heavy but runs once per image, reusing the embedding. The mask decoder is lightweight, so multiple prompts on the same image can be processed quickly.

Q6: What is the core idea of open-vocabulary detection?

Treating classes as text embeddings rather than fixed integer labels, so targets not explicitly labeled during training can be indicated via free text. It leverages text-image aligned representations.

Q7: Name one limitation of SAM.

SAM only produces masks; it does not directly attach a semantic label like "this is X." Class recognition needs a separate model, and its heavy image encoder requires lightening for real-time and lightweight settings.

Q8: Why is it hard to assert absolute rankings when choosing a detection model?

Because accuracy and speed vary greatly with backbone, input resolution, training data, version, and deployment hardware. It is realistic to choose a family based on target latency and accuracy needs.

Q9: What do IoU and mAP each measure?

IoU measures the overlap (intersection divided by union) between predicted and ground-truth regions. AP is the area under one class's precision-recall curve, traced by sweeping the confidence threshold, and mAP is its mean over classes; it is the representative metric for detection and instance segmentation.

Q10: In the "backbone + neck + head" structure, what is the role of the neck (FPN, etc.)?

It fuses feature maps of different resolutions to handle small objects (high resolution) and large objects (low resolution) together. A core part of multi-scale detection.

Q11: Why does "high benchmark mAP" not guarantee solving the real problem?

Because the benchmark dataset's characteristics can differ from the actual deployment domain. Different lighting, angle, and class distribution hurt performance, so re-validating on your own data matters.

Q12: What is a typical way to combine a detector with SAM?

Get object boxes with a detector, then feed those boxes as SAM prompts to obtain precise pixel masks. The detector handles "what is where," SAM handles "the pixel boundary," a division of roles.