SOTA Real-Time Video Analysis — Tracking, Understanding, Efficient Inference

Introduction
Key Tasks of Video Understanding
The Architectural Arc of Video Models
- From 3D Convolutions to Spatio-Temporal Attention
Video Segmentation and Tracking: The SAM 2 Family View
Multi-Object Tracking (MOT)
- MOT Evaluation Metrics
- Two Branches of Tracking Paradigm
The Action Recognition View
Combining Video and Language
Methods of Temporal Modeling
The Difficulty of Data and Labeling
Real-Time Inference Optimization
Application Areas
Common Problems in Practice
The Hardware and Deployment View
How to Read Benchmarks
Strengths and Limits Summary
Conclusion
References
Quiz

Introduction

Understanding a single image and understanding a video that unfolds over time are different problems. Video adds a temporal axis; objects move, get occluded, and reappear. To answer "what is happening now" and "where did this object go in the next frame" in real time, you must weigh not only accuracy but also latency.

This article organizes the representative tasks and approaches of real-time video analysis around architectural principles. Because the state of the art in video shifts quickly, I avoid firm claims about any model's ranking or numbers and focus on concepts and the arc of model families. Where the latest specifications are uncertain, I describe them only in general terms.

There are four broad strands. First, the tasks video understanding divides into. Second, the architectural arc of video models that solve those tasks. Third, the SAM 2 family view that unifies segmentation and tracking, plus multi-object tracking. Fourth, the reality of optimization and deployment that makes real-time possible. Each strand ultimately converges on one question: "where to stand between accuracy and latency."

Key Tasks of Video Understanding

Video understanding splits into several tasks by goal. The representative ones are as follows.

Action recognition: Classify "what action is happening" in a short clip or stream. Assign labels like walking, running, or picking up an object.
Temporal action localization: In a long video, find "from when to when which action happened" as intervals.
Object tracking: Continuously follow a specific object across frames. Splits into single-object and multi-object tracking.
Temporal/video segmentation: Keep an object's pixel mask consistent across frames. It is image segmentation extended along the temporal axis.
Video question answering and captioning: Describe video content in language or answer questions. This is the realm of multimodal models.

   map of video understanding tasks (concept)

  classification axis: "what happens"     --> action recognition
  temporal axis:       "when it happens"  --> temporal localization
  tracking axis:       "where the object goes" --> object tracking (MOT/SOT)
  pixel axis:          "how far the boundary" --> video segmentation
  language axis:       "describe / answer in words" --> video-language model

These tasks are not independent. In autonomous driving, for example, tracking, segmentation, and action prediction mesh together within one pipeline.

Another point to stress is that video understanding is not simply "doing image understanding many times." The relationship between frames, that is temporal consistency, is central. If the same object is seen as different across frames, tracking collapses; if the mask wobbles frame to frame, segmentation becomes useless. So video models are asked for not only "accuracy" but "stability over time."

The Architectural Arc of Video Models

From 3D Convolutions to Spatio-Temporal Attention

Early video models used 3D convolutions, extending image 2D convolutions along the temporal axis. This scans space and time together to capture motion patterns. Later, as transformers were successfully applied to images (vision transformers), spatio-temporal attention was introduced to video as well.

        video transformer (concept)

  video = sequence of frames
    |
  split each frame into patches (space)
    |
  concatenate frames over time (time)
    |
  [spatio-temporal attention]
     - spatial attention: relations within one frame
     - temporal attention: relations across frames
    |
  classification / detection / segmentation head

The challenge of spatio-temporal attention is compute. The token count is the number of frames multiplied by the number of patches per frame, and the cost of full self-attention grows with the square of that token count. Representative trade-offs to manage this are as follows.

Factorized attention: Compute spatial and temporal attention separately to turn a multiplicative cost into an additive one.
Token reduction and sampling: Reduce unimportant tokens or sample frames sparsely.
Windowed and local attention: Restrict attention to a local range instead of global.

These trade-offs are chosen along the tension between accuracy and speed. The "best" combination depends on the task and latency requirement.
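
To make the cost argument concrete, the sketch below counts query-key pairs for joint attention and for a factorized variant. It is a back-of-the-envelope count, not a profile of any particular model; the function names and the clip size (16 frames, 196 patches per frame) are illustrative assumptions.

```python
def joint_attention_pairs(num_frames: int, patches_per_frame: int) -> int:
    """Query-key pairs when every token attends to every token in the clip."""
    tokens = num_frames * patches_per_frame
    return tokens * tokens


def factorized_attention_pairs(num_frames: int, patches_per_frame: int) -> int:
    """Query-key pairs when spatial and temporal attention run as separate steps."""
    # Spatial step: within each frame, every patch attends to every patch.
    spatial = num_frames * patches_per_frame * patches_per_frame
    # Temporal step: at each patch position, every frame attends to every frame.
    temporal = patches_per_frame * num_frames * num_frames
    return spatial + temporal


# Illustrative sizes only: 16 frames, 14 x 14 = 196 patches per frame.
T, N = 16, 196
joint = joint_attention_pairs(T, N)
factorized = factorized_attention_pairs(T, N)
print(f"joint: {joint:,} pairs, factorized: {factorized:,} pairs")
print(f"reduction factor: {joint / factorized:.1f}x")
```

Per token, the joint variant attends to T * N other tokens while the factorized variant attends to N + T, which is the sense in which a multiplicative cost becomes an additive one.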

Spatio-temporal attention is powerful because it can directly learn "relationships between distant frames." For example, even if an action's start and end are seconds apart, attention can connect the two moments. This contrasts with 3D convolution, which is mainly strong on local motion of adjacent frames. But the price of this expressiveness is compute, and real-time systems manage that price with the trade-offs above. This is the central tension of video architecture design.

Video Segmentation and Tracking: The SAM 2 Family View

If Segment Anything (SAM) opened promptable segmentation for images, extending that mindset to video followed naturally. The SAM 2 family can be understood as broadening "promptable segmentation of images" into "promptable segmentation and tracking of video."

The core ideas are as follows.

Once you specify a target with a prompt like a point or box in one frame, the model propagates that target's mask across subsequent frames.
A memory structure holding past information between frames lets the model reconnect the same target even after it is briefly occluded and reappears.
It aims at a streaming design that processes frames sequentially, so it can handle long videos.

     video promptable segmentation and tracking (concept)

  frame t=0: specify target with prompt (point/box) --> mask
       |
       v  (store target representation in memory)
  frames t=1..N: reference memory to propagate mask
       |
  occlusion/reappearance: reconnect same target via memory
       |
  result: target's mask stays consistent across all frames

The important point here is that "tracking and segmentation are bound into one." Traditionally tracking handled boxes and segmentation handled masks, but this family provides mask-level tracking in a unified way. That said, real-time performance, stability on long videos, and complex occlusion remain hard problems, and detailed performance varies with implementation and settings.

The role of the memory structure deserves special emphasis. When an object briefly leaves the screen and returns, a model without memory easily mistakes it for a new object. Remembering the target representation from past frames lets the model reconnect the reappearing object as "that same target." This shares the purpose of the re-ID used in MOT, covered in the next section, but differs in handling it with unified memory rather than a separate re-ID feature. It is an interesting feature of recent trends that ideas developed separately across detection, tracking, and segmentation now meet within one framework.
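
The idea of "remember the target, then reconnect it" can be illustrated with a toy memory bank. This is not how SAM 2 is implemented (its memory operates inside the network on frame features); the class name `TargetMemory`, the cosine-similarity rule, and the threshold are all assumptions chosen to keep the sketch small.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TargetMemory:
    """Keeps the most recent target features and matches candidates against them."""

    def __init__(self, capacity=4, threshold=0.8):
        self.bank = deque(maxlen=capacity)  # oldest entries fall out automatically
        self.threshold = threshold

    def remember(self, feature):
        self.bank.append(list(feature))

    def reconnect(self, candidates):
        """Return the index of the candidate closest to memory, or None if none is close enough."""
        if not self.bank:
            return None
        best_index, best_score = None, self.threshold
        for i, candidate in enumerate(candidates):
            score = sum(cosine(m, candidate) for m in self.bank) / len(self.bank)
            if score >= best_score:
                best_index, best_score = i, score
        return best_index


memory = TargetMemory()
memory.remember([1.0, 0.0, 0.2])   # target features before occlusion
memory.remember([0.9, 0.1, 0.2])
after_occlusion = [[0.0, 1.0, 0.1], [0.95, 0.05, 0.25]]
match_index = memory.reconnect(after_occlusion)
print("reconnected to candidate", match_index)
```

Returning None when nothing is similar enough is what lets a system treat the object as genuinely new instead of forcing a wrong reconnection.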

Multi-Object Tracking (MOT)

Multi-Object Tracking (MOT), which tracks many objects at once, is a core axis of real-time video analysis. MOT's goal is not simply "finding objects" but assigning each object a consistent identity (ID) over time. Person 3 in frame 12 and person 3 in frame 400 must be the same person to be meaningful. A widely used approach is tracking-by-detection.

        tracking-by-detection

  each frame [detector] --> this frame's boxes
       |
  [associate] previous tracks with this frame's boxes
       - position prediction: estimate next location with Kalman filter, etc.
       - appearance matching: judge same object via re-ID features
       |
  same object keeps track ID, new object gets a new ID
       |
  result: attach a consistent ID to each object and track over time

Position prediction: A motion model like a Kalman filter predicts the next-frame location to narrow candidates.
Association: Pair predicted positions with detected boxes. Assignment methods like the Hungarian algorithm are used.
Re-identification (re-ID): Compare appearance features to reconnect a briefly occluded object under the same ID.
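
The association step can be sketched with box overlap (IoU) as the matching score. To stay dependency-free, this sketch uses greedy matching instead of the Hungarian algorithm; an optimal assignment solver such as `scipy.optimize.linear_sum_assignment` could replace the greedy loop. The function names, the box format (x1, y1, x2, y2), and the `min_iou` threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(predicted, detections, min_iou=0.3):
    """Greedily pair predicted track boxes with detections, best overlap first.

    predicted: dict of track_id -> box, detections: list of boxes.
    Returns (matches, unmatched_track_ids, unmatched_detection_indices).
    """
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in predicted.items()
         for j, det in enumerate(detections)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    matches, used = {}, set()
    for score, tid, j in pairs:
        if score < min_iou:
            break  # all remaining pairs overlap even less
        if tid in matches or j in used:
            continue
        matches[tid] = j
        used.add(j)
    unmatched_tracks = [tid for tid in predicted if tid not in matches]
    unmatched_dets = [j for j in range(len(detections)) if j not in used]
    return matches, unmatched_tracks, unmatched_dets


predicted = {1: (10, 10, 50, 50), 2: (100, 100, 140, 140)}
detections = [(102, 98, 142, 138), (12, 11, 52, 51), (300, 300, 320, 320)]
matches, lost_tracks, new_detections = associate(predicted, detections)
print(matches, lost_tracks, new_detections)
```

Unmatched detections are the candidates for new track IDs, and unmatched tracks are the ones a re-ID step would later try to reconnect.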

For reference, tracking splits by the number of targets. Single-object tracking (SOT) follows one target the user specifies in the first frame to the end, without needing to know in advance what the target is. Multi-object tracking (MOT) handles many targets at once, usually detecting and tracking all objects of predefined classes (person, car, etc.). The two tasks differ in goal and difficulty, and the detection-based tracking covered here is mainly the MOT approach.

The hard problems of MOT are ID switching (IDs swapping during occlusion or crossing) and dense scenes. Performance is governed by the combination of a good detector, a robust motion model, and discriminative re-ID features. Which method is "best" depends on scene density, camera motion, and latency requirements, so it is safer to compare methods by these conditions than to rank them outright.
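
As an illustration of the motion model, here is a minimal one-dimensional constant-velocity Kalman filter that tracks a single coordinate. Real trackers usually filter a full box state; the class name, the initial covariance, and the noise variances below are assumed values for the sketch, not settings from any specific tracker.

```python
class ConstantVelocityKalman1D:
    """Kalman filter over state [position, velocity] with position-only measurements."""

    def __init__(self, position, velocity=0.0, process_var=1.0, meas_var=4.0):
        self.p = float(position)
        self.v = float(velocity)
        # Covariance entries [[pp, pv], [pv, vv]]; velocity starts very uncertain.
        self.pp, self.pv, self.vv = 10.0, 0.0, 100.0
        self.q = process_var  # simplification: same process noise on both states
        self.r = meas_var

    def predict(self, dt=1.0):
        """Advance the state one step and return the predicted position."""
        self.p += self.v * dt
        # P = F P F^T + Q with F = [[1, dt], [0, 1]]
        pp = self.pp + 2 * dt * self.pv + dt * dt * self.vv
        pv = self.pv + dt * self.vv
        self.pp, self.pv, self.vv = pp + self.q, pv, self.vv + self.q
        return self.p

    def update(self, measured_position):
        """Correct the state with a detected position."""
        innovation = measured_position - self.p
        s = self.pp + self.r                    # innovation variance
        k_p, k_v = self.pp / s, self.pv / s     # Kalman gain
        self.p += k_p * innovation
        self.v += k_v * innovation
        # P = (I - K H) P with H = [1, 0]
        pp, pv, vv = self.pp, self.pv, self.vv
        self.pp, self.pv, self.vv = (1 - k_p) * pp, (1 - k_p) * pv, vv - k_v * pv


# An object moving 2 units per frame, observed without noise for simplicity.
kf = ConstantVelocityKalman1D(position=0.0)
for z in [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0]:
    kf.predict()
    kf.update(z)
next_position = kf.predict()
print(f"estimated velocity {kf.v:.2f}, predicted next position {next_position:.2f}")
```

The predicted position is what the association step compares against new detections, which is how the motion model narrows the candidates.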

MOT Evaluation Metrics

Unlike detection, MOT performance must also measure "how long and how consistently it followed." Organizing the representative metrics at a conceptual level:

Accurate localization/detection: Does it find objects well and get their positions right each frame?
ID consistency: Does it keep the same ID for the same object over time? Fewer ID switches is better.
Association quality: Did it pair predicted and ground-truth trajectories well across frames?

There are composite metrics that try to bundle these facets into one number (MOTA, IDF1, and HOTA are common examples), but no metric is a cure-all. An application where missed objects are costly (surveillance) and one where ID consistency matters most (counting statistics) weight these facets differently.
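
The sketch below counts ID switches from per-frame matches and plugs them into MOTA, one widely used composite metric, defined as 1 - (FN + FP + IDSW) / GT. The input format (one dict per frame mapping ground-truth ID to matched track ID) and the sample numbers are assumptions made for illustration.

```python
def count_id_switches(frames):
    """Count how often a ground-truth object changes its matched track ID.

    frames: list of dicts, each mapping ground-truth id -> matched track id.
    Objects that are unmatched in a frame are simply absent from that dict.
    """
    last_seen = {}
    switches = 0
    for assignment in frames:
        for gt_id, track_id in assignment.items():
            if gt_id in last_seen and last_seen[gt_id] != track_id:
                switches += 1
            last_seen[gt_id] = track_id
    return switches


def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


# Two objects "a" and "b" over four frames; they swap IDs in frame 3
# and "b" is missed entirely in frame 4.
frames = [
    {"a": 1, "b": 2},
    {"a": 1, "b": 2},
    {"a": 2, "b": 1},
    {"a": 2},
]
switches = count_id_switches(frames)
score = mota(false_negatives=1, false_positives=0, id_switches=switches, num_gt=8)
print(f"ID switches: {switches}, MOTA: {score:.3f}")
```

Because MOTA sums all error types into one count, a tracker with many misses and one with many ID switches can score the same, which is one reason to read the individual facets as well.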

Two Branches of Tracking Paradigm

Tracking can be split into two broad approaches.

Tracking-by-detection: The approach described above, detecting per frame then associating. It depends heavily on detector performance and is easy to modularize.
Joint detection and tracking: Trains detection and tracking together as one model. Approaches include directly propagating features across frames or representing tracks as transformer queries.

Transformer-based tracking can be seen as DETR's set-prediction mindset extended along the temporal axis. In the manner of "carry the previous frame's track queries into the next frame and update them," it tries to bind detection and association into one learned process. That said, which approach is superior depends on scene and latency requirements, so it is safer to understand this at the family level.

The Action Recognition View

Action recognition judges "what is happening" over time. Unlike still-image classification, motion itself is information. For example, "sitting down" and "standing up" are hard to distinguish from one frame, but become clear when you watch the flow of time.

Motion cues: Adding a motion representation such as optical flow alongside appearance helps distinguish actions. Early models even processed appearance and motion as separate streams.
Time span: Short actions (waving) and long actions (cooking) need different time windows. A model must handle diverse temporal scales.
Skeleton-based action recognition: There is also an approach that extracts human joints (skeleton) and classifies actions by their motion. It has the advantage of being less sensitive to background and lighting changes.

   information sources of action recognition (concept)

  appearance (what is visible)  --> object/scene cues
  motion (how it moves)         --> optical flow, frame difference
  skeleton (how joints move)    --> skeleton sequence
       |
  combine these to judge "what action" over time
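
A toy example shows why motion carries information that a single frame lacks. Given binary silhouettes, the net vertical movement of the silhouette's centroid separates "sitting down" from "standing up" even though both sequences contain the same frames. The helper names and the tiny 4 x 3 masks are assumptions; real systems would use optical flow or learned features.

```python
def silhouette(top_row, height=4, width=3):
    """Binary mask with a one-pixel-wide figure from top_row down to the bottom."""
    return [[1 if (r >= top_row and c == 1) else 0 for c in range(width)]
            for r in range(height)]


def vertical_centroid(mask):
    """Mean row index of foreground pixels, or None for an empty mask."""
    rows = [r for r, row in enumerate(mask) for value in row if value]
    return sum(rows) / len(rows) if rows else None


def vertical_motion(masks):
    """Centroid shift from first to last frame; positive means moving down the image."""
    centroids = [c for c in map(vertical_centroid, masks) if c is not None]
    return centroids[-1] - centroids[0] if len(centroids) >= 2 else 0.0


tall, middle, low = silhouette(0), silhouette(1), silhouette(2)
sitting_down = [tall, middle, low]
standing_up = [low, middle, tall]

# The middle frame is identical in both sequences, so it cannot tell them apart.
print("sitting down:", vertical_motion(sitting_down))
print("standing up:", vertical_motion(standing_up))
```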

Action recognition is used in surveillance, sports, rehabilitation, human-robot interaction, and more. But due to the diversity of real environments (camera angle, occlusion, individual differences), benchmark performance often does not carry over directly to field performance.

Combining Video and Language

One axis of recent trends is connecting video to language. By encoding video and combining it with a language model, multimodal models enable the interaction of asking questions about video and getting answers.

Video captioning: Describe video content in sentences.
Video question answering: Answer questions like "what did the person pick up in the video?"
Temporal grounding: Find a specific moment via language, like "the instant the ball is thrown."
Action prediction and reasoning: There are also attempts to reason what will happen next and why an action was taken. This, however, remains a very hard task.
Long-video summarization: Summarize footage of tens of minutes or more down to the essentials. Maintaining temporal context is the crux.

Such abilities let vast video be searched, summarized, and analyzed without a person watching it all. But video has many frames so computation is heavy, and maintaining the temporal context of long footage is hard. In practice, techniques like frame sampling, key-segment selection, and hierarchical summarization ease this burden.
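
Frame sampling, the simplest of these techniques, can be sketched as picking one frame from the middle of each equal-length segment. The function name and the 300-frame, 8-sample example are illustrative assumptions; key-segment selection would replace the uniform rule with a content-aware one.

```python
def uniform_sample_indices(num_frames, num_samples):
    """Pick frame indices spread evenly over a video, one per equal-length segment."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        return list(range(num_frames))  # short video: keep every frame
    segment = num_frames / num_samples
    # Take the frame at the middle of each segment.
    return [int(segment * i + segment / 2) for i in range(num_samples)]


indices = uniform_sample_indices(num_frames=300, num_samples=8)
print(indices)
```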

Methods of Temporal Modeling

The heart of video understanding is "how to put time into the model." Organizing the representative approaches:

Frame-independent + post-aggregation: Process each frame separately then gather results over time. Simple but can miss fine motion between frames.
3D convolution: Scans space and time together to directly capture local motion. Relatively weak on long-range temporal dependencies.
Recurrent structures: Update state each frame, carrying temporal information forward. Natural for streaming but hard for very long dependencies.
Temporal attention: Directly learns relationships between frames with attention. Strong on long-range dependency but heavy in compute.

   spectrum of temporal modeling approaches (concept)

  simple/light <--------------------> expressive/heavy
     |                                     |
  frame-independent+aggregate            temporal attention
  3D convolution (local)                 recurrent (streaming)

  practice: choose by the task's time scale and latency budget

No approach is unconditionally superior. Action classification of short clips and event understanding of long footage need different time scales, and real-time streaming and offline batch processing have different latency constraints. Approach choice is a function of these requirements.
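
Two ends of this spectrum can be contrasted in a few lines: averaging per-frame scores after the fact, and keeping a running state that is updated as each frame arrives. The exponential moving average here is a hand-set recurrence standing in for a learned recurrent model; the names and the momentum value are assumptions.

```python
def aggregate_mean(per_frame_scores):
    """Frame-independent + post-aggregation: average class scores over a whole clip."""
    num_frames = len(per_frame_scores)
    num_classes = len(per_frame_scores[0])
    return [sum(frame[c] for frame in per_frame_scores) / num_frames
            for c in range(num_classes)]


class StreamingEMA:
    """Recurrent-style state: an exponential moving average updated once per frame."""

    def __init__(self, momentum=0.8):
        self.momentum = momentum
        self.state = None

    def update(self, scores):
        if self.state is None:
            self.state = list(scores)  # first frame initializes the state
        else:
            m = self.momentum
            self.state = [m * s + (1 - m) * x for s, x in zip(self.state, scores)]
        return self.state


clip = [[1.0, 0.0], [0.0, 1.0]]           # per-frame scores for two classes
clip_level = aggregate_mean(clip)          # needs the whole clip first

smoother = StreamingEMA(momentum=0.5)
streamed = [list(smoother.update(frame)) for frame in clip]  # one output per frame
print(clip_level, streamed)
```

The aggregate produces one answer after the clip ends, while the streaming state produces an answer at every frame, which is the latency difference the paragraph above describes.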

The Difficulty of Data and Labeling

Video model performance depends heavily on data, yet video labeling is far more expensive than images.

Frame explosion: With dozens of frames per second, even short footage has very many frames to label.
Ambiguity of temporal labels: The boundary of "when an action starts and ends" varies by human judgment.
Labor intensity of tracking labels: Maintaining consistent IDs and masks for each object across frames is very labor-intensive.

For this reason, self-supervised and weakly-supervised learning, synthetic data, and the iterative method where the model drafts labels and humans review are widely used. This is why SAM-family promptable segmentation draws attention as a video labeling tool too. If a person gives only a few prompts, the model fills in the masks of the remaining frames, greatly cutting labeling cost.

Ultimately the data problem includes not only "how to train the model" but "how to make labels efficiently." A good video system often starts from a good labeling pipeline, and a cyclic structure appears where models (promptable segmentation and tracking) enter that pipeline itself as parts.

Real-Time Inference Optimization

Since video arrives as a continuous stream, often at tens of frames per second, throughput and latency matter as much as accuracy. Throughput is "how many frames per second are processed," and latency is "the time from a frame arriving to a result appearing." They differ: larger batches raise throughput but can increase individual-frame latency. In real-time applications, the balance between the two matters. Here are representative strategies for achieving real-time behavior.
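
The batching trade-off can be worked through with simple arithmetic. The model below assumes a fixed per-batch overhead plus a per-frame cost, and that the first frame of a batch must wait for the rest to arrive; all of the timing numbers are made up for illustration.

```python
def batch_stats(batch_size, fps_in=30.0, overhead_ms=8.0, per_frame_ms=2.0):
    """Throughput and worst-case latency for a given batch size under a linear cost model."""
    compute_ms = overhead_ms + per_frame_ms * batch_size
    # The first frame in a batch waits for the remaining frames to arrive.
    wait_ms = (batch_size - 1) * 1000.0 / fps_in
    return {
        "throughput_fps": 1000.0 * batch_size / compute_ms,
        "worst_latency_ms": wait_ms + compute_ms,
    }


single = batch_stats(1)
batched = batch_stats(8)
for name, stats in (("batch=1", single), ("batch=8", batched)):
    print(f"{name}: {stats['throughput_fps']:.0f} fps capacity, "
          f"{stats['worst_latency_ms']:.0f} ms worst-case latency")
```

Under these assumed numbers the larger batch more than triples processing capacity, but the oldest frame in each batch waits far longer for its result.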

Reducing Compute

Lightweight backbones: Lower per-frame cost with reduced-compute backbones and efficient feature fusion.
Frame sampling: Rather than processing every frame heavily, process only keyframes heavily and estimate or interpolate in between lightly.
Region-of-interest processing: Concentrate compute on regions where objects are likely, not the whole frame.
Cascade processing: Filter quickly with a light model and call a heavy model only in suspicious cases to lower average cost.
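
Cascade processing, the last item above, can be sketched as a two-stage function. The models here are hypothetical stand-ins that read precomputed scores from a dict, and the threshold and costs are assumed values.

```python
def cascade_detect(frame, light_model, heavy_model, suspicion_threshold=0.3):
    """Run the light model first; call the heavy model only if the frame looks suspicious.

    Returns (score, used_heavy_model).
    """
    light_score = light_model(frame)
    if light_score < suspicion_threshold:
        return light_score, False      # cheap rejection, heavy model never runs
    return heavy_model(frame), True    # suspicious frame gets the expensive check


def average_cost_ms(light_ms, heavy_ms, escalation_rate):
    """Expected per-frame cost: the light model always runs, the heavy one sometimes."""
    return light_ms + escalation_rate * heavy_ms


# Hypothetical stand-in models: each "frame" carries its own precomputed scores.
def light(frame):
    return frame["light_score"]


def heavy(frame):
    return frame["heavy_score"]


quiet_result = cascade_detect({"light_score": 0.05, "heavy_score": 0.02}, light, heavy)
busy_result = cascade_detect({"light_score": 0.60, "heavy_score": 0.91}, light, heavy)
print(quiet_result, busy_result)
print("average cost (ms):", average_cost_ms(light_ms=2.0, heavy_ms=40.0, escalation_rate=0.1))
```

The saving depends entirely on the escalation rate: if most frames look suspicious to the light model, the cascade costs more than running the heavy model alone.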

Representation and Precision Optimization

Quantization: Lower the precision of weights and activations to raise throughput and cut memory.
Distillation: Transfer a large model's knowledge to a small model so the small one performs better.
Pruning: Remove low-importance connections and channels to lighten the model.
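
Quantization, the first item above, is easy to show on a handful of numbers. This is a minimal sketch of symmetric per-tensor int8 quantization in plain Python; production toolchains add calibration, per-channel scales, and integer kernels, none of which appear here. The function names and sample weights are assumptions.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"scale={scale:.4f}", f"max error={max_error:.5f}")
```

Each value is stored in one byte instead of four, and the rounding error per value is bounded by half the scale, which is the accuracy cost the text refers to.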

Streaming and State Reuse

Causal processing: Produce outputs from past and present only, without waiting for future frames, reducing latency.
State reuse: Reuse a previous frame's features and memory in the next frame's computation to avoid redundant work.
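
State reuse can be sketched as a small cache that skips feature extraction when a frame barely differs from the last one that was processed. The class name, the flat-list frame format, the mean-absolute-difference test, and the threshold are all assumptions; the stand-in extractor just averages pixel values.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average absolute pixel difference between two equal-length frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


class FeatureCache:
    """Causal wrapper: uses only past frames, and recomputes features only on real change."""

    def __init__(self, extract, change_threshold=2.0):
        self.extract = extract
        self.threshold = change_threshold
        self.last_frame = None       # the last frame that was actually processed
        self.last_features = None
        self.extract_calls = 0

    def __call__(self, frame):
        unchanged = (self.last_frame is not None
                     and mean_abs_diff(frame, self.last_frame) < self.threshold)
        if unchanged:
            return self.last_features  # reuse state from an earlier frame
        self.last_frame = frame
        self.last_features = self.extract(frame)
        self.extract_calls += 1
        return self.last_features


cache = FeatureCache(extract=lambda frame: sum(frame) / len(frame))
stream = [[10, 10, 10, 10], [10, 11, 10, 10], [50, 50, 50, 50], [50, 50, 51, 50]]
outputs = [cache(frame) for frame in stream]
print(outputs, "extractor calls:", cache.extract_calls)
```

Comparing against the last processed frame, not the immediately preceding one, means slow gradual change still triggers a recomputation eventually.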

These three axes (reducing compute, representation/precision optimization, streaming/state reuse) are not mutually exclusive and are used together. For example, a system can reduce compute with a lightweight backbone, lower precision with quantization, and remove redundancy with state reuse, layering several techniques to reach real-time speed. But each technique can cost some accuracy, so how much loss to tolerate is decided by the application's requirements. Optimization is never "free"; it is always a trade for something.

     real-time pipeline trade-offs (concept)

  accuracy  <----------------------->  speed
     |                                   |
  heavy backbone                       light backbone
  process every frame                  keyframes+interpolation
  high resolution                      low resolution
  global attention                     local/factorized attention
  high precision                       quantization

  practice: combine these axes to hit target latency/accuracy

The key is that there is no single right answer. Real-time alerting on a surveillance camera and offline sports analysis demand different latency and accuracy, so their optimal combinations differ too.

Application Areas

Surveillance and security: Real-time intrusion detection, anomaly detection, counting people and vehicles. Latency must be low and false-positive management matters. The scale problem of handling many cameras at once is also large.
Sports analytics: Tracking players and the ball, tactical analysis, automatic highlight generation. Density, fast motion, and occlusion are the difficulties. Keeping IDs for players in the same uniform is especially tricky.
Autonomous driving and robotics: Tracking surrounding objects and predicting future trajectories directly affects safety. Real-time behavior and robustness are essential, and it must not break even in rare (exceptional) scenes.
Media and editing: Cutting a specific object into a mask for editing or replacing backgrounds uses promptable segmentation and tracking. Here, precision and consistency take priority over real-time.
Medical and behavior analysis: Rehabilitation motion assessment, surgical video analysis, and more need fine temporal action understanding. Accuracy and explainability matter.

Each application prioritizes latency, accuracy, and robustness differently, and that priority drives the choice of architecture and optimization. For example, autonomous driving demands both latency and robustness to the extreme, while media editing wants precise masks even at the cost of time. Even under the same name "video analysis," requirements diverge this much.

Common Problems in Practice

Running a real-time video system reveals problems that theory does not surface well.

Occlusion and reappearance: When an object disappears behind another and returns, tracking easily breaks. Memory and re-ID ease it but not completely.
Motion blur and low light: Detection and tracking quality drop sharply in fast motion or dark environments.
Camera motion: A moving camera (drone, vehicle) makes the background itself flow, complicating the motion model.
Dense and similar appearance: Many similar-looking objects (crowds, same uniforms) cause frequent ID switching.
Drift: Errors accumulate across frames and tracking can gradually go off. Periodic re-detection corrects it.

   real-time tracking failure modes (concept)

  occlusion --> track break --> attempt reconnect via re-ID
  blur ------> detection fail --> hold with frame skip/interpolation
  camera move -> motion prediction error -> background correction needed
  dense -----> ID switching --> ease with strong appearance features

These problems are less about "solving perfectly" and more about "how well you mitigate." So practical systems also carry logic to detect and recover from failure (re-detection triggers, confidence-based discarding, etc.).
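
Recovery logic of this kind often amounts to bookkeeping per track. The sketch below keeps a miss counter, drops tracks that stay unmatched too long or have low confidence, and raises a re-detection flag while any surviving track is unmatched. The `Track` fields, thresholds, and trigger rule are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Track:
    track_id: int
    confidence: float
    missed: int = 0  # consecutive frames without a matching detection


def step_tracks(tracks, matched_ids, max_missed=30, min_confidence=0.3):
    """Update miss counters for one frame and decide which tracks survive.

    Returns (surviving_tracks, need_redetection).
    """
    survivors = []
    for track in tracks:
        track.missed = 0 if track.track_id in matched_ids else track.missed + 1
        if track.missed > max_missed or track.confidence < min_confidence:
            continue  # discard: lost for too long, or never trustworthy
        survivors.append(track)
    # Re-detection trigger: some track is alive but currently unmatched.
    need_redetection = any(track.missed > 0 for track in survivors)
    return survivors, need_redetection


tracks = [Track(1, 0.9), Track(2, 0.9, missed=2), Track(3, 0.1), Track(4, 0.8)]
survivors, need_redetection = step_tracks(tracks, matched_ids={1}, max_missed=2)
print([t.track_id for t in survivors], need_redetection)
```

The `max_missed` value sets how long an occluded object is given to reappear, so it trades ID continuity against the risk of stale tracks lingering.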

The Hardware and Deployment View

Real-time is decided not by the model alone but together with the deployment environment.

Edge vs. server: Processing near the camera (edge) is favorable for latency and bandwidth but has limited compute. Sending to a server uses strong models but incurs transmission delay.
Accelerator use: Optimization for GPUs and dedicated accelerators (operation fusion, batching, precision tuning) greatly changes throughput.
Pipeline parallelism: Overlap decoding, preprocessing, inference, and post-processing as a pipeline to raise throughput.
Input stream management: When handling many cameras at once, frame-drop and priority policies are needed.

A model tuned for accuracy without regard to deployment easily becomes useless in practice, because it cannot keep up with the incoming frames. Whether an "accurate but late" result or a "less accurate but on-time" one is better is decided by the application.
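
One simple frame-drop policy is a bounded buffer that discards the oldest frame when the consumer falls behind, so inference always works on recent input. The class name and capacity are assumptions; `collections.deque(maxlen=...)` provides the eviction behavior.

```python
from collections import deque


class LatestFrameBuffer:
    """Bounded buffer that drops the oldest frame when full."""

    def __init__(self, capacity=2):
        self.frames = deque(maxlen=capacity)
        self.dropped = 0

    def push(self, frame):
        if len(self.frames) == self.frames.maxlen:
            self.dropped += 1  # deque(maxlen=...) evicts the oldest item on append
        self.frames.append(frame)

    def pop(self):
        """Return the oldest buffered frame, or None if the buffer is empty."""
        return self.frames.popleft() if self.frames else None


buffer = LatestFrameBuffer(capacity=2)
for frame_index in range(5):      # the camera delivers 5 frames before inference catches up
    buffer.push(frame_index)
first, second, third = buffer.pop(), buffer.pop(), buffer.pop()
print("processed:", first, second, "dropped:", buffer.dropped)
```

With many cameras, a priority policy could assign each stream its own buffer and capacity, giving critical streams more room.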

How to Read Benchmarks

There are points to note when reading the numbers in video analysis papers.

Look at latency together: Do not judge by "accuracy X%" alone. At what frame rate, resolution, and hardware that accuracy was achieved governs real-time behavior.
Dataset characteristics: Check how much the scenes the benchmark assumes (urban roads, indoors, sports, etc.) resemble your application.
Evaluation conditions: Difficulty differs between online (streaming) and offline (after seeing the whole video) evaluation. Offline evaluation, where the model can see future frames, is generally easier.

Comparing only the table numbers while ignoring this context easily leads to wrong conclusions. Always ask "under what conditions was this number measured."

In particular, the term "real-time" itself is relative. Five frames per second is enough for some applications, while others need 30 or more. And even at the same frame rate, if processing latency accumulates, results lag behind the actual situation. So when a paper claims "real-time," check whether its criterion matches your application's.
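
The accumulation effect follows from simple arithmetic. The sketch below assumes frames arrive at a fixed rate and are processed strictly in order with no dropping; the function names and the sample timings are illustrative.

```python
def frame_budget_ms(fps):
    """Time available per frame if processing must keep pace with the stream."""
    return 1000.0 / fps


def lag_after(num_frames, fps, processing_ms):
    """Delay between the arrival of the last frame and its result.

    Assumes sequential processing of every frame, starting with an idle system.
    """
    overrun = processing_ms - frame_budget_ms(fps)
    if overrun <= 0:
        return processing_ms  # the system keeps up, so lag never grows
    # Each frame adds the overrun on top of the previous backlog.
    return processing_ms + overrun * (num_frames - 1)


print("budget at 25 fps:", frame_budget_ms(25), "ms")
print("30 ms per frame, after 100 frames:", lag_after(100, 25, 30.0), "ms")
print("50 ms per frame, after 101 frames:", lag_after(101, 25, 50.0), "ms")
```

A model that overruns the budget by only 10 ms per frame is more than a second behind after about four seconds of video, which is why frame dropping or lighter processing becomes necessary.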

Strengths and Limits Summary

Approach	Strength	Limit
3D convolution	Strong at local motion capture	Relatively weak on long-range dependencies
Spatio-temporal transformer	Global relations, long-range dependency	Heavy compute (trade-offs needed)
Detection-based MOT	Modular, flexible	Vulnerable to ID switching, dense scenes
Promptable segmentation/tracking	Mask-level unified tracking	Real-time and long-video stability are hard
Lightening/quantization	Secures real-time	Possible accuracy loss

Conclusion

Real-time video analysis is not explained by "accuracy" alone. As the temporal axis is added, tracking consistency, occlusion handling, and above all latency come into play. Architecture gains expressiveness through spatio-temporal attention, optimization earns speed through lightening and streaming, and practical systems are built in the trade-off between the two.

The mindset SAM 2 showed, "specify by prompt and propagate across frames," blurs the boundary between segmentation and tracking and is reorganizing the building blocks of video understanding. Even as new models keep appearing, understanding the structure of the tasks and the principles of the trade-offs lets you follow the change far more easily.

The advice I want to leave for practitioners is this. When designing a video system, rather than first hunting for "the most accurate model," first decide "how much latency we must tolerate and how much failure (occlusion, blur, ID switching) we can bear." Once that constraint is set, combining the temporal modeling approach, optimization techniques, and deployment location on top of it becomes far clearer. Most of the difficulty of video understanding comes from constraints of time and resources, and facing those constraints honestly is the starting point of a good system.

References

Segment Anything paper (image SAM): arxiv.org/abs/2304.02643
Segment Anything official page: segment-anything.com
Attention Is All You Need (transformer): arxiv.org/abs/1706.03762
ViT (vision transformer) paper: arxiv.org/abs/2010.11929
DETR (set-prediction detection) paper: arxiv.org/abs/2005.12872
I3D (video 3D convolution) paper: arxiv.org/abs/1705.07750
SORT (online multi-object tracking) paper: arxiv.org/abs/1602.00763
Segment Anything code: github.com/facebookresearch/segment-anything

Quiz

Q1: What is the difference between action recognition and temporal action localization?

Action recognition classifies "what action" in a clip or stream. Temporal localization additionally finds "from when to when" that action happened as intervals in a long video.

Q2: Why does compute become a problem in spatio-temporal attention?

Because the token count is the number of frames multiplied by the patches per frame, and the cost of full attention grows with the square of the token count. To ease it, one factorizes spatial and temporal attention, reduces tokens, or uses local attention.

Q3: What core concept does the SAM 2 family add for video?

Propagating the mask of a target specified by a prompt in one frame to subsequent frames, and using a memory structure to reconnect the same target across occlusion and reappearance. It unifies segmentation and tracking at the mask level.

Q4: What is the basic flow of tracking-by-detection?

Get boxes with a detector each frame, then associate with previous tracks via position prediction (motion model) and appearance matching (re-ID), keeping a consistent ID for the same object.

Q5: What is ID switching in MOT?

A problem where the ID of the original track swaps to a different object during occlusion or object crossing. It is especially common in dense scenes and among objects with similar appearances.

Q6: Name two ways to reduce compute for real-time behavior.

Any two of the following: using lightweight backbones, frame sampling that processes only keyframes heavily and interpolates in-between frames lightly, concentrating compute on regions of interest, or cascade processing that calls a heavy model only in suspicious cases.

Q7: Why does causal processing reduce latency?

Because it produces outputs from past and present information without waiting for future frames. It is advantageous when immediate responses are needed in a streaming setting.

Q8: Why is it hard to assert an "optimal combination" in video analysis?

Because each application prioritizes latency, accuracy, and robustness differently. Surveillance alerting and offline sports analysis demand different trade-offs, so the optimal settings differ.

Q9: What is the difference between tracking-by-detection and joint detection and tracking?

Tracking-by-detection detects per frame then associates separately (easy to modularize, detector-dependent). Joint detection and tracking trains both as one model, binding the two processes via cross-frame feature propagation or track queries.

Q10: Why do motion cues matter in action recognition?

Because many actions like "sitting down" and "standing up" are hard to distinguish from one frame. Using a motion representation like optical flow captures change over time and distinguishes actions better.

Q11: Why is an "accurate but late" result a problem in real-time video?

Because frames keep streaming; if processing cannot keep up with the frame rate, the lagged result points to a situation already past. For immediacy-critical applications like surveillance and autonomous driving, an on-time approximation can be more useful.

Q12: What is the trade-off between edge and server processing?

Edge (near the camera) is favorable for transmission latency and bandwidth but has limited compute. A server can use strong models but incurs transmission delay. The choice splits by the application's latency requirements and resource constraints.