SOTA Autonomous Driving Perception — BEV, Occupancy, End-to-End

Introduction
The Autonomous Driving Stack
The Big Picture: Evolution of Perception Representations
The Fundamental Perception Problem
- From 2D Images to a 3D World
BEV Perception
- What BEV Is
- From Images to BEV
- Using Temporal Information
Occupancy Networks
- Boxes Are Not Enough
- 3D Occupancy Representation
Vision-Centric vs LiDAR Fusion
- LiDAR Fusion Approach
- Vision-Centric Approach
Division of Sensor Roles
End-to-End Learning
- From Pipeline to Unified Learning
- Trade-offs of End-to-End
World Models and Simulation
- Why Simulation
- World Models
3D Reconstruction and Simulation Tech
Safety and the Long-Tail Problem
Sensor Calibration and Coordinate Frames
What Perception Output Contains
Multi-Object Tracking
Continuing to Prediction and Planning
Datasets and Benchmarks
Comparison: Perception Representations
Limitations and Caveats
Closing
References

Introduction

Autonomous driving is one of the most challenging physical-world problems AI faces. Dozens of times per second, the system must understand the scene out to hundreds of meters around the vehicle, predict how other cars and pedestrians will move, plan a safe route, and actually move the vehicle. A single mistake can be life-threatening, so the bar for accuracy and safety is extremely high.

This article examines the latest architectures of autonomous driving systems, especially the perception stage that understands the world. We cover BEV (Bird's-Eye-View) perception, occupancy networks, the vision-centric vs LiDAR-fusion debate, end-to-end learning, and world models. Because the details of commercial systems are often undisclosed and change fast, we describe things carefully around architectural principles and publicly known concepts rather than asserting any company's exact specs.

The Autonomous Driving Stack

Autonomous driving software is usually understood as a multi-stage pipeline.

[sensors]  camera / LiDAR / radar / GPS / IMU
   |
   v
[perception]  detect surrounding objects, lanes, 3D structure
   |
   v
[prediction]  predict future motion of cars/pedestrians
   |
   v
[planning]  decide a safe, efficient route/behavior
   |
   v
[control]  generate steering/throttle/brake commands
   |
   v
[vehicle actuation]

This article focuses on perception. If perception is wrong, downstream prediction, planning, and control all go astray, so perception is the foundation of overall safety. That said, an end-to-end approach unifying these stages is also rising, which we address later.

For reference, autonomous driving's "levels of autonomy" are commonly split into several levels (SAE J3016, for example, defines Levels 0 through 5): from the driver doing everything, to the system driving under specific conditions while a human stays ready, to needing almost no human intervention within those conditions. As the level rises, demands on the accuracy of perception and judgment climb sharply. This article focuses on the principles of perception technology common to any level rather than asserting a specific product's level.

The Big Picture: Evolution of Perception Representations

Before the details, here is how perception representations evolved at a glance.

[evolution of perception representations]
 2D detection (boxes on the image)
        ->
 3D bounding boxes (per-object 3D boxes)
        ->
 BEV (unified top-down map)
        ->
 Occupancy (3D occupancy including height)
        ->
 End-to-end (unified perception-prediction-planning learning)

Two directions define this trend. First, representations became increasingly 3D and category-agnostic. Second, the field shifted from optimizing perception alone to co-optimizing the whole driving objective. Below we examine each stage in turn.

The Fundamental Perception Problem

From 2D Images to a 3D World

A camera captures a 2D projection of the 3D world. But driving needs 3D spatial information like "that car is 20 meters ahead and 3 meters to the right." So one core task of perception is reconstructing 3D space from 2D images.

[multiple camera images (2D)]
 front, rear, left, right, etc.
        |
        |  each with a different viewpoint and distortion
        v
[unified 3D spatial representation]
 object positions in one consistent coordinate frame

Because several cameras look in different directions, merging them into one consistent coordinate frame is important. The BEV representation, covered next, is designed for exactly this merging.
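
To make the geometry concrete, here is a minimal sketch of the pinhole camera model, which relates a pixel to a 3D point once a depth is known or estimated. The function names and the intrinsic values (FX, FY, CX, CY) are illustrative assumptions, not taken from any real camera, and lens distortion is ignored.

```python
from typing import Tuple

Point3D = Tuple[float, float, float]


def backproject_pixel(u: float, v: float, depth_m: float,
                      fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Lift pixel (u, v) with a given depth to a camera-frame 3D point.

    Camera frame convention assumed here: x right, y down, z forward.
    """
    x = (u - cx) / fx * depth_m
    y = (v - cy) / fy * depth_m
    return (x, y, depth_m)


def project_point(point: Point3D,
                  fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Project a camera-frame 3D point back to pixel coordinates."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical intrinsics for a 1280x720 image.
FX, FY, CX, CY = 1000.0, 1000.0, 640.0, 360.0

# A pixel 150 columns right of the image center, assumed to be 20 m deep.
car = backproject_pixel(790.0, 360.0, 20.0, FX, FY, CX, CY)
print(car)
```

With these assumed values the pixel lands about 3 m to the right and 20 m ahead, matching the example at the start of this section. The depth is the missing piece: an image alone fixes only the ray through the pixel, which is why camera-based systems have to estimate depth.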

BEV Perception

What BEV Is

BEV stands for Bird's-Eye-View, a top-down view as seen from above. In autonomous driving, BEV perception transforms multiple camera images into a single top-down map.

[multi-direction camera images]        [BEV representation (top-down map)]

  front cam ^                                 N
  left ->  car  <- right         ->       W [ego] E
  rear cam v                                  S
                                surrounding cars/lanes on a grid map

BEV's advantages are clear. Planning and control are ultimately "where to go on a map," so a top-down map connects naturally to the next stages. It also merges multiple cameras into one frame, handling objects consistently without duplication.

From Images to BEV

The key technique moves 2D features from several cameras onto a BEV grid. Two main directions are known.

[Direction A: forward projection (Lift-Splat family)]
 predict a depth distribution per pixel, lift to 3D,
 and splat onto the BEV grid

[Direction B: backward query (attention/transformer family)]
 each BEV grid cell queries "which camera pixels should I look at"
 via attention to fetch features

Direction A is known as the LSS (Lift, Splat, Shoot) family; Direction B is like BEVFormer, where BEV queries reference image features via transformer attention. Both aim to fuse multiple cameras into one BEV.
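
As a rough illustration of Direction A, the sketch below lifts each pixel along its viewing ray at several candidate depths and splats the depth-weighted feature onto a BEV grid. It is a toy under stated assumptions: features are single numbers instead of learned vectors, the per-pixel ray directions are assumed to come from calibration, and all names (lift_splat, Pixel, Ray) are hypothetical rather than taken from the LSS code.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Cell = Tuple[int, int]
# (ray_forward, ray_left): ground-plane offset per meter of depth,
# assumed to be precomputed from calibration for each pixel.
Ray = Tuple[float, float]
Pixel = Tuple[Ray, float, Sequence[float]]  # ray, feature, depth probabilities


def lift_splat(pixels: List[Pixel], depth_bins_m: Sequence[float],
               cell_size_m: float) -> Dict[Cell, float]:
    """Accumulate depth-weighted pixel features into BEV cells."""
    bev: Dict[Cell, float] = defaultdict(float)
    for (ray_fwd, ray_left), feature, depth_probs in pixels:
        # Lift: place the feature at every candidate depth along the ray,
        # weighted by the predicted probability of that depth.
        for depth, prob in zip(depth_bins_m, depth_probs):
            x, y = ray_fwd * depth, ray_left * depth
            # Splat: sum-pool the weighted feature into its BEV cell.
            cell = (math.floor(x / cell_size_m), math.floor(y / cell_size_m))
            bev[cell] += prob * feature
    return dict(bev)


bins = [5.0, 10.0, 15.0]
pixels = [
    ((1.0, 0.0), 2.0, [0.1, 0.8, 0.1]),   # straight ahead, most likely near 10 m
    ((1.0, 0.25), 1.0, [0.7, 0.2, 0.1]),  # slightly left, most likely near 5 m
]
bev = lift_splat(pixels, bins, cell_size_m=1.0)
for cell, value in sorted(bev.items()):
    print(cell, round(value, 3))
```

Because depth is a distribution rather than a single guess, an uncertain pixel spreads its feature over several cells along the ray instead of committing to one location. Direction B inverts the loop: it starts from each BEV cell and asks which pixels to read.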

Using Temporal Information

BEV perception fuses not only multiple cameras at one instant but also information from past frames to improve performance. Temporal context matters for handling object motion (speed) or briefly occluded objects. Aligning and stacking BEV features across time helps distinguish static from moving objects and estimate velocity.
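
The alignment step can be sketched for a single point: before comparing two time steps, the previous position is re-expressed in the current ego frame by undoing the ego vehicle's own motion. This is a simplified planar sketch; the function names are hypothetical, and real systems warp whole BEV feature maps rather than individual points.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (x forward, y left) in meters


def warp_to_current(p_prev: Point, ego_translation: Point, ego_yaw_rad: float) -> Point:
    """Express a point from the previous ego frame in the current ego frame.

    ego_translation and ego_yaw_rad describe how the ego vehicle moved
    between the two frames, measured in the previous ego frame.
    """
    dx = p_prev[0] - ego_translation[0]
    dy = p_prev[1] - ego_translation[1]
    c, s = math.cos(ego_yaw_rad), math.sin(ego_yaw_rad)
    # Rotate by -yaw to undo the ego rotation.
    return (c * dx + s * dy, -s * dx + c * dy)


def estimate_velocity(p_prev: Point, p_now: Point, ego_translation: Point,
                      ego_yaw_rad: float, dt_s: float) -> Point:
    """Object velocity over the ground, expressed in current ego axes."""
    aligned = warp_to_current(p_prev, ego_translation, ego_yaw_rad)
    return ((p_now[0] - aligned[0]) / dt_s, (p_now[1] - aligned[1]) / dt_s)


# Ego drives 1 m forward in 0.5 s without turning.
parked = estimate_velocity((10.0, 0.0), (9.0, 0.0), (1.0, 0.0), 0.0, 0.5)
moving = estimate_velocity((20.0, 0.0), (21.0, 0.0), (1.0, 0.0), 0.0, 0.5)
print(parked, moving)
```

After alignment, the parked car shows zero velocity even though it moved in the raw ego-relative coordinates, while the car ahead shows its own forward motion. That difference is what separates static from moving objects.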

Occupancy Networks

Boxes Are Not Enough

Traditional 3D detection represents objects as 3D bounding boxes: "a car here, a pedestrian there." But roads have many objects that fit no predefined category. Debris on the road, oddly shaped construction vehicles, and protruding branches are all hard to express as "car" or "pedestrian" boxes.

[bounding box approach]
 detect only predefined categories (car, person) as boxes
 -> may miss out-of-category objects or odd shapes

[occupancy approach]
 divide space into a 3D grid (voxels) and
 predict "occupied / empty" per cell
 -> understand occupancy regardless of category

3D Occupancy Representation

An occupancy network divides surrounding space into small 3D grid cells (voxels) and predicts whether each is occupied (something is there) or not. It may also predict what kind of thing occupies a cell (vehicle, road, building). This way, even objects outside predefined categories register as "space is blocked there," which helps safety.

Occupancy can be seen as an extension of BEV. If BEV is a top-down 2D map, occupancy is a 3D map that also includes height. This representation has gained attention in recent perception research, and similar concepts are reportedly used in commercial systems.
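
The representation itself (not the network that predicts it) can be sketched as a sparse voxel grid. In this hedged example the input points stand in for whatever a model or sensor reports as occupied; build_occupancy, is_blocked, and the "unknown" label are illustrative names, not part of any specific system.

```python
import math
from typing import Dict, Iterable, Tuple

Voxel = Tuple[int, int, int]
LabeledPoint = Tuple[float, float, float, str]  # x, y, z in meters, plus a class label


def to_voxel(x: float, y: float, z: float, voxel_size_m: float) -> Voxel:
    """Index of the voxel that contains the point (x, y, z)."""
    return (math.floor(x / voxel_size_m),
            math.floor(y / voxel_size_m),
            math.floor(z / voxel_size_m))


def build_occupancy(points: Iterable[LabeledPoint], voxel_size_m: float) -> Dict[Voxel, str]:
    """Map each occupied voxel index to a label; absent voxels count as empty."""
    grid: Dict[Voxel, str] = {}
    for x, y, z, label in points:
        grid[to_voxel(x, y, z, voxel_size_m)] = label  # last label wins in this sketch
    return grid


def is_blocked(grid: Dict[Voxel, str], x: float, y: float, z: float,
               voxel_size_m: float) -> bool:
    """True if the voxel containing (x, y, z) is occupied by anything at all."""
    return to_voxel(x, y, z, voxel_size_m) in grid


grid = build_occupancy(
    [(1.2, 0.1, 0.3, "vehicle"),
     (1.4, 0.2, 0.4, "vehicle"),
     (5.0, -2.1, 1.0, "unknown")],  # e.g. debris that fits no category
    voxel_size_m=0.5,
)
print(grid)
print(is_blocked(grid, 5.1, -2.2, 1.2, 0.5))
```

The "unknown" voxel blocks space just like the vehicle voxels do, which is the point of a category-agnostic representation: the planner can avoid it without knowing what it is.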

Vision-Centric vs LiDAR Fusion

A long-standing debate in perception is sensor configuration, with two broad camps.

LiDAR Fusion Approach

LiDAR measures distance directly with lasers to obtain precise 3D point clouds. Many companies use multi-sensor fusion combining LiDAR, cameras, and radar.

[LiDAR fusion approach]
 LiDAR (precise range) + camera (color/texture) + radar (weather/speed)
 combine sensor strengths for robust perception
 - pros: accurate range, strong in the dark
 - cons: LiDAR cost, complex sensor calibration

Vision-Centric Approach

Conversely, a vision-centric approach performs perception mainly with cameras. Tesla is well known for publicly pursuing a camera-centric approach. However, each company's exact sensor configuration and algorithm details change over time and are disclosed only partially, so here we cover only the existence of vision-centric approaches and their conceptual trade-offs.

[vision-centric approach]
 perform 3D perception mainly with cameras
 - pros: lower sensor cost, relies on vision as human drivers do
 - cons: cannot measure depth directly, relies on estimation
        (reconstructs 3D via BEV/occupancy above)

In vision-centric approaches, depth and 3D structure must be estimated from camera images by a neural network. This makes BEV transformation and occupancy prediction especially important. Which side is superior is hard to assert; it is fairer to see it as different judgments about balancing cost, safety, and scalability.

Division of Sensor Roles

The main sensors used in autonomous driving each excel at different things. Using several together lets them cover each other's weaknesses.

[characteristics of main sensors]
 camera : strong at color/texture/sign recognition, range is indirect
          weak in darkness/backlight/bad weather
 LiDAR  : strong at precise 3D range measurement
          affected by rain/snow/fog, high cost
 radar  : strong at speed (Doppler) measurement and bad weather
          low resolution, weak at shape understanding

For example, at night cameras are weak but LiDAR and radar compensate; in a blizzard LiDAR is weak but radar holds up relatively well. The vision-centric approach focuses on the camera and instead fills range information via the neural-network-based 3D reconstruction (BEV, occupancy) seen earlier. Which combination is best depends on cost, safety goals, and operating environment.

End-to-End Learning

From Pipeline to Unified Learning

We introduced the perception-prediction-planning-control pipeline. It is easy to understand because each stage can be developed and validated separately, but information is lost between stages, and each part's objective can diverge from final driving quality.

So end-to-end approaches, learning from sensor input to driving action in one neural network, have gained attention.

[traditional modular]
 sensors -> [perception] -> [prediction] -> [planning] -> [control] -> action
            each module developed/validated separately

[end-to-end]
 sensors -> [one large neural network] -> action
            intermediate representations formed by learning
            (perception/prediction/planning connected differentiably)

Trade-offs of End-to-End

The appeal of end-to-end is that everything is optimized together toward the final goal (safe, comfortable driving). It reduces information loss between stages and learns behavior from data without hand-crafted rules. Recent research proposes connecting perception, prediction, and planning in one differentiable structure while keeping interpretable intermediate representations (e.g., BEV, occupancy) for transparency and performance together.

However, end-to-end is hard to interpret and validate, and it is hard to explain why the system acted a certain way in rare hazardous situations. In a safety-critical field, this interpretability challenge is very important.

For example, in a modular system it is easy to pinpoint which stage failed, as in "perception missed a pedestrian and caused a crash." In pure end-to-end, it is hard to look inside the network to see why it decided as it did. So in practice, rather than a fully monolithic design, a compromise that keeps interpretable intermediate representations for perception and planning while training the whole together tends to be preferred.

World Models and Simulation

Why Simulation

Training and validating autonomous driving only on real roads would require enormous mileage, and dangerous situations cannot be created deliberately. So simulation is essential. In virtual environments, diverse situations (bad weather, sudden cut-ins, rare accidents) can be generated safely and repeatedly for training and validation.

World Models

A step further is the world model, a learned model that predicts "how will the world change if I act this way." In other words, it is a generative model that predicts future sensor observations or scenes, and it can be used to plan inside the predicted future or to augment data.

[role of a world model]
 current state + assumed action
        |
        v
[world model]  predicts future scenes/observations
        |
        v
"if I take this action, this situation arises" simulated ahead
 -> used for plan review, generating rare-situation data

World models have recently become a hot topic not only in driving but across physical-world AI, including robotics. But their predictions are not always accurate, so handling the gap between predicted and real outcomes (the sim-to-real gap) remains a challenge.
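
The "plan review" use can be sketched as rolling candidate actions through a model and keeping only those whose predicted future stays safe. In the toy below, a few lines of kinematics stand in for the learned world model, which is a loud simplification: a real world model would predict scenes or sensor observations. All names and numbers are hypothetical.

```python
from typing import Callable, List, Tuple

# (gap to lead car in m, ego speed in m/s, lead speed in m/s)
State = Tuple[float, float, float]
Model = Callable[[State, float, float], State]


def toy_world_model(state: State, accel: float, dt: float) -> State:
    """Stand-in for a learned model: predicts the next state for an action."""
    gap, ego_speed, lead_speed = state
    new_speed = max(0.0, ego_speed + accel * dt)
    new_gap = gap + (lead_speed - new_speed) * dt
    return (new_gap, new_speed, lead_speed)


def rollout_min_gap(state: State, accel: float, steps: int, dt: float,
                    model: Model) -> float:
    """Smallest gap seen while imagining `steps` steps of a constant action."""
    min_gap = state[0]
    for _ in range(steps):
        state = model(state, accel, dt)
        min_gap = min(min_gap, state[0])
    return min_gap


def choose_action(state: State, candidates: List[float], steps: int, dt: float,
                  safe_gap_m: float, model: Model) -> float:
    """Pick the largest acceleration whose imagined future keeps a safe gap."""
    safe = [a for a in candidates
            if rollout_min_gap(state, a, steps, dt, model) >= safe_gap_m]
    # If nothing is predicted to be safe, fall back to the hardest braking.
    return max(safe) if safe else min(candidates)


candidates = [-3.0, 0.0, 1.0]
close = choose_action((20.0, 15.0, 10.0), candidates, 10, 0.5, 5.0, toy_world_model)
far = choose_action((100.0, 15.0, 10.0), candidates, 10, 0.5, 5.0, toy_world_model)
print(close, far)
```

With a slower car 20 m ahead, only braking keeps the imagined gap above 5 m; with 100 m of room, accelerating is still predicted to be safe. The quality of such a decision is bounded by the quality of the model's predictions.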

3D Reconstruction and Simulation Tech

Making simulation more realistic requires reconstructing real roads in precise 3D. Recently, 3D reconstruction techniques like 3D Gaussian Splatting and Neural Radiance Fields (NeRF) have drawn attention here.

[uses of 3D reconstruction]
 real driving footage (multiple viewpoints)
        |  3D reconstruction (Gaussian Splatting/NeRF family)
        v
 realistic 3D scene
        |  render from new viewpoints/conditions
        v
 generate simulation data (different angles, different weather)

Reconstructing a real scene in 3D lets you create new situations by moving the camera or adding objects within it. Combined with the world models and simulation seen earlier, this can augment rare-situation data. But a reconstructed scene cannot perfectly capture real physics (reflections, shadows, materials), so the gap with reality still needs care.

Safety and the Long-Tail Problem

The hardest part of driving is not common situations but rare ones, the long-tail problem.

[situation frequency]
  many |#########  everyday driving (straight, stop, lane change)
       |####
       |##
  few  |#         rare situations (unusual road objects, sudden accidents,
       |          strange weather, unpredicted pedestrian behavior)
       +---------------------------------------------
             common situations       rare situations (long-tail)

Everyday driving is well learned thanks to abundant data, but rarely occurring hazardous situations have little data and are hard to learn. Yet safety is decided precisely in these rare situations. Category-agnostic representations like occupancy, rare-situation generation via world models, and large-scale data collection are all efforts to ease the long-tail problem.

For safety, beyond perception performance, system-level design is also needed: acting conservatively under uncertainty, cross-checking multiple sensors, and leaving room for human intervention.

Sensor Calibration and Coordinate Frames

To merge multiple sensors into one 3D world, you must know exactly where and how each sensor is mounted. This is handled by sensor calibration and coordinate frames.

[hierarchy of coordinate frames]
 camera frame (per camera)
        |  extrinsic parameters (position/orientation)
        v
 vehicle frame (ego-relative)
        |  localization
        v
 world frame (map-relative)

Each sensor needs two kinds of calibration. Intrinsic parameters are the camera's own properties like focal length and distortion; extrinsic parameters are where and in what orientation the sensor is mounted on the vehicle. If this is inaccurate, objects appear misaligned when merging multiple cameras into BEV. So accurate calibration is a hidden foundation of perception quality.

Localization, knowing where the ego vehicle is on the map right now, also matters. GPS alone lacks precision, so the position is estimated precisely by matching maps against sensor observations or by fusing multiple sensors.
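
The frame hierarchy above amounts to chaining rigid transforms. The sketch below uses a planar simplification (position plus yaw) to keep it short; real extrinsics are full 3D rotations and translations. Pose2D and the mounting values are assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]  # (x, y) in meters; x forward, y left for sensor and vehicle frames


@dataclass(frozen=True)
class Pose2D:
    """Position and heading of a child frame expressed in its parent frame."""
    x: float
    y: float
    yaw_rad: float

    def to_parent(self, point: Point) -> Point:
        """Rotate by yaw, then translate: child-frame point -> parent frame."""
        c, s = math.cos(self.yaw_rad), math.sin(self.yaw_rad)
        px, py = point
        return (self.x + c * px - s * py, self.y + s * px + c * py)


# Hypothetical extrinsics: front camera mounted 1.5 m ahead of the vehicle origin.
camera_in_vehicle = Pose2D(1.5, 0.0, 0.0)
# Hypothetical localization result: vehicle at (100, 50) on the map, heading +90 degrees.
vehicle_in_world = Pose2D(100.0, 50.0, math.pi / 2)

seen_by_camera = (10.0, 0.0)  # 10 m straight ahead of the camera
in_vehicle = camera_in_vehicle.to_parent(seen_by_camera)
in_world = vehicle_in_world.to_parent(in_vehicle)
print(in_vehicle, in_world)

# Effect of a 1-degree yaw error in the extrinsics on a point 50 m ahead.
bad_camera = Pose2D(1.5, 0.0, math.radians(1.0))
lateral_error_m = bad_camera.to_parent((50.0, 0.0))[1]
print(round(lateral_error_m, 3))
```

The last lines show why calibration matters: a 1-degree yaw error shifts a point 50 m ahead sideways by roughly 0.87 m, a sizable fraction of a lane width.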

What Perception Output Contains

The information perception passes to the next stage is richer than a simple "object list." To summarize:

[what perception output contains]
 - dynamic objects: position/size/orientation/speed of cars, pedestrians, bikes
 - static structure: lanes, stop lines, crosswalks, curbs
 - signals/signs: traffic light state, speed limit signs
 - free space: drivable area (can be expressed via occupancy)
 - uncertainty: confidence for each piece of information

The last item, uncertainty, matters especially. If perception honestly says "there is something there, but I am only 60 percent sure," planning can respond conservatively. Conversely, if perception is overconfident without basis, it leads to dangerous decisions. So modern perception systems are designed to output not just results but also the confidence in those results.
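
One way to picture how confidence flows downstream is a record per detection plus a rule that slows the vehicle in proportion to how sure perception is. The rule below is a deliberately simple, hypothetical policy for illustration; it is not how any particular planner works, and the field names and thresholds are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DetectedObject:
    label: str
    position_m: Tuple[float, float]  # (x forward, y left)
    speed_mps: float
    confidence: float                # 0.0 (no belief) to 1.0 (certain)


def target_speed(cruise_mps: float, in_path: List[DetectedObject],
                 ignore_below: float = 0.2) -> float:
    """Scale cruise speed down by the most confident in-path detection.

    Detections under `ignore_below` are treated as noise in this sketch.
    """
    risk = max((o.confidence for o in in_path if o.confidence >= ignore_below),
               default=0.0)
    return cruise_mps * (1.0 - risk)


maybe_something = DetectedObject("unknown", (30.0, 0.0), 0.0, confidence=0.6)
print(target_speed(15.0, [maybe_something]))  # slows down, does not stop
print(target_speed(15.0, []))                 # nothing in path: keep cruising
```

A "60 percent sure" detection produces a partial slowdown rather than either ignoring the object or braking fully. An overconfident or underconfident perception output would distort exactly this kind of decision.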

Multi-Object Tracking

Perception does not end at detecting objects at one instant. It also needs multi-object tracking, which answers "is that car the same one as before" across frames. Only with tracking can you know an object's speed and heading, which makes prediction possible.

[detection and tracking]
 frame t   : detect objects A, B, C
 frame t+1 : detect objects A', B', C'
        |  link the same objects (data association)
        v
 trajectories: A stays A, speed/heading estimable

The core challenge of tracking is data association, correctly matching objects from the previous frame with the current one. Matching gets hard when objects are briefly occluded or pass close to each other. The BEV representation seen earlier places observations from multiple time steps in one coordinate frame, which also helps tracking.
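
A minimal sketch of data association is greedy nearest-neighbor matching with a distance gate, shown below. It is one simple option among several; trackers in practice often predict each track forward with a motion model and solve the assignment globally (for example with the Hungarian algorithm). The function name and the gate value are assumptions.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def associate(tracks: Dict[int, Point], detections: List[Point],
              max_dist_m: float) -> Dict[int, int]:
    """Greedily match track ids to detection indices, closest pairs first."""
    pairs = sorted(
        (math.hypot(t[0] - d[0], t[1] - d[1]), track_id, det_idx)
        for track_id, t in tracks.items()
        for det_idx, d in enumerate(detections)
    )
    matches: Dict[int, int] = {}
    used_detections = set()
    for dist, track_id, det_idx in pairs:
        if dist > max_dist_m:
            break  # every remaining pair is even farther apart
        if track_id in matches or det_idx in used_detections:
            continue
        matches[track_id] = det_idx
        used_detections.add(det_idx)
    return matches


tracks = {1: (0.0, 0.0), 2: (10.0, 0.0)}           # positions at frame t
detections = [(10.5, 0.2), (0.4, 0.1), (30.0, 30.0)]  # positions at frame t+1
matches = associate(tracks, detections, max_dist_m=2.0)
print(matches)
```

Track 1 links to the detection near the origin and track 2 to the one near (10, 0). The far-away third detection stays unmatched and would typically start a new track.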

Continuing to Prediction and Planning

If perception is "what is where now," prediction is "how will it move next." Predicting the future trajectories of other cars and pedestrians is very hard, because human intent is uncertain and several possibilities coexist.

[uncertainty in prediction]
 the car ahead approaches an intersection

 possibility 1: go straight (high probability)
 possibility 2: turn right (blinker on)
 possibility 3: stop (waiting at signal)

 -> not asserting one; express multiple scenarios with probabilities

So modern prediction models output not one future but several possible futures with probabilities. Planning then considers these possibilities and picks a route with a safety margin. Perception-prediction-planning chain together this way, and since earlier-stage uncertainty propagates downstream, each stage honestly expressing its own uncertainty matters.
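
The idea of carrying several futures with probabilities can be sketched as follows: each hypothesis has a probability and a predicted path, and a candidate ego plan is scored by the total probability of the hypotheses it conflicts with. The class, the paths, and the safety radius are hypothetical values chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Hypothesis:
    name: str
    probability: float
    path: List[Point]  # predicted positions at successive time steps


def collision_probability(ego_path: List[Point], hypotheses: List[Hypothesis],
                          safety_radius_m: float) -> float:
    """Sum the probabilities of hypotheses that come too close to the ego path."""
    total = 0.0
    for h in hypotheses:
        conflict = any(
            math.hypot(e[0] - p[0], e[1] - p[1]) < safety_radius_m
            for e, p in zip(ego_path, h.path)  # same index means same time step
        )
        if conflict:
            total += h.probability
    return total


ego_plan = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0), (15.0, 0.0)]
other_car = [
    Hypothesis("go straight", 0.7, [(10.0, 10.0), (10.0, 5.0), (10.0, 0.0), (10.0, -5.0)]),
    Hypothesis("turn right", 0.2, [(10.0, 10.0), (10.0, 6.0), (13.0, 4.0), (17.0, 4.0)]),
    Hypothesis("stop", 0.1, [(10.0, 10.0), (10.0, 8.0), (10.0, 8.0), (10.0, 8.0)]),
]
risk = collision_probability(ego_plan, other_car, safety_radius_m=2.0)
print(round(risk, 3))
```

Here only the "go straight" future conflicts with the ego plan, so the plan carries that hypothesis's probability as its risk. A planner could compare this number across candidate plans and prefer one that avoids the likely conflict.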

Datasets and Benchmarks

Perception research has advanced on public datasets and benchmarks. Representative examples are listed below.

[representative public datasets (concept)]
 - nuScenes  : multi-camera + LiDAR + radar, 3D detection/tracking
 - Waymo Open: large-scale multi-sensor driving data
 - KITTI     : an early representative driving benchmark

Such datasets record the same scene with multiple sensors and provide human-labeled 3D positions and classes of objects. Researchers can fairly compare methods on the same data. But public datasets can skew toward certain regions and conditions, so working well here does not mean safety on all roads. The long-tail problem seen earlier applies here too.

There are metrics that quantify perception performance. In 3D object detection, a prediction is matched to a ground-truth box by how much the two overlap (intersection over union) or by how close their centers are. Precision and recall follow from those matches, and mean Average Precision (mAP) is often used as a summary across classes.

[concept of 3D detection evaluation]
 predicted box vs ground-truth box
   - if position/size/orientation match enough -> true positive (TP)
   - otherwise -> false positive (FP) or false negative (FN)
        |
        v
 precision-recall curve -> mean Average Precision (mAP)
 (composite metrics also consider distance error, velocity error, etc.)

Such metrics are useful for comparing methods, but benchmark scores do not always match real-road safety. Rare hazardous situations are underrepresented in datasets and barely surface in metrics. So treat metrics as a reference and handle safety validation through separate, rigorous procedures.
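
For readers who want to see the mechanics, here is a minimal sketch of Average Precision for one class. It matches predictions to ground truth by center distance (the style of matching the nuScenes detection benchmark uses; overlap-based matching is another common choice) and uses the plain, non-interpolated AP definition. Official benchmarks apply their own interpolation, thresholds, and averaging, so treat this as an illustration rather than a reference implementation.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Detection = Tuple[float, Point]  # (confidence score, predicted center)


def average_precision(detections: List[Detection], ground_truth: List[Point],
                      match_dist_m: float) -> float:
    """Non-interpolated AP for one class with center-distance matching."""
    if not ground_truth:
        return 0.0
    matched = set()
    tp_flags: List[bool] = []
    # Visit detections from most to least confident.
    for _score, pos in sorted(detections, key=lambda d: -d[0]):
        best_idx, best_dist = None, match_dist_m
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue  # each ground-truth object can be matched only once
            dist = math.hypot(pos[0] - gt[0], pos[1] - gt[1])
            if dist <= best_dist:
                best_idx, best_dist = idx, dist
        if best_idx is None:
            tp_flags.append(False)  # false positive
        else:
            matched.add(best_idx)
            tp_flags.append(True)   # true positive
    # Sum precision at each true positive, normalized by the number of objects.
    ap, tp = 0.0, 0
    for rank, is_tp in enumerate(tp_flags, start=1):
        if is_tp:
            tp += 1
            ap += (tp / rank) / len(ground_truth)
    return ap


gt = [(0.0, 0.0), (10.0, 0.0)]
dets = [(0.9, (0.2, 0.0)), (0.8, (50.0, 50.0)), (0.7, (10.3, 0.0))]
ap = average_precision(dets, gt, match_dist_m=1.0)
print(round(ap, 4))
```

The confident false positive in second place lowers precision for the true positive ranked after it, which is why AP rewards well-calibrated scores and not just the set of boxes. mAP then averages such per-class values.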

Comparison: Perception Representations

Representation	Form	Strengths	Notes
3D bounding box	per-object boxes	clear, easy to handle	weak on out-of-category objects
BEV	top-down 2D map	connects naturally to planning	lacks height info
Occupancy	3D occupancy grid	category-agnostic, arbitrary shapes	high compute
End-to-end representation	learned internal representation	whole-system optimization	hard to interpret/validate

The table is a conceptual comparison; real systems often use several representations together.

Limitations and Caveats

Accuracy and disclosure: commercial systems' details are often undisclosed. It is safer to understand via public concepts than to assert a company's exact implementation.
Long-tail and safety: rare hazardous situations decide safety and remain unsolved.
Sim-to-real gap: simulation and world models are powerful but cannot fully erase the gap with reality.
Interpretability: end-to-end can perform well but is hard to explain, which complicates safety validation.
Recency: this field's SOTA and commercial systems change very fast. This article is for understanding principles; verify specifics in official sources.
Social acceptance: beyond performance, regulation, liability, and public trust are major variables for real deployment.

Closing

Autonomous driving perception has evolved around reconstructing the 3D world from 2D camera images. The flow runs through BEV, which merges multiple cameras into one top-down map; occupancy, which represents spatial occupancy regardless of category; and end-to-end, which optimizes the whole system as one.

Three takeaways: first, perception's fundamental task is reconstructing 3D from 2D, and BEV and occupancy are powerful representations for it. Second, vision-centric and LiDAR fusion are less about superiority than different judgments on cost, safety, and scalability. Third, driving's real challenge lies not in common situations but in rare long-tail ones, where safety is decided. Specific specs move fast, but these principles and a safety-first attitude endure.

References

Lift, Splat, Shoot, LSS (arXiv 2008.05711): arxiv.org/abs/2008.05711
BEVFormer (arXiv 2203.17270): arxiv.org/abs/2203.17270
nuScenes autonomous driving dataset (arXiv 1903.11027): arxiv.org/abs/1903.11027
Planning-oriented Autonomous Driving, UniAD (arXiv 2212.10156): arxiv.org/abs/2212.10156
PointPillars: LiDAR 3D detection (arXiv 1812.05784): arxiv.org/abs/1812.05784
CARLA autonomous driving simulator: carla.org
nuScenes official site: nuscenes.org
Waymo Open Dataset: waymo.com/open