- Authors

- Name
- Youngju Kim
- @fjvbn20031
- Introduction
- Why 3D Perception Is Hard
- Sensors That Recover Depth
- SLAM — Simultaneous Localization and Mapping
- 3D Representations — Point Clouds, Voxels, Occupancy
- 6D Pose Estimation
- Deep-Learning Perception — Detection and Segmentation
- NeRF and 3D Gaussian Splatting
- Coordinate Frames and Transforms — The Hidden Spine of Perception
- Sensor Fusion in Depth
- A Practical Pipeline Sketch
- How Do You Evaluate Perception Performance
- Monocular Depth Estimation — Seeing Depth with One Eye
- An Overview of Dynamic-Environment SLAM
- Pitfalls You Meet in Practice
- Closing
- References
Introduction
The moment a person opens their eyes, they grasp the layout of a room with no visible effort. They know where the desk is, how many steps to the door, whether the cup is within reach. For a robot, none of this is a given. What a camera hands over is a grid of pixels, an array of color numbers. Answering the question "where am I in 3D space, and what is around me" from that raw data is the heart of robot perception.
This article unpacks how a robot sees the world, step by step. We first look at which sensors recover depth, then explain SLAM, which stitches that data into a map. We continue with 3D representations such as point clouds and voxels, with 6D pose estimation that locates and orients objects, with deep-learning detection and segmentation, and finally with what the recently celebrated NeRF and 3D Gaussian Splatting mean for robot perception.
Accuracy comes first, but diagrams and small code snippets are included so the concepts land intuitively.
Why 3D Perception Is Hard
A 2D image is the result of projecting the 3D world onto a plane. Depth is lost in that projection, so a single pixel alone cannot tell you how far its point sits from the camera. This is called depth ambiguity.
World (3D) Image (2D)
┌───────────┐ ┌───────────┐
│ ● Object A (near) │ project │ ● │
│ ● Object B (far)┼────────▶ │ ● │ same pixel location,
│ │ │ │ depth unknown
└───────────┘ └───────────┘
every point along a camera ray maps to one pixel
Recovering depth requires extra information: the disparity between two eyes (stereo), the round-trip time of actively emitted light, or learned prior knowledge (monocular depth estimation). Robot perception can be summed up as the work of "reclaiming the lost dimension."
Sensors That Recover Depth
Stereo cameras
Like a pair of human eyes, two cameras are placed a fixed distance (the baseline) apart. The same object appears at slightly different positions in the two images; that difference is called disparity and is inversely proportional to depth.
Depth Z = (focal length f x baseline B) / disparity d
Left camera Right camera
│ \ / │
│ \ / │
│ \ / │
│ \ ● / │ ● the difference in pixel
│ \ / │ positions between left/right = disparity d
└───B (baseline)──┘
Larger disparity means the object is closer. There is compute cost, and correspondence is hard on textureless surfaces like blank walls, but because it is a passive sensor it works well outdoors.
RGB-D cameras
A sensor that provides both color (RGB) and depth. The common designs are structured light and Time of Flight (ToF). Structured light projects a known pattern onto the scene and reads its distortion; ToF measures the time light takes to return.
RGB-D output = color frame + aligned depth frame
pixel (u, v) ──▶ color (R, G, B)
└─▶ depth Z (meters)
Recover a 3D point using camera intrinsics:
X = (u - cx) * Z / fx
Y = (v - cy) * Z / fy
Z = Z
RGB-D is cheap and convenient indoors, but struggles with strong sunlight (infrared interference) and surfaces like glass or mirrors.
LiDAR
A rotating laser fires in every direction and measures distance from the return time. The result is a very precise 3D point cloud. Widely used in autonomous driving, it is robust to lighting and accurate at range. In exchange it carries no color and is relatively expensive.
| Sensor | Depth principle | Strengths | Limits |
|---|---|---|---|
| Stereo | Disparity of two views | Cheap, outdoor-capable, has color | Weak on textureless surfaces |
| RGB-D (structured/ToF) | Active light / round-trip | Precise indoors, has color | Weak in sunlight and on glass |
| LiDAR | Laser round-trip | Long-range, high precision, lighting-agnostic | Costly, no color |
Real robots combine these to cover each other's weaknesses. This is called sensor fusion.
SLAM — Simultaneous Localization and Mapping
SLAM (Simultaneous Localization and Mapping) is the problem a robot solves while roaming an unknown environment, answering two questions at once. One is "where am I now" (localization); the other is "what does my surroundings look like" (mapping).
The catch is that these form a chicken-and-egg relationship. With an accurate map you can find your position, and with an accurate position you can draw the map, but at the start you have neither. SLAM optimizes both estimates simultaneously.
The SLAM pipeline
Sensor input Front-end Back-end
┌────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Camera │──▶ │ Feature extract/ │──▶ │ Pose graph optim. │
│ LiDAR │ │ match, odometry │ │ Bundle adjustment │
│ IMU │──▶ │ (frame-to-frame) │ │ Apply loop closure│
└────────┘ └────────┬─────────┘ └────────┬─────────┘
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Update local │◀──────│ Align global │
│ map │ │ map │
└──────────────┘ └──────────────┘
│
▼
Loop closure detect ──▶ "been here before" → correct drift
- The front-end quickly estimates relative motion between consecutive frames. This is called visual odometry, and small errors accumulate every frame (drift).
- Loop closure is the step where the robot recognizes it has returned to a previously visited place. Realizing "this is the same hallway from earlier" provides the clue to correct accumulated error all at once.
- The back-end ties poses from many time steps into one large optimization to produce a globally consistent map. Bundle adjustment and pose graph optimization are the core tools.
Error accumulation and loop closure
With visual odometry alone, position estimates drift over time. You may loop 100 meters back to the start, yet on the map you appear several meters off. Loop closure detects this inconsistency and re-adjusts the entire pose graph, spreading the error out.
Before loop closure After loop closure
Start ● Start ●
\ \
\ (drifted │ (error distributed
\ apart) │ evenly along path)
● ← actually same spot ● ← aligned
3D Representations — Point Clouds, Voxels, Occupancy
To work with the depth its sensors provide, a robot must represent it in a suitable data structure.
Point clouds
A set of points scattered in 3D space. Each point carries coordinates (x, y, z) and sometimes color or reflection intensity. It is the most direct representation, but the point count is enormous, the points are unordered, and density is uneven.
import numpy as np
# Back-project a depth image into a point cloud
def depth_to_pointcloud(depth, fx, fy, cx, cy):
h, w = depth.shape
u, v = np.meshgrid(np.arange(w), np.arange(h))
z = depth
x = (u - cx) * z / fx
y = (v - cy) * z / fy
points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
# Drop invalid points where depth is zero
valid = points[:, 2] > 0
return points[valid]
Voxels
A representation that divides 3D space into a uniform cubic grid. Think of it as the 3D version of a 2D pixel. It is regular and easy to handle, but memory grows cubically as resolution rises. To ease this, hierarchical structures such as octree-based OctoMap are used.
Occupancy grid
A map where each cell holds a probability of "occupied / free / unknown." It is especially useful in robot navigation because it directly expresses where the robot can travel.
Occupancy grid (2D slice example)
. = free space # = obstacle ? = unexplored
? ? ? ? ? ? ?
? . . . # # ?
? . R . # . ? R = robot's current position
? . . . . . ?
? # # . . . ?
? ? ? ? ? ? ?
| Representation | Strengths | Limits | Main use |
|---|---|---|---|
| Point cloud | Faithful to raw, precise | Unordered, large | Registration, reconstruction |
| Voxel | Regular, easy to process | Cubic memory growth | 3D deep learning, collision checks |
| Occupancy | Directly expresses traversability | Loses fine shape | Path planning, navigation |
6D Pose Estimation
For a robot that grasps objects (manipulation), knowing only where an object is on screen (2D) is not enough. It must know the 3D position (x, y, z) and the three axes of orientation (roll, pitch, yaw) to reach out accurately. Together these six degrees of freedom form the 6D pose (6-DoF pose).
6D pose = position 3 + rotation 3
z
│ ↺ yaw
│ /
│ /
└──────── y
/│
/ │ ↺ pitch
x │
↺ roll
To grasp: decide where (position) and in which orientation (rotation) to place the hand
There are two broad approaches: registering a known 3D model against the observation (e.g. ICP, Iterative Closest Point), and regressing the pose directly from the image with deep learning. In practice a hybrid is common: deep learning gives an initial estimate, and ICP refines it precisely.
Deep-Learning Perception — Detection and Segmentation
Once you know depth and shape, the next step is semantic perception, knowing "what it is."
- Object detection: locate objects in the image and predict bounding boxes and classes.
- Semantic segmentation: assign a class to every pixel (this pixel is floor, that pixel is a person).
- Instance segmentation: distinguish individual objects even within the same class (person 1, person 2).
Input image Detection Semantic segmentation
┌──────────┐ ┌──────────┐ ┌──────────┐
│ person cup│ ──▶ │ [person] box│ ──▶ │ per-pixel │
│ desk │ │ [cup] box │ │ paint │
└──────────┘ └──────────┘ └──────────┘
In robotics this is fused with 3D information. For example, overlaying the pixel region segmented as "cup" onto the depth frame yields the cup's 3D position, which can go straight into a grasp plan. Recently, prompt-based segmentation (e.g. the Segment Anything family) and other pre-trained general models are increasingly used as the front stage of robot perception. Still, in real-world deployment you must always verify degradation from lighting, occlusion, and domain shift.
NeRF and 3D Gaussian Splatting
Two techniques have recently shaken up the field of 3D reconstruction.
- NeRF (Neural Radiance Fields): learns a scene as a single neural network function from images taken at many viewpoints. Given an arbitrary point in space and a viewing direction, it outputs the color and density there, which are integrated along a ray to render an image from a new viewpoint. It enables very photorealistic reconstruction, but training and rendering were heavy.
- 3D Gaussian Splatting: represents a scene as a set of many 3D Gaussians (soft blobs with position, size, orientation, color, and opacity). It renders by projecting them onto the screen (splatting), which is far faster than NeRF and enables real-time rendering, drawing much attention.
Traditional NeRF 3D Gaussian Splatting
point/voxel continuous fn (MLP) set of 3D Gaussians
┌─┬─┬─┐ f(x, y, z, θ, φ) ○ ◦ ● ◦ ○ (each blob has
├─┼─┼─┤ ──▶ = (color, density) ──▶ ● ◦ ○ ● ◦ position, color, opacity)
└─┴─┴─┘ render by ray integral render by projection (fast)
From a robot's perspective these are appealing. They can build high-quality 3D maps, reconstruct photorealistic scenes for simulation, or predict viewpoints the robot has not yet seen. Research grafting Gaussian representations onto SLAM is indeed active. That said, such representations usually assume static scenes and still carry practical limits for moving objects or large real-time environments. When and where to adopt them should be judged carefully.
Coordinate Frames and Transforms — The Hidden Spine of Perception
The place where the most bugs hide in 3D perception is, surprisingly, coordinate frames. A robot system holds many frames at once. The world a camera sees is the camera frame, the robot's fingertip is the end-effector frame, the robot's torso is the base frame, and a world frame spans the whole room. To "grasp" an object, you must transform the object position perceived in the camera frame into the base frame in which the robot moves its arm.
Transform chain (moving an object from camera to robot base)
world frame (room)
│ T_world_base (where the robot is in the room)
▼
base frame (robot torso) ◀──── arm planning happens here
│ T_base_cam (where the camera is mounted on the robot)
▼
camera frame ────▶ object position (perceived here)
object_base = T_base_cam x object_camera
Such transforms are expressed as 4x4 homogeneous transformation matrices. Rotation and translation are packed into one matrix, so you can cross between frames by matrix multiplication alone. Robot middleware provides tools (e.g. ROS tf) that manage the relationships among these transforms as a tree.
Transform tree (tf tree)
world
└── base_link
├── camera_link
│ └── camera_optical_frame
└── arm_base
└── ... ── end_effector
each edge is one transform. the transform between any two frames is
obtained by multiplying matrices along the tree path.
Common mistakes in transforms are confusing direction (from which frame to which) and getting the axis order wrong when converting between rotation representations (Euler angles, quaternions, rotation matrices). Quaternions are widely used in robotics because they avoid gimbal lock, but they are hard for people to read intuitively.
Sensor Fusion in Depth
We said earlier that multiple sensors cover each other's weaknesses. This fusion has two broad layers.
- Low-level fusion: combine raw measurements directly. For example, align a camera image and LiDAR points into the same frame, so you assign depth to each pixel or color to each 3D point.
- High-level fusion: probabilistically combine the results each sensor estimated independently (e.g. object position).
The core tool in temporal fusion is a state-estimation filter. Notably the Kalman filter and its extensions are widely used for robot position and velocity estimation.
The two steps of a Kalman filter (repeated)
predict update
┌──────────────┐ ┌──────────────┐
│ roll state │ │ correct the │
│ forward with │──────────▶│ prediction │
│ motion model │ │ with new obs │
│ (uncertainty↑)│ │ (uncertainty↓)│
└──────────────┘ └──────────────┘
▲ │
└──────── next cycle ◀───────┘
key: weighted average of prediction uncertainty and observation
uncertainty → put weight on whichever is more trustworthy
An IMU (inertial measurement unit) gives acceleration and angular velocity very fast but drifts over time. Cameras and LiDAR are slow but give an absolute reference. Weaving the two through a filter (e.g. visual-inertial odometry, VIO) yields fast, stable pose estimation. This becomes a strong foundation for the SLAM front-end.
A Practical Pipeline Sketch
Tying the pieces so far into one flow, the typical perception-action loop of a mobile robot looks like this.
Perceive → plan → control loop
[sensors] ──▶ [preprocess] ──▶ [SLAM/state est] ──▶ [semantic perception]
camera denoise position + map detect/segment
LiDAR register/correct affordance
IMU │
▼
[control] ◀── [motion plan] ◀── [task plan] ◀── [scene understanding]
joint cmds collision-free what to do what is where
│
▼
robot acts ──▶ world changes ──▶ (loop repeats)
Each block is filled by the techniques covered in earlier sections. The key point is that this loop must run within a real-time budget. If perception takes 200 milliseconds and the robot moves at 1 meter per second, by the time it decides the world has already moved 20 centimeters. So in practice a design that splits perception into multiple cycles (a fast safety-reaction loop plus a slow semantic-understanding loop) is common.
# Conceptual perception-action loop (near-pseudocode example)
def perception_action_loop(robot):
while robot.is_active():
frame = robot.sensors.grab_synced() # time-synced sensor grab
cloud = preprocess(frame) # denoise, align frames
pose, local_map = slam.update(cloud) # localization + mapping
objects = detector.detect(frame.rgb) # deep-learning detection
objects = lift_to_3d(objects, frame.depth) # 2D detection to 3D
goal = task_planner.decide(objects, pose) # what to do
path = motion_planner.plan(goal, local_map) # collision-free path
robot.controller.follow(path) # obey force/speed limits
This code simplifies the skeleton of a real system. In practice each stage is split into a separate process communicating by messages, with dense exception handling like a safe stop on failure.
How Do You Evaluate Perception Performance
The phrase "perception works well" must be measurable. Different tasks use different metrics.
| Task | Representative metric | Meaning |
|---|---|---|
| Object detection | mAP (mean average precision) | precision-recall balance over thresholds |
| Semantic segmentation | mIoU (mean intersection/union) | overlap of predicted and ground-truth regions |
| Depth estimation | RMSE, absolute relative error | difference between estimated and true depth |
| SLAM | ATE (absolute trajectory error) | divergence of estimated and true path |
| 6D pose | ADD family | distance between model points placed by estimated vs true pose |
There is a caveat when reading metrics. A high benchmark score does not guarantee field performance. The dataset may be biased, and the lighting and object distribution the robot actually meets may differ from the benchmark. So mature teams take, alongside benchmark scores, the success rate on their own robot's real mission as the final metric.
Monocular Depth Estimation — Seeing Depth with One Eye
There is also a way to estimate depth with a single camera, without stereo or LiDAR. Monocular depth estimation predicts depth directly from an image with deep learning. Instead of physical principles, it uses prior knowledge learned from vast data. For instance, the network internalizes cues like "distant objects look small, the horizon is far, and familiar objects (doors, chairs) have roughly fixed sizes."
Principle of monocular depth estimation (intuition)
a single image ──▶ neural net ──▶ per-pixel depth map
│
├─ cue 1: object size (reference familiar objects)
├─ cue 2: perspective, vanishing point
├─ cue 3: blur, occlusion relations
└─ cue 4: density change of texture
note: absolute scale can be ambiguous (often only relative depth is certain)
The limit of the monocular approach is scale ambiguity. From a photo alone it is in principle hard to tell "a small object nearby" from "a large object far away." So in robotics, monocular estimation is often combined with other sensors, or used together with the viewpoint change (motion parallax) produced by robot motion.
An Overview of Dynamic-Environment SLAM
We noted earlier that most classical SLAM assumes a static world. But reality has walking people, rolling carts, and opening doors. Such dynamic elements disturb SLAM in two ways. First, they contaminate the map by wrongly inserting feature points of moving objects. Second, they mislead self-position by using a moving background as reference.
Approaches to dynamic SLAM
1) Detect: what is moving?
- identify dynamic candidates like "person, vehicle" by segmentation
- detect outliers by geometric-consistency violation
│
▼
2) Separate: static background vs dynamic object
- use only static features for SLAM (stable map)
- track dynamic objects separately or ignore them
│
▼
3) (optional) Track: predict dynamic objects separately
- predict where a person will go and use it for avoidance
The core idea is a balance: filter dynamic elements as outliers to protect the static map, yet do not ignore those dynamic elements entirely for safety. Build the map from the static background, and pass moving targets like people to the avoidance plan.
Pitfalls You Meet in Practice
- Calibration: even a small error in a sensor's intrinsic or extrinsic parameters throws off 3D reconstruction badly. Camera-to-LiDAR extrinsic calibration is especially finicky.
- Time synchronization: if timestamps across sensors drift apart, data points refer to different instants while the robot moves fast, and registration breaks.
- Dynamic environments: most classical SLAM assumes a static world. When people or vehicles move, they must be filtered as outliers.
- Real-time constraints: perception must finish within the robot's control cycle. The balance between accuracy and latency is a constant concern.
- Domain shift: a perception model that worked in the lab often collapses at a site with different lighting, materials, and backgrounds. The habit of validating on field data before deployment matters.
Closing
Robot perception begins with the work of "reclaiming the lost third dimension," then stitches that space into a map (SLAM), represents it suitably (point cloud, voxel, occupancy), estimates object poses (6D pose), and assigns meaning (detection, segmentation) across several layers. New representations such as NeRF and 3D Gaussians keep widening the picture.
The key point is that no single technique is a silver bullet. Each sensor has different strengths and weaknesses, and each representation has a different use. A good robot perception system combines them to fit the problem and keeps refining while steadily verifying against real-world pitfalls. In the next article, we look at how a robot that sees the world this way can act safely.