The Robots Eye — 3D Perception and SLAM

Introduction
Why 3D Perception Is Hard
Sensors That Recover Depth
SLAM — Simultaneous Localization and Mapping
- The SLAM pipeline
- Error accumulation and loop closure
3D Representations — Point Clouds, Voxels, Occupancy
6D Pose Estimation
Deep-Learning Perception — Detection and Segmentation
NeRF and 3D Gaussian Splatting
Coordinate Frames and Transforms — The Hidden Spine of Perception
Sensor Fusion in Depth
A Practical Pipeline Sketch
How Do You Evaluate Perception Performance
Monocular Depth Estimation — Seeing Depth with One Eye
An Overview of Dynamic-Environment SLAM
Pitfalls You Meet in Practice
Closing
References

Introduction

The moment a person opens their eyes, they grasp the layout of a room with no visible effort. They know where the desk is, how many steps to the door, whether the cup is within reach. For a robot, none of this is a given. What a camera hands over is a grid of pixels, an array of color numbers. Answering the question "where am I in 3D space, and what is around me" from that raw data is the heart of robot perception.

This article unpacks how a robot sees the world, step by step. We first look at which sensors recover depth, then explain SLAM, which stitches that data into a map. We continue with 3D representations such as point clouds and voxels, with 6D pose estimation that locates and orients objects, with deep-learning detection and segmentation, and finally with what the recently celebrated NeRF and 3D Gaussian Splatting mean for robot perception.

Accuracy comes first, but diagrams and small code snippets are included so the concepts land intuitively.

Why 3D Perception Is Hard

A 2D image is the result of projecting the 3D world onto a plane. Depth is lost in that projection, so a single pixel alone cannot tell you how far its point sits from the camera. This is called depth ambiguity.

     World (3D)                    Image (2D)
   ┌───────────┐               ┌───────────┐
   │    ●  Object A (near)  │  project  │    ●      │
   │       ●  Object B (far)┼────────▶ │    ●      │  same pixel location,
   │                        │           │           │  depth unknown
   └───────────┘               └───────────┘
        every point along a camera ray maps to one pixel

Recovering depth requires extra information: the disparity between two eyes (stereo), the round-trip time of actively emitted light, or learned prior knowledge (monocular depth estimation). Robot perception can be summed up as the work of "reclaiming the lost dimension."

Sensors That Recover Depth

Stereo cameras

Like a pair of human eyes, two cameras are placed a fixed distance (the baseline) apart. The same object appears at slightly different positions in the two images; that difference is called disparity and is inversely proportional to depth.

Depth Z = (focal length f x baseline B) / disparity d

  Left camera     Right camera
    │  \           /  │
    │   \         /   │
    │    \       /    │
    │     \  ●  /     │   ● the difference in pixel
    │      \   /      │     positions between left/right = disparity d
    └───B (baseline)──┘

Larger disparity means the object is closer. There is compute cost, and correspondence is hard on textureless surfaces like blank walls, but because it is a passive sensor it works well outdoors.

RGB-D cameras

A sensor that provides both color (RGB) and depth. The common designs are structured light and Time of Flight (ToF). Structured light projects a known pattern onto the scene and reads its distortion; ToF measures the time light takes to return.

RGB-D output = color frame + aligned depth frame

  pixel (u, v) ──▶ color (R, G, B)
               └─▶ depth Z (meters)

  Recover a 3D point using camera intrinsics:
    X = (u - cx) * Z / fx
    Y = (v - cy) * Z / fy
    Z = Z

RGB-D is cheap and convenient indoors, but struggles with strong sunlight (infrared interference) and surfaces like glass or mirrors.

LiDAR

A rotating laser fires in every direction and measures distance from the return time. The result is a very precise 3D point cloud. Widely used in autonomous driving, it is robust to lighting and accurate at range. In exchange it carries no color and is relatively expensive.

Sensor	Depth principle	Strengths	Limits
Stereo	Disparity of two views	Cheap, outdoor-capable, has color	Weak on textureless surfaces
RGB-D (structured/ToF)	Active light / round-trip	Precise indoors, has color	Weak in sunlight and on glass
LiDAR	Laser round-trip	Long-range, high precision, lighting-agnostic	Costly, no color

Real robots combine these to cover each other's weaknesses. This is called sensor fusion.

SLAM — Simultaneous Localization and Mapping

SLAM (Simultaneous Localization and Mapping) is the problem a robot solves while roaming an unknown environment, answering two questions at once. One is "where am I now" (localization); the other is "what does my surroundings look like" (mapping).

The catch is that these form a chicken-and-egg relationship. With an accurate map you can find your position, and with an accurate position you can draw the map, but at the start you have neither. SLAM optimizes both estimates simultaneously.

The SLAM pipeline

 Sensor input        Front-end                  Back-end
 ┌────────┐    ┌──────────────────┐    ┌──────────────────┐
 │ Camera │──▶ │ Feature extract/  │──▶ │ Pose graph optim. │
 │ LiDAR  │    │  match, odometry  │    │ Bundle adjustment │
 │ IMU    │──▶ │ (frame-to-frame)  │    │ Apply loop closure│
 └────────┘    └────────┬─────────┘    └────────┬─────────┘
                        │                        │
                        ▼                        ▼
                 ┌──────────────┐        ┌──────────────┐
                 │ Update local │◀──────│ Align global │
                 │     map      │        │     map      │
                 └──────────────┘        └──────────────┘
                        │
                        ▼
                 Loop closure detect ──▶ "been here before" → correct drift

The front-end quickly estimates relative motion between consecutive frames. This is called visual odometry, and small errors accumulate every frame (drift).
Loop closure is the step where the robot recognizes it has returned to a previously visited place. Realizing "this is the same hallway from earlier" provides the clue to correct accumulated error all at once.
The back-end ties poses from many time steps into one large optimization to produce a globally consistent map. Bundle adjustment and pose graph optimization are the core tools.

Error accumulation and loop closure

With visual odometry alone, position estimates drift over time. You may loop 100 meters back to the start, yet on the map you appear several meters off. Loop closure detects this inconsistency and re-adjusts the entire pose graph, spreading the error out.

Before loop closure               After loop closure
   Start ●                          Start ●
        \                              \
         \  (drifted                    │  (error distributed
          \   apart)                    │   evenly along path)
           ● ← actually same spot       ● ← aligned

3D Representations — Point Clouds, Voxels, Occupancy

To work with the depth its sensors provide, a robot must represent it in a suitable data structure.

Point clouds

A set of points scattered in 3D space. Each point carries coordinates (x, y, z) and sometimes color or reflection intensity. It is the most direct representation, but the point count is enormous, the points are unordered, and density is uneven.

import numpy as np

# Back-project a depth image into a point cloud
def depth_to_pointcloud(depth, fx, fy, cx, cy):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Drop invalid points where depth is zero
    valid = points[:, 2] > 0
    return points[valid]

Voxels

A representation that divides 3D space into a uniform cubic grid. Think of it as the 3D version of a 2D pixel. It is regular and easy to handle, but memory grows cubically as resolution rises. To ease this, hierarchical structures such as octree-based OctoMap are used.

Occupancy grid

A map where each cell holds a probability of "occupied / free / unknown." It is especially useful in robot navigation because it directly expresses where the robot can travel.

Occupancy grid (2D slice example)
   . = free space   # = obstacle   ? = unexplored

   ? ? ? ? ? ? ?
   ? . . . # # ?
   ? . R . # . ?      R = robot's current position
   ? . . . . . ?
   ? # # . . . ?
   ? ? ? ? ? ? ?

Representation	Strengths	Limits	Main use
Point cloud	Faithful to raw, precise	Unordered, large	Registration, reconstruction
Voxel	Regular, easy to process	Cubic memory growth	3D deep learning, collision checks
Occupancy	Directly expresses traversability	Loses fine shape	Path planning, navigation

6D Pose Estimation

For a robot that grasps objects (manipulation), knowing only where an object is on screen (2D) is not enough. It must know the 3D position (x, y, z) and the three axes of orientation (roll, pitch, yaw) to reach out accurately. Together these six degrees of freedom form the 6D pose (6-DoF pose).

6D pose = position 3 + rotation 3

        z
        │   ↺ yaw
        │  /
        │ /
        └──────── y
       /│
      / │  ↺ pitch
     x  │
        ↺ roll

  To grasp: decide where (position) and in which orientation (rotation) to place the hand

There are two broad approaches: registering a known 3D model against the observation (e.g. ICP, Iterative Closest Point), and regressing the pose directly from the image with deep learning. In practice a hybrid is common: deep learning gives an initial estimate, and ICP refines it precisely.

Deep-Learning Perception — Detection and Segmentation

Once you know depth and shape, the next step is semantic perception, knowing "what it is."

Object detection: locate objects in the image and predict bounding boxes and classes.
Semantic segmentation: assign a class to every pixel (this pixel is floor, that pixel is a person).
Instance segmentation: distinguish individual objects even within the same class (person 1, person 2).

   Input image               Detection          Semantic segmentation
  ┌──────────┐          ┌──────────┐        ┌──────────┐
  │ person cup│   ──▶   │ [person] box│  ──▶ │ per-pixel │
  │   desk    │          │ [cup] box  │        │ paint     │
  └──────────┘          └──────────┘        └──────────┘

In robotics this is fused with 3D information. For example, overlaying the pixel region segmented as "cup" onto the depth frame yields the cup's 3D position, which can go straight into a grasp plan. Recently, prompt-based segmentation (e.g. the Segment Anything family) and other pre-trained general models are increasingly used as the front stage of robot perception. Still, in real-world deployment you must always verify degradation from lighting, occlusion, and domain shift.

NeRF and 3D Gaussian Splatting

Two techniques have recently shaken up the field of 3D reconstruction.

NeRF (Neural Radiance Fields): learns a scene as a single neural network function from images taken at many viewpoints. Given an arbitrary point in space and a viewing direction, it outputs the color and density there, which are integrated along a ray to render an image from a new viewpoint. It enables very photorealistic reconstruction, but training and rendering were heavy.
3D Gaussian Splatting: represents a scene as a set of many 3D Gaussians (soft blobs with position, size, orientation, color, and opacity). It renders by projecting them onto the screen (splatting), which is far faster than NeRF and enables real-time rendering, drawing much attention.

Traditional          NeRF                 3D Gaussian Splatting
 point/voxel      continuous fn (MLP)      set of 3D Gaussians
 ┌─┬─┬─┐          f(x, y, z, θ, φ)         ○ ◦ ● ◦ ○  (each blob has
 ├─┼─┼─┤    ──▶   = (color, density)  ──▶  ● ◦ ○ ● ◦   position, color, opacity)
 └─┴─┴─┘          render by ray integral   render by projection (fast)

From a robot's perspective these are appealing. They can build high-quality 3D maps, reconstruct photorealistic scenes for simulation, or predict viewpoints the robot has not yet seen. Research grafting Gaussian representations onto SLAM is indeed active. That said, such representations usually assume static scenes and still carry practical limits for moving objects or large real-time environments. When and where to adopt them should be judged carefully.

Coordinate Frames and Transforms — The Hidden Spine of Perception

The place where the most bugs hide in 3D perception is, surprisingly, coordinate frames. A robot system holds many frames at once. The world a camera sees is the camera frame, the robot's fingertip is the end-effector frame, the robot's torso is the base frame, and a world frame spans the whole room. To "grasp" an object, you must transform the object position perceived in the camera frame into the base frame in which the robot moves its arm.

Transform chain (moving an object from camera to robot base)

  world frame (room)
     │  T_world_base (where the robot is in the room)
     ▼
  base frame (robot torso) ◀──── arm planning happens here
     │  T_base_cam (where the camera is mounted on the robot)
     ▼
  camera frame ────▶ object position (perceived here)

  object_base = T_base_cam x object_camera

Such transforms are expressed as 4x4 homogeneous transformation matrices. Rotation and translation are packed into one matrix, so you can cross between frames by matrix multiplication alone. Robot middleware provides tools (e.g. ROS tf) that manage the relationships among these transforms as a tree.

Transform tree (tf tree)
   world
     └── base_link
           ├── camera_link
           │     └── camera_optical_frame
           └── arm_base
                 └── ... ── end_effector

  each edge is one transform. the transform between any two frames is
  obtained by multiplying matrices along the tree path.

Common mistakes in transforms are confusing direction (from which frame to which) and getting the axis order wrong when converting between rotation representations (Euler angles, quaternions, rotation matrices). Quaternions are widely used in robotics because they avoid gimbal lock, but they are hard for people to read intuitively.

Sensor Fusion in Depth

We said earlier that multiple sensors cover each other's weaknesses. This fusion has two broad layers.

Low-level fusion: combine raw measurements directly. For example, align a camera image and LiDAR points into the same frame, so you assign depth to each pixel or color to each 3D point.
High-level fusion: probabilistically combine the results each sensor estimated independently (e.g. object position).

The core tool in temporal fusion is a state-estimation filter. Notably the Kalman filter and its extensions are widely used for robot position and velocity estimation.

The two steps of a Kalman filter (repeated)

  predict                      update
  ┌──────────────┐            ┌──────────────┐
  │ roll state    │            │ correct the   │
  │ forward with   │──────────▶│ prediction    │
  │ motion model  │            │ with new obs   │
  │ (uncertainty↑)│            │ (uncertainty↓)│
  └──────────────┘            └──────────────┘
        ▲                            │
        └──────── next cycle ◀───────┘

  key: weighted average of prediction uncertainty and observation
       uncertainty → put weight on whichever is more trustworthy

An IMU (inertial measurement unit) gives acceleration and angular velocity very fast but drifts over time. Cameras and LiDAR are slow but give an absolute reference. Weaving the two through a filter (e.g. visual-inertial odometry, VIO) yields fast, stable pose estimation. This becomes a strong foundation for the SLAM front-end.

A Practical Pipeline Sketch

Tying the pieces so far into one flow, the typical perception-action loop of a mobile robot looks like this.

Perceive → plan → control loop

 [sensors] ──▶ [preprocess] ──▶ [SLAM/state est] ──▶ [semantic perception]
 camera        denoise           position + map        detect/segment
 LiDAR         register/correct                        affordance
 IMU                                                     │
                                                         ▼
 [control] ◀── [motion plan] ◀── [task plan] ◀── [scene understanding]
 joint cmds     collision-free    what to do        what is where
    │
    ▼
 robot acts ──▶ world changes ──▶ (loop repeats)

Each block is filled by the techniques covered in earlier sections. The key point is that this loop must run within a real-time budget. If perception takes 200 milliseconds and the robot moves at 1 meter per second, by the time it decides the world has already moved 20 centimeters. So in practice a design that splits perception into multiple cycles (a fast safety-reaction loop plus a slow semantic-understanding loop) is common.

# Conceptual perception-action loop (near-pseudocode example)
def perception_action_loop(robot):
    while robot.is_active():
        frame = robot.sensors.grab_synced()      # time-synced sensor grab
        cloud = preprocess(frame)                 # denoise, align frames
        pose, local_map = slam.update(cloud)      # localization + mapping
        objects = detector.detect(frame.rgb)      # deep-learning detection
        objects = lift_to_3d(objects, frame.depth)  # 2D detection to 3D
        goal = task_planner.decide(objects, pose) # what to do
        path = motion_planner.plan(goal, local_map)  # collision-free path
        robot.controller.follow(path)             # obey force/speed limits

This code simplifies the skeleton of a real system. In practice each stage is split into a separate process communicating by messages, with dense exception handling like a safe stop on failure.

How Do You Evaluate Perception Performance

The phrase "perception works well" must be measurable. Different tasks use different metrics.

Task	Representative metric	Meaning
Object detection	mAP (mean average precision)	precision-recall balance over thresholds
Semantic segmentation	mIoU (mean intersection/union)	overlap of predicted and ground-truth regions
Depth estimation	RMSE, absolute relative error	difference between estimated and true depth
SLAM	ATE (absolute trajectory error)	divergence of estimated and true path
6D pose	ADD family	distance between model points placed by estimated vs true pose

There is a caveat when reading metrics. A high benchmark score does not guarantee field performance. The dataset may be biased, and the lighting and object distribution the robot actually meets may differ from the benchmark. So mature teams take, alongside benchmark scores, the success rate on their own robot's real mission as the final metric.

Monocular Depth Estimation — Seeing Depth with One Eye

There is also a way to estimate depth with a single camera, without stereo or LiDAR. Monocular depth estimation predicts depth directly from an image with deep learning. Instead of physical principles, it uses prior knowledge learned from vast data. For instance, the network internalizes cues like "distant objects look small, the horizon is far, and familiar objects (doors, chairs) have roughly fixed sizes."

Principle of monocular depth estimation (intuition)

  a single image ──▶ neural net ──▶ per-pixel depth map
                     │
                     ├─ cue 1: object size (reference familiar objects)
                     ├─ cue 2: perspective, vanishing point
                     ├─ cue 3: blur, occlusion relations
                     └─ cue 4: density change of texture

  note: absolute scale can be ambiguous (often only relative depth is certain)

The limit of the monocular approach is scale ambiguity. From a photo alone it is in principle hard to tell "a small object nearby" from "a large object far away." So in robotics, monocular estimation is often combined with other sensors, or used together with the viewpoint change (motion parallax) produced by robot motion.

An Overview of Dynamic-Environment SLAM

We noted earlier that most classical SLAM assumes a static world. But reality has walking people, rolling carts, and opening doors. Such dynamic elements disturb SLAM in two ways. First, they contaminate the map by wrongly inserting feature points of moving objects. Second, they mislead self-position by using a moving background as reference.

Approaches to dynamic SLAM

  1) Detect: what is moving?
     - identify dynamic candidates like "person, vehicle" by segmentation
     - detect outliers by geometric-consistency violation
              │
              ▼
  2) Separate: static background vs dynamic object
     - use only static features for SLAM (stable map)
     - track dynamic objects separately or ignore them
              │
              ▼
  3) (optional) Track: predict dynamic objects separately
     - predict where a person will go and use it for avoidance

The core idea is a balance: filter dynamic elements as outliers to protect the static map, yet do not ignore those dynamic elements entirely for safety. Build the map from the static background, and pass moving targets like people to the avoidance plan.

Pitfalls You Meet in Practice

Calibration: even a small error in a sensor's intrinsic or extrinsic parameters throws off 3D reconstruction badly. Camera-to-LiDAR extrinsic calibration is especially finicky.
Time synchronization: if timestamps across sensors drift apart, data points refer to different instants while the robot moves fast, and registration breaks.
Dynamic environments: most classical SLAM assumes a static world. When people or vehicles move, they must be filtered as outliers.
Real-time constraints: perception must finish within the robot's control cycle. The balance between accuracy and latency is a constant concern.
Domain shift: a perception model that worked in the lab often collapses at a site with different lighting, materials, and backgrounds. The habit of validating on field data before deployment matters.

Closing

Robot perception begins with the work of "reclaiming the lost third dimension," then stitches that space into a map (SLAM), represents it suitably (point cloud, voxel, occupancy), estimates object poses (6D pose), and assigns meaning (detection, segmentation) across several layers. New representations such as NeRF and 3D Gaussians keep widening the picture.

The key point is that no single technique is a silver bullet. Each sensor has different strengths and weaknesses, and each representation has a different use. A good robot perception system combines them to fit the problem and keeps refining while steadily verifying against real-world pitfalls. In the next article, we look at how a robot that sees the world this way can act safely.