
Analyzing SOTA 3D Vision: Monocular Depth and 3D Gaussian Splatting


Introduction

A 2D image is a 3D world pressed flat. The camera discards depth at the moment of capture, and we then struggle to recover that lost dimension. 3D vision can be split into two questions. One is depth estimation: "how far is each pixel?" The other is 3D representation and rendering: "can we redraw this scene from an arbitrary viewpoint?"

These two questions call for different tools. Depth estimation infers distance from a single camera's image, resembling the human ability to feel perspective with one eye. 3D representation and rendering gathers information from multiple viewpoints to store the whole scene and redraw it from a desired angle. This article follows the two questions in turn, examining how each field's representative ideas have connected over time.

This article organizes both axes around architectural principles. It covers monocular depth estimation and its representative family, an overview of stereo and multi-view methods, and the arc from NeRF to 3D Gaussian Splatting for real-time rendering. Because the state of the art in 3D vision shifts quickly, I avoid firm claims about any model's ranking or numbers and explain things conceptually.

Let me settle one term up front. A depth map is often called "2.5D." It holds each pixel's distance, but it says nothing about surfaces that are occluded or outside the camera's view, so it is not full 3D. In contrast, what NeRF or Gaussian Splatting builds is a 3D representation that can be redrawn from arbitrary viewpoints. This is why this article distinguishes "depth" from "3D representation." The two are connected, but they differ in the completeness of the information they hold.

Defining the Depth Estimation Problem

Depth estimation predicts the distance to each pixel from an image. Approaches divide mainly by input configuration.

- **Monocular depth**: Estimate depth from a single image. This is fundamentally an **ill-posed** problem, because the same 2D image can arise from many 3D scenes. The model fills this ambiguity with learned priors (perspective, object size, texture, occlusion relations, and so on).

- **Stereo**: Compute depth by triangulating the disparity between two cameras. With clear geometry, it is a relatively well-posed problem.

- **Multi-view stereo (MVS)**: Reconstruct a scene's 3D structure from images at multiple viewpoints.

depth estimation input configurations (concept)

monocular: [one image] --> infer depth from priors (ambiguity present)

stereo: [left][right] --> triangulate via disparity (clear geometry)

MVS: [many views] --> reconstruct 3D via multi-view matching

Relative Depth and Metric Depth

A distinction that comes up often in monocular depth is this.

- **Relative depth**: Gives ordering and ratio information like "A is closer than B." There is no absolute metric value. It is often sufficient for scene understanding and editing.

- **Metric depth**: Gives depth in actual physical units like "this pixel is 3.2 meters." It matters for applications like robotics and surveying that need absolute distance, but is reliable only when combined with information like camera focal length.

Obtaining stable metric depth monocularly is harder than relative depth, due to scale ambiguity. In practice one first obtains relative depth and corrects the scale with separate information, or uses learning and calibration aimed directly at metric depth.
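The "relative depth first, then correct the scale" route can be sketched as a small least-squares fit. This is a minimal illustration, assuming a handful of pixels have trusted metric measurements (for example from a depth sensor) and that relative and metric depth differ only by an unknown scale `s` and shift `t`. The function name `align_scale_shift` and the toy data are hypothetical. Some models predict inverse depth instead, in which case the same fit would be done in inverse-depth space.

```python
import numpy as np

def align_scale_shift(relative, metric, mask):
    """Fit metric ~= s * relative + t by least squares over the masked pixels."""
    x = relative[mask].ravel()
    y = metric[mask].ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)  # columns: [relative, 1]
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s, t

# Toy data (assumption): the true metric depth is 2.0 * relative + 0.5 meters.
rng = np.random.default_rng(0)
relative = rng.uniform(0.1, 1.0, size=(4, 6))
metric = 2.0 * relative + 0.5

# Only a few pixels carry sensor measurements.
mask = np.zeros_like(relative, dtype=bool)
mask[::2, ::3] = True

s, t = align_scale_shift(relative, metric, mask)
metric_estimate = s * relative + t  # dense metric depth for every pixel
```

With noisy real measurements the fit would not be exact, and a robust estimator may be preferable to plain least squares.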

The Cues Depth Uses

There are visual cues a monocular model relies on when "guessing" depth. These cues are not much different from the principles by which a person feels perspective even with one eye.

- **Perspective**: The property that parallel lines converge to a point in the distance. A powerful cue in scenes like roads and corridors.

- **Relative size**: For objects of the same kind, the one that looks smaller is inferred to be farther.

- **Occlusion**: If A hides B, A is in front. It gives ordering information.

- **Texture gradient**: The property that surface patterns look denser as they recede.

- **Shading and shadows**: The direction of light and shadow hints at surface shape and distance.

The model learns and combines such cues from data on its own. So in unfamiliar scenes absent from training data (e.g., unusual scale, reflective surfaces), the cues can go awry and errors can grow.

The Depth Anything Family Concept

One notable trend in monocular depth recently is **general-purpose depth models trained on large data**. The Depth Anything family can be understood as the representative concept of this direction. Generalizing the core ideas:

- **Large, diverse data**: Labeled depth data is expensive, so these models lean on large amounts of unlabeled images (e.g., a student learning from a teacher model's predictions) to generalize across varied scenes.

- **Strong backbone representation**: Placing a depth prediction head atop a backbone that represents images well yields plausible depth even for unseen scenes.

- **Zero-shot generalization**: Rather than overfitting to a specific dataset, it targets relative depth usable directly across diverse domains.

training a general-purpose monocular depth model (concept)

large amounts of unlabeled images

|

[use teacher model's depth predictions like labels]

|

train strong backbone + depth head

|

result: plausible relative depth even for unseen scenes

(metric depth needs extra info / calibration)

Note that such models do not guarantee "perfect metric depth for any scene." Relative depth quality can be impressive, but errors can grow when absolute scale is required, in rare scenes, and on reflective or transparent surfaces. Since detailed performance varies with data and settings, it is safer to understand this in a generalized way.

The reason this family is attractive in practice is that it obtains plausible depth from the minimal input of "one camera, one image." It needs no two precisely aligned cameras like stereo, nor multi-view capture like MVS. So it is useful for retroactively giving depth to existing footage and photos, or gaining a sense of 3D on a device with only a single camera. But the price of that convenience is carrying the uncertainty of absolute scale, and where precise distance is needed, calibration or additional sensors are required.

Stereo and MVS Overview

If monocular relies on priors, stereo and MVS directly exploit **geometry**.

- **Stereo matching**: Find the same point (correspondence) in left and right images, obtain disparity, and triangulate depth from disparity, camera baseline, and focal length. Deep stereo improves correspondence search and matching cost computation through learning.

- **MVS (Multi-View Stereo)**: Reconstruct a scene's dense 3D structure using multiple viewpoint images and camera poses. Reconstructing a building in 3D from hundreds of tourist photos is a representative task.

- **SfM (Structure from Motion)**: Jointly estimate camera poses and sparse 3D points from multiple images. It is often the front stage of the MVS and 3D reconstruction pipeline.

- **Depth completion**: Combine sparse depth measurements (e.g., LiDAR points) with an image to make a dense depth map. A practical approach using sensors and learning together.

Geometry-based methods are accurate but struggle to find correspondences on textureless surfaces, reflections, and occlusions. That is why attempts to combine monocular (priors) and geometry (stereo/MVS) continue.
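The triangulation step in stereo matching reduces to one formula for a rectified pair: depth equals focal length times baseline divided by disparity. The sketch below assumes a rectified camera pair, focal length in pixels, and baseline in meters. The function name `disparity_to_depth` and the example numbers are illustrative, not taken from any particular library.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m, min_disparity=1e-6):
    """Z = f * B / d for a rectified stereo pair; invalid disparities map to inf."""
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full_like(disparity_px, np.inf)
    valid = disparity_px > min_disparity  # zero disparity means "no match" or infinitely far
    depth[valid] = focal_px * baseline_m / disparity_px[valid]
    return depth

# Example values (assumptions): 800 px focal length, 12 cm baseline.
disparity = np.array([[64.0, 32.0], [8.0, 0.0]])
depth = disparity_to_depth(disparity, focal_px=800.0, baseline_m=0.12)
```

Because depth is inversely proportional to disparity, a one-pixel matching error matters far more for distant points (small disparity) than for near ones, which is one reason the hard cases above hurt accuracy.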

To summarize the difference in character in one line: monocular "guesses from what it has seen," while geometry "measures from multiple angles and computes." Monocular needs minimal input but wavers in unfamiliar scenes absent from its training data; geometry is accurate but requires conditions for finding correspondences (enough texture, overlapping views). So a natural direction of mutual complement has emerged: fill geometry's gaps (textureless regions and the like) with monocular's strong priors, and pin down monocular's ambiguity with geometry's accurate scale.

monocular vs. geometry (character comparison)

monocular: guess from what it has seen --> minimal input, weak on novel scenes

geometry: measure from multiple angles and compute --> accurate, needs correspondence conditions

combine: monocular fills the gaps, geometry pins down the scale

3D Representation and Rendering: From NeRF to 3D Gaussian Splatting

If depth is "the distance of each pixel," 3D representation and rendering is the problem of "capturing the whole scene and redrawing it from arbitrary viewpoints." Two trends that greatly changed this field are NeRF and 3D Gaussian Splatting.

NeRF: Mapping Coordinates to Color and Density

NeRF (Neural Radiance Fields) represents a scene with a single neural network. It learns a function that takes a 3D coordinate and view direction and outputs the **color and density** at that point. To render an image, it shoots a ray from each pixel, obtains the color and density of several points along the ray via the network, and accumulates them (volume rendering) to form the final color.

NeRF rendering (concept)

pixel --> shoot a ray into the scene

|

sample several points along the ray

|

each point: [network](coord, direction) --> (color, density)

|

accumulate colors via volume rendering --> pixel color

- **Strength**: Very high novel view synthesis quality, and the representation is continuous.

- **Limit**: Rendering must evaluate the network many times per ray, so it is **slow**. Heavy training and inference were early NeRF's representative weakness. Various acceleration works followed.
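The accumulation step can be written compactly. The discrete form below is the one commonly used for NeRF-style volume rendering. The symbols are introduced here for illustration: $N$ samples along ray $\mathbf{r}$, density $\sigma_i$ and color $\mathbf{c}_i$ returned by the network at sample $i$, and spacing $\delta_i$ between adjacent samples.

$$
\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i,
\qquad
T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)
$$

Here $T_i$ is the transmittance, the fraction of light that survives all samples in front of sample $i$. Every term requires a network evaluation, which is where the per-ray cost noted above comes from.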

NeRF's arrival mattered because the very idea "a single network can hold an entire scene" was new. It showed that a scene can be represented in the form of a function and arbitrary viewpoints synthesized, without explicitly storing points or meshes. This idea became the starting point of countless follow-up works, and within that flow, Gaussian Splatting can be seen as directly targeting the speed problem.

3D Gaussian Splatting: Real-Time Rendering via Explicit Representation

3D Gaussian Splatting represents a scene not as a neural network function but explicitly as a set of **many 3D Gaussians (ellipsoid-shaped translucent blobs)**. Each Gaussian has attributes like position, size and shape (covariance), color, and opacity. Rendering projects these Gaussians onto the screen (splatting) and composites them quickly.

3D Gaussian Splatting (concept)

scene = many 3D Gaussians (position, shape, color, opacity)

|

rendering: project (splat) Gaussians onto screen, sort and composite

|

training: compare rendered result with real images to

optimize each Gaussian's attributes

|

result: high quality + near real-time rendering speed

- **Key difference**: If NeRF is an **implicit** representation that maps coordinates to color and density, Gaussian Splatting is an **explicit** representation that places blobs directly.

- **Strength**: Rendering is fast. Since it projects and composites explicit blobs, with good optimization it produces high-quality views at near real-time speed. Editing and manipulation are also intuitive.

- **Limit**: A large number of Gaussians burdens memory, and handling reflections, transparency, and dynamic scenes needs further research. Detailed quality and speed vary with scene and implementation.
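To make "size and shape (covariance)" and "project onto the screen" concrete, here is a minimal sketch. It assumes each Gaussian's covariance is parameterized by a per-axis scale and a rotation quaternion, and that the Gaussian is already expressed in the camera frame with a pinhole camera. The function names and numbers are illustrative, and a real renderer adds a world-to-camera transform, culling, and tiling on top of this.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def covariance_3d(scale, quat):
    """Sigma = R S S^T R^T, symmetric positive semi-definite by construction."""
    R = quat_to_rotmat(np.asarray(quat, dtype=np.float64))
    S = np.diag(scale)
    return R @ S @ S.T @ R.T

def project_covariance(cov3d, mean_cam, fx, fy):
    """Approximate 2D screen-space covariance J Sigma J^T (camera-frame input)."""
    x, y, z = mean_cam
    # Jacobian of the pinhole projection (u, v) = (fx * x / z, fy * y / z).
    J = np.array([
        [fx / z, 0.0, -fx * x / z**2],
        [0.0, fy / z, -fy * y / z**2],
    ])
    return J @ cov3d @ J.T

# An axis-aligned ellipsoid 2 m straight ahead of the camera (toy values).
cov3d = covariance_3d(scale=[0.2, 0.1, 0.05], quat=[1.0, 0.0, 0.0, 0.0])
cov2d = project_covariance(cov3d, mean_cam=(0.0, 0.0, 2.0), fx=500.0, fy=500.0)
```

Parameterizing by scale and rotation rather than by raw matrix entries keeps the covariance valid throughout optimization, since any scale and any quaternion yield a proper ellipsoid.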

NeRF vs. 3D Gaussian Splatting

| Item | NeRF | 3D Gaussian Splatting |
|---|---|---|
| Representation | Implicit (network function) | Explicit (set of Gaussians) |
| Rendering speed | Relatively slow | Fast (can approach real-time) |
| View synthesis quality | Very high | Very high |
| Ease of editing | Relatively hard | Intuitive |
| Memory | Network weights | Proportional to number of Gaussians |

Treat the table's speed and quality as general tendencies; actual values vary greatly with scene complexity, resolution, and implementation.
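A back-of-the-envelope calculation shows what "proportional to number of Gaussians" can mean in practice. The per-Gaussian layout below is an assumption for illustration: position, per-axis scale, a rotation quaternion, opacity, and view-dependent color stored as spherical-harmonic coefficients, all as 32-bit floats. Actual implementations differ in layout, precision, and compression.

```python
def gaussian_memory_mb(num_gaussians, sh_degree=3, bytes_per_float=4):
    """Rough parameter memory (in MB) for a set of Gaussians under the assumed layout."""
    sh_coeffs = 3 * (sh_degree + 1) ** 2  # RGB spherical-harmonic coefficients
    # position (3) + scale (3) + rotation quaternion (4) + opacity (1) + color
    floats = 3 + 3 + 4 + 1 + sh_coeffs
    return num_gaussians * floats * bytes_per_float / 1e6

for n in (100_000, 1_000_000, 5_000_000):
    print(f"{n:>9,d} Gaussians -> {gaussian_memory_mb(n):8.1f} MB")
```

Under these assumptions the color coefficients dominate the footprint, so the way color is stored has more leverage on memory than the geometric attributes.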

A Bit More on Volume Rendering

To understand NeRF and Gaussian Splatting, you need to look a bit more at the rendering process of "how the colors of several points are combined into one pixel color."

The core intuition is "an opaque thing in front hides what is behind." As a ray passes through several points, each point has a color and "how opaque it is." If a front point is opaque, the color of a rear point is barely visible. Conversely, if the front is transparent, the rear shows through. The final pixel color is the result of weighting and summing each point's color reflecting this front-to-back relationship.

the intuition of volume rendering (concept)

ray: camera --> point1 --> point2 --> point3 --> ...

each point: (color, opacity)

if front is opaque --> rear point barely visible

if front is transparent --> rear point shows through

pixel color = weighted sum of colors reflecting front-back occlusion

NeRF computes this "each point's color and opacity" on the fly with a network. So the representation is continuous and smooth but slow, calling the network many times per ray. Gaussian Splatting has each Gaussian already holding color and opacity as attributes, so it only needs to project, sort, and composite without network calls. This difference is the root cause of the speed gap.
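The "weighted sum reflecting front-back occlusion" can be sketched in a few lines. This is a minimal illustration, assuming the samples are already sorted from near to far and each has an opacity in [0, 1]. The function name `composite_front_to_back` is hypothetical.

```python
import numpy as np

def composite_front_to_back(colors, alphas):
    """Blend samples ordered near-to-far; returns (pixel_color, weights)."""
    colors = np.asarray(colors, dtype=np.float64)  # shape (N, 3)
    alphas = np.asarray(alphas, dtype=np.float64)  # shape (N,), each in [0, 1]
    # Transmittance: fraction of light that survives all samples in front of i.
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = transmittance * alphas
    return weights @ colors, weights

red, green = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]

# Opaque red in front: the green sample behind contributes nothing.
opaque_front, _ = composite_front_to_back([red, green], [1.0, 1.0])

# Half-transparent red in front: green shows through.
see_through, weights = composite_front_to_back([red, green], [0.5, 1.0])
```

The same blending rule serves both methods. What differs is where the per-sample opacity comes from: a network-predicted density in NeRF, versus a stored opacity modulated by the projected Gaussian's falloff in Gaussian Splatting.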

Trade-offs of Representation

Implicit representation (NeRF family) and explicit representation (Gaussian Splatting) each have clear strengths and weaknesses.

- **Implicit representation**: Holds a scene as a single function, so memory is compact, and being continuous is favorable for smooth surface representation. In return, rendering calls the function a lot so it is slow, and editing only a specific part is hard.

- **Explicit representation**: Places blobs (or points, voxels, meshes) directly, so rendering is fast, and editing like moving or deleting specific blobs is intuitive. In return, as the number of blobs grows memory grows, and densely filling empty space can be hard.

representation spectrum (concept)

implicit (function) <-----------------------> explicit (blobs/points)

implicit end (NeRF family): compact, continuous, slow rendering

explicit end (Gaussian Splatting): fast, easy to edit, memory burden

The interesting point is that the two are not exclusive. Hybrid attempts to mix the strengths of both representations continue, and which representation is "correct" depends on the application's requirements (speed, editability, memory, quality).

The Problem of Dynamic Scenes and Time

The NeRF and Gaussian Splatting covered so far basically assume a "static scene." That means the scene must not move while photos are taken from multiple viewpoints. Yet many real scenes move. People walk, leaves sway, water flows.

Approaches to handle dynamic scenes generally add the axis of "time" to the representation.

- **Deformation field**: Keep a reference static scene and separately learn how it deforms over time.

- **Time conditioning**: The representation takes a time input so the same location can yield different color and shape depending on the moment.

Dynamic scenes are also hard on the data side. Capturing a moving target from multiple angles simultaneously needs multiple cameras, and with a single camera, viewpoint and time become entangled: every frame differs in both, so it is difficult to tell camera motion from scene motion. This field is under active research, and detailed methods and performance change quickly, so it is safer to understand it at a conceptual level.

The Learning and Data Angle

- **Depth training data**: Metric depth labels come from LiDAR and depth sensors but are expensive. So strategies using relative depth, synthetic data, and self-supervision signals (stereo or video consistency) are widely used. Synthetic data in particular comes with exact depth labels from the renderer at no extra labeling cost, but handling the domain gap with reality is the challenge.

- **3D reconstruction data**: NeRF and Gaussian Splatting need multiple viewpoint images and accurate camera poses. The quality of pose estimation (SfM) greatly affects the final result.

- **Generalization vs. specialization**: Precisely reconstructing one scene and generalizing directly to unseen scenes are different goals. The choice varies by application.

The 3D Reconstruction Pipeline

Following a typical pipeline for reconstructing 3D from multiple photos at a conceptual level lets you understand why each stage is needed.

from photos to 3D (concept pipeline)

photos from multiple viewpoints

|

[feature extraction/matching] find feature points in each photo, link across photos

|

[SfM] jointly estimate camera poses + sparse 3D points

|

[dense reconstruction] make a dense representation with MVS or NeRF/Gaussian Splatting

|

result: 3D point cloud/mesh or a renderable representation

- **Feature extraction/matching**: Find prominent points in each photo and link the same point across different photos. This correspondence is the basis of all later computation.

- **SfM (Structure from Motion)**: From the correspondences, solve where each photo was taken (camera pose) and sparse 3D points together.

- **Dense reconstruction**: Once poses are known, make a dense point cloud with MVS or learn a renderable representation with NeRF/Gaussian Splatting.

A point often missed here is that the quality of pose estimation governs the final result. If camera poses are inaccurate, no matter how good the rendering method, the result comes out blurry or misaligned. So in practice the capture stage (enough overlap, varied angles, minimal shake) matters greatly.
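One common way to quantify pose quality is reprojection error: project the reconstructed 3D points through the estimated camera and measure how far they land from the observed keypoints. The sketch below assumes a pinhole camera with intrinsics `K` and a world-to-camera pose `(R, t)`. The helper names and toy values are illustrative.

```python
import numpy as np

def project(points_world, R, t, K):
    """Project Nx3 world points to pixels with pose (R, t) and intrinsics K."""
    points_cam = points_world @ R.T + t  # world -> camera frame
    uvw = points_cam @ K.T               # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]      # perspective divide

def mean_reprojection_error(points_world, observed_px, R, t, K):
    """Mean pixel distance between projected points and observed keypoints."""
    diff = project(points_world, R, t, K) - observed_px
    return np.linalg.norm(diff, axis=1).mean()

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R_true, t_true = np.eye(3), np.zeros(3)
points = np.array([[0.0, 0.0, 4.0], [1.0, -0.5, 5.0], [-1.0, 0.5, 6.0]])
observed = project(points, R_true, t_true, K)

# A pose that is off by 5 cm sideways no longer explains the observations.
t_bad = np.array([0.05, 0.0, 0.0])
err_good = mean_reprojection_error(points, observed, R_true, t_true, K)
err_bad = mean_reprojection_error(points, observed, R_true, t_bad, K)
```

In this toy setup a 5 cm pose error already shifts projections by several pixels, which is enough to blur fine detail once many views are fused.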

Application Areas

- **Robotics**: Depth estimation supports obstacle avoidance, grasping, and navigation. Metric depth and real-time behavior matter. Robots often use depth sensors (LiDAR, etc.) and monocular estimation together to complement each other.

- **AR/VR**: Aligning virtual objects into real space needs depth and 3D structure. Gaussian Splatting's real-time rendering is advantageous for immersive applications. Depth is also needed for occlusion handling (rendering virtual objects hidden behind real ones).

- **3D reconstruction and digital twins**: Reconstruct spaces and objects in 3D from photos for simulation, surveying, and preservation. Used in heritage preservation, construction-site records, and more.

- **Content creation**: Capture live-action scenes in 3D for free-viewpoint video and visual effects. It becomes possible to freely change camera paths after shooting.

- **Medical and industrial inspection**: Reconstruct 3D structure from endoscopic and captured footage to aid diagnosis and measurement. Accuracy and reliability especially matter.

- **Autonomous driving**: Understanding the surrounding environment's depth and 3D structure directly affects safe driving. Fusing camera, LiDAR, and radar is a common approach.

Requirements differ by application. Robots prioritize absolute distance and latency, content creation prioritizes visual quality, and these priorities drive method choice. Even under the same label of "3D vision," the real-time safety demand of autonomous driving and the precise quality demand of heritage scanning lead to entirely different designs.

Connecting Depth and 3D

Depth estimation and 3D representation are not separate but connected. A depth map is each pixel's distance, so knowing camera information lets you turn it back into 3D points (back-projection). Aligning depth maps from several viewpoints yields a dense point cloud, which becomes the starting point of 3D reconstruction.

from depth to 3D (concept)

depth map (each pixel's distance) + camera info

|

back-projection: turn pixels into 3D points

|

align points from several viewpoints

|

dense 3D point cloud --> convert to mesh/representation
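The back-projection and alignment steps can be sketched as follows. This is a minimal example, assuming a pinhole camera with focal lengths `fx`, `fy` and principal point `cx`, `cy`, and a depth map that stores distance along the optical axis (z-depth). The function names are hypothetical.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Turn an HxW z-depth map into an (H*W)x3 point cloud in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row indices
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def to_world(points_cam, R_cam_to_world, t_cam_to_world):
    """Move camera-frame points into a shared world frame to align several views."""
    return points_cam @ R_cam_to_world.T + t_cam_to_world

# A flat wall 2 m in front of the camera (toy values).
depth = np.full((4, 6), 2.0)
points = backproject(depth, fx=100.0, fy=100.0, cx=2.5, cy=1.5)
points_world = to_world(points, np.eye(3), np.array([0.0, 0.0, 1.0]))
```

If the depth map is only relative, the resulting point cloud has the right shape up to the unknown scale and shift, so clouds from different views will not line up until that ambiguity is resolved.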

The reverse direction also exists. From a well-built 3D representation, you can render a depth map for an arbitrary viewpoint. In this way depth (2.5D) and 3D representation go back and forth, and recent systems often handle both together. Attempts to combine monocular depth's strong generalization with multi-view 3D's geometric accuracy are a representative trend.

Situations That Are Commonly Hard

Whether depth estimation or 3D reconstruction, the situations below are commonly tricky. Understanding why they are hard helps interpret results and anticipate failures.

- **Reflective and transparent surfaces**: Mirrors, glass, and water show not the surface itself but the color reflected or transmitted from elsewhere. Both correspondence matching and color/density learning get confused.

- **Textureless regions**: A featureless surface like a white wall has no cue to find "the same point," so stereo and MVS struggle.

- **Thin structures and boundaries**: Thin, complex structures like wire mesh, hair, and tree branches are tricky to both represent and reconstruct.

- **Extreme scale**: Very near or very far objects have weak cues so errors grow. This also ties to monocular's metric scale problem.

- **Lighting changes**: If lighting changes during capture, the same point's color differs, shaking multi-view alignment.

hard situations and their reasons (concept)

reflective/transparent --> visible color is not the surface --> matching/learning confusion

textureless --> no correspondence cue --> stereo/MVS difficulty

thin structures --> representation resolution limit --> breaks or smears

extreme scale --> weak cues --> depth error grows

lighting change --> color mismatch --> multi-view alignment shakes

When you meet such situations, it helps to recognize that the problem itself is hard rather than concluding that "the model is bad." In practice, they are eased with additional sensors (depth sensors), improved capture conditions, and fusion of several methods.

How to Read Benchmarks

There are points to note when reading the numbers in depth and 3D papers.

- **Relative or metric**: When looking at depth accuracy, first check whether it is on a relative-depth or metric-depth basis. The two differ in difficulty.

- **Dataset domain**: Check whether the dataset characteristics (indoor, outdoor, driving, etc.) resemble your application. A model good indoors may not be so outdoors.

- **Context of view-synthesis metrics**: Rendering quality metrics vary greatly with how far the evaluated viewpoints are from the training ones. Views near training viewpoints are generally easier.

Before comparing only numbers, keep the habit of also asking "under what conditions and against what basis was this value measured."
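For reference, here is how a few commonly reported metrics are computed. The article does not name specific metrics, so the choice here (absolute relative error and threshold accuracy for depth, PSNR for view synthesis) is an assumption about what a reader is likely to encounter. The toy arrays are illustrative.

```python
import numpy as np

def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt). Lower is better."""
    return np.mean(np.abs(pred - gt) / gt)

def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) < threshold. Higher is better."""
    ratio = np.maximum(pred / gt, gt / pred)
    return np.mean(ratio < threshold)

def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_value]."""
    mse = np.mean((rendered - reference) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(max_value**2 / mse)

gt = np.array([1.0, 2.0, 4.0, 8.0])
pred = np.array([1.1, 2.0, 3.0, 8.0])
depth_abs_rel = abs_rel(pred, gt)
depth_delta1 = delta_accuracy(pred, gt)

reference = np.zeros((2, 2, 3))
rendered = reference + 0.1
image_psnr = psnr(rendered, reference)
```

For relative-depth evaluation, predictions are typically aligned to ground truth (scale, or scale and shift) before these numbers are computed, which is one reason relative and metric results are not directly comparable.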

Strengths and Limits Summary

| Approach | Strength | Limit |
|---|---|---|
| Monocular depth (general) | Instant from one image, strong generalization | Metric scale ambiguity, errors on hard surfaces |
| Stereo/MVS | Geometry-based high accuracy | Vulnerable to textureless, reflective, occluded areas |
| NeRF | High-quality view synthesis, continuous representation | Slow rendering (acceleration work continues) |
| 3D Gaussian Splatting | Near real-time rendering, easy editing | Memory burden, hard on dynamic/transparent scenes |

Conclusion

3D vision can be seen as two branches of effort to "recover the lost dimension." Depth estimation targets the distance of each pixel; 3D representation and rendering targets the ability to revive the whole scene from arbitrary viewpoints. Monocular depth fills ambiguity with priors, stereo and MVS gain accuracy through geometry, and NeRF and Gaussian Splatting balance quality and speed through the choice of representation (implicit vs. explicit).

The shift from NeRF to 3D Gaussian Splatting in particular is a representative case of pushing "high-quality but slow" rendering into "high-quality and near real-time" rendering. Understanding the mindset of this lineage lets you place future new representations and models far more quickly.

The advice I want to leave for practitioners is this. Rather than hunting for an absolute answer of "the best 3D method," first decide what your application requires (is relative depth enough or is metric depth needed, is real-time needed, is editing needed, is it a static or dynamic scene). Once that requirement is set, choosing and combining the right one among the parts of monocular depth, stereo, MVS, NeRF, and Gaussian Splatting becomes far clearer. Most of the difficulty of 3D vision is a matter of "what you can give up," and clarifying that choice is the starting point of a good system.

References

- NeRF paper "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis": [arxiv.org/abs/2003.08934](https://arxiv.org/abs/2003.08934)

- 3D Gaussian Splatting paper: [arxiv.org/abs/2308.04079](https://arxiv.org/abs/2308.04079)

- 3D Gaussian Splatting project page: [repo-sam.inria.fr/fungraph/3d-gaussian-splatting](https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/)

- Depth Anything paper: [arxiv.org/abs/2401.10891](https://arxiv.org/abs/2401.10891)

- MiDaS (general monocular depth) paper: [arxiv.org/abs/1907.01341](https://arxiv.org/abs/1907.01341)

- COLMAP (SfM/MVS) project: [colmap.github.io](https://colmap.github.io)

- Vision Transformer (ViT) paper: [arxiv.org/abs/2010.11929](https://arxiv.org/abs/2010.11929)

- Depth Anything code: [github.com/LiheYoung/Depth-Anything](https://github.com/LiheYoung/Depth-Anything)

Quiz

Because the same 2D image can arise from many different 3D scenes. Depth information vanishes during projection, so the model must fill the ambiguity with learned priors.

Relative depth gives only ordering and ratio (A is closer than B) with no absolute unit. Metric depth gives distance in actual physical units, but is harder to obtain monocularly due to scale ambiguity.

Using large amounts of unlabeled images (e.g., a student learning from teacher predictions) and placing a depth head atop a strong backbone to obtain relative depth that generalizes well to unseen scenes.

Because it triangulates from the disparity between two cameras. Once correspondences are found, depth can be computed directly from disparity, camera baseline, and focal length, so ambiguity is small.

An implicit (network) representation mapping coordinates and view direction to color and density. View synthesis quality is high, but rendering is slow because the network must be evaluated many times per ray.

It represents a scene as a set of many 3D Gaussians (explicit blobs) rather than a neural function (implicit). Projecting and compositing these onto the screen makes rendering fast and editing intuitive.

A large number of Gaussians burdens memory. Handling reflections, transparency, and dynamic scenes also needs further research, and detailed quality and speed vary with scene and implementation.

Robots prioritize absolute distance (metric depth) and low latency, while content creation prioritizes visual quality. Such priority differences drive whether to use monocular, stereo, NeRF, or Gaussian Splatting.

If the front is opaque, the color of the rear point is mostly hidden; if the front is transparent, the rear shows through. Because the final pixel color is a weighted sum of each point's color reflecting this front-back occlusion.

Implicit (NeRF) is compact and continuous but slow to render and hard to edit. Explicit (Gaussian Splatting) is fast and intuitive to edit but burdens memory when the number of blobs grows.

Because if poses are inaccurate, no matter how good the rendering method, the result comes out blurry or misaligned. So managing overlap, angle, and shake at the capture stage governs final quality.

Because mirrors, glass, and water show not the surface itself but the color reflected or transmitted from elsewhere. This confuses both correspondence matching and color/density learning.

A depth map is each pixel's distance, so knowing camera information lets you back-project each pixel into a point in 3D space. Aligning points from several viewpoints yields a dense 3D point cloud, the starting point of reconstruction.
