Chaos and Order

Introduction

A long-standing bottleneck in robot learning is data. For a robot to learn by trial and error, real hardware must run for a long time, and collecting demonstrations by teleoperating it one by one is slow and expensive. Watching language models leap forward on the internet's vast text, robotics researchers ask a natural question: "Could a robot learn from the human videos overflowing on the internet?"

YouTube holds a near-infinite pile of videos of people cooking, assembling, and organizing. Inside them lies rich knowledge of "how a hand approaches an object, in what order it manipulates, and toward what goal." The trouble is that a human body and a robot body differ, and that a video does not spell out the action commands for a robot to imitate.

This article looks at how a robot learns from human video. We cover what can be learned (affordances, trajectories, goals), what is hard (the domain gap), how it is overcome (representation learning, pre-training, imitation), and how it combines with robot data. Throughout, we describe the approaches actually in use and state their limits plainly.

Why This Question Now

The idea itself is not new. "Learning from observation" is an old topic in robotics research. Yet lately this question has grown especially hot, for three reasons.

First, the success of language models offered a powerful analogy. Language models are pre-trained on vast unlabeled text to absorb world knowledge, then adapted to specific tasks with little data. If robots can walk the same path, unlabeled human video could be that "vast pre-training data."

Analogy with language models

language model: vast text (no labels) ──▶ pre-train ──▶ adapt with small task data

│

robot: vast human video (no labels) ──▶ pre-train ──▶ adapt with small robot data

│

same idea: "learn broadly, tune narrowly"

Second, the tools matured. Hand-pose estimation, object detection, and video-understanding models got good enough that extracting useful signals from video became practical. Third, the robot-data bottleneck grew more pressing. Training a large model like a VLA needs vast data, and robot demonstrations alone struggle to fill that scale. So expectations for cheap human video grew.

As these three currents overlapped, "robots that learn from human video" moved from an old dream to active current research.

What Can Be Learned from Human Video

Human video has no action labels, yet it still carries useful signals at several layers.

Signals you can extract from one scene of human video

┌──────────────────────────────────────┐

│ Affordance: this cup's handle "can be │

│ grasped" │

│ Trajectory: the hand moved A→B like this│

│ Goal: the point was to pour water in │

│ Order: open lid → pour → close steps │

│ Contact: when hand touches/leaves object│

└──────────────────────────────────────┘

- Affordance: knowledge of what interactions an object permits. A handle is grasped, a button is pressed, a drawer is pulled. Watching human video teaches which part is handled how.

- Trajectory: the path a hand or object traces in space. It does not map directly to robot joint angles, but goal-level information of "what moved where" can be carried over.

- Goal and order: the final state and intermediate steps of a video tell the robot "what must be achieved."

The key is transferring high-level knowledge (what, in what order, to where) rather than low-level commands (move which joint by how much).

How the signals are actually extracted

Extracting these signals from video uses already-mature computer-vision tools. Hand pose comes from hand pose estimation, object position and class from detection and segmentation, and contact between hand and object is inferred by combining the two. The signals pulled out this way become the material for robot learning.

Human video → signal-extraction pipeline

raw frame ──▶ hand pose estimation ──▶ hand trajectory (3D hand position over time)

├─▶ object detect/segment ──▶ what is handled, and where

└─▶ contact inference ──▶ when it grasps and releases (grasp events)

result: a structured record of "when the hand handled which object how"

The hand trajectory that comes out does not map directly to the robot gripper's trajectory, but it provides the skeleton of "where it approaches and where the grasp happens." Filling that skeleton with the robot's own low-level control is the work of later stages.
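As a rough illustration of the contact-inference step, here is a minimal sketch. It assumes an upstream hand-pose estimator and object detector have already produced one 3D position per frame for the hand and for the object. The function name `detect_grasp_events`, the distance threshold, and the toy tracks are all hypothetical and not taken from any particular library.

```python
import math

def detect_grasp_events(hand_track, object_track, contact_dist=0.05):
    """Return (frame_index, event) pairs where contact starts or ends.

    hand_track / object_track: lists of (x, y, z) positions in metres,
    one per frame, assumed to come from upstream estimators.
    contact_dist is a hypothetical threshold; real systems would tune it.
    """
    events = []
    in_contact = False
    for i, (hand, obj) in enumerate(zip(hand_track, object_track)):
        close = math.dist(hand, obj) < contact_dist
        if close and not in_contact:
            events.append((i, "grasp"))
        elif not close and in_contact:
            events.append((i, "release"))
        in_contact = close
    return events

# Toy tracks: the hand approaches a static object, touches it, then leaves.
hand = [(0.30, 0, 0), (0.10, 0, 0), (0.02, 0, 0), (0.01, 0, 0), (0.20, 0, 0)]
obj = [(0.0, 0, 0)] * 5
events = detect_grasp_events(hand, obj)
print(events)
```

A plain distance threshold is the simplest possible rule. It would likely need smoothing or a learned contact classifier on noisy real video.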

Turning affordances into a map

Affordances are often learned as a heatmap: a map that colors each region of the image by the probability that "you can grasp here." By observing in video which part of an object a person actually grasps, you can assign a high affordance score to that part and learn it.

Affordance heatmap (concept)

cup image affordance map

┌────────┐ ┌────────┐

│ ▢▢ │ │ .. │ . = low (hard to grasp)

│ ▢▢█ │ ──▶ │ ..## │ # = high (handle, good to grasp)

│ handle │ │ ### │

└────────┘ └────────┘

the robot prioritizes the # region as a grasp candidate

This affordance map serves as prior knowledge for deciding "where to grasp" when the robot meets a new object. Because the map is learned from video of people handling countless objects, the robot can guess plausible grasp points even for objects it has never seen.
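To make the heatmap idea concrete, the sketch below builds one by simply counting where grasps were observed on a coarse grid. This is an assumption-laden toy: practical systems typically train a network to predict the map from the image, and the names `build_affordance_map` and `best_grasp_cell` are hypothetical.

```python
def build_affordance_map(contact_points, width, height):
    """Accumulate observed grasp locations into a normalized grid.

    contact_points: list of (col, row) cells where a hand was seen
    grasping the object (assumed output of contact inference).
    Returns a height x width grid of scores in [0, 1].
    """
    counts = [[0] * width for _ in range(height)]
    for col, row in contact_points:
        if 0 <= col < width and 0 <= row < height:
            counts[row][col] += 1
    peak = max(max(line) for line in counts)
    if peak == 0:
        return [[0.0] * width for _ in range(height)]
    return [[c / peak for c in line] for line in counts]

def best_grasp_cell(affordance_map):
    """Return the (col, row) cell with the highest affordance score."""
    best = max(
        ((score, col, row)
         for row, line in enumerate(affordance_map)
         for col, score in enumerate(line)),
        key=lambda item: item[0],
    )
    return best[1], best[2]

# Hypothetical observations: most grasps land on the handle at cell (3, 1).
observed = [(3, 1), (3, 1), (3, 1), (2, 1), (3, 2)]
heatmap = build_affordance_map(observed, width=4, height=3)
grasp_cell = best_grasp_cell(heatmap)
print(grasp_cell)
```

The robot would treat the highest-scoring cell as its first grasp candidate and fall back to lower-scoring cells if that grasp is infeasible.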

The Domain Gap — The Biggest Wall

The core difficulty of learning from human video is the domain gap. People and robots differ in many ways.

Human demo                       Robot execution

┌──────────────────────┐         ┌────────────────────────────┐

│ form: five fingers   │  gap 1  │ form: 2-finger gripper     │

│ view: 1st/3rd person │  gap 2  │ view: robot camera         │

│ speed/rhythm: human  │  gap 3  │ speed: control cycle       │

│ action label: none   │  gap 4  │ action: joint cmd required │

└──────────────────────┘         └────────────────────────────┘

how to bridge these gaps is the heart of the research

- Embodiment gap: a human hand has five fingers with many joints, but a robot gripper often has two fingers. Delicate human finger motion cannot be transferred as is.

- Viewpoint gap: video is from a human's eyes or a third-party view, but the robot sees through its own camera. The same scene looks entirely different.

- Action gap: video has only pixels, no joint commands for the robot to execute. This "absence of action labels" is the most fundamental.

Because of these gaps, you cannot make a robot directly imitate human video. So research has developed several strategies to route around or bridge the gaps.

Ways of handling the embodiment gap

Responses to the embodiment gap fall broadly into three branches.

Embodiment-gap strategies

1) Retarget the hand to the robot hand

human hand joints ──▶ map to robot gripper shape

(not perfect, but grasp points and approach directions carry over)

2) Ignore the hand itself, be object-centric

focus on "how the object moved," not "the hand"

──▶ the robot reproduces the same object change in its own way

3) Take only the goal state

set the demo's final/intermediate state as the goal

──▶ the robot reaches that goal with its own body

The object-centric view in particular sidesteps the embodiment gap elegantly. "How the human hand moved" differs from the robot, but the result "the cup was moved from the table to the shelf" is the same whether robot or human. Focusing on the result makes the difference in body less important.
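A minimal sketch of the object-centric view follows. It summarizes a demonstration only by where the object started and ended, then checks whether a robot execution produced the same result. The function names, the tolerance, and the coordinates are hypothetical, and the object track is assumed to come from an upstream tracker.

```python
import math

def object_goal_from_demo(object_track):
    """Summarize a demo by the object's start and end positions.

    The hand trajectory is deliberately ignored: only the change
    in the object's state is kept.
    """
    return {"start": object_track[0], "end": object_track[-1]}

def goal_reached(object_position, goal, tolerance=0.03):
    """True if the object ended up within `tolerance` metres of the demo's end."""
    return math.dist(object_position, goal["end"]) <= tolerance

# Human demo: a cup moves from the table to a shelf (toy coordinates).
demo_object_track = [(0.4, 0.0, 0.0), (0.4, 0.1, 0.3), (0.4, 0.2, 0.5)]
goal = object_goal_from_demo(demo_object_track)

# The robot may take any path with any gripper; only the outcome is compared.
robot_success = goal_reached((0.41, 0.21, 0.5), goal)
robot_failure = goal_reached((0.4, 0.0, 0.0), goal)
print(robot_success, robot_failure)
```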

The viewpoint gap and domain adaptation

For the viewpoint gap, domain adaptation techniques are used. These techniques align representations so that human-view video and robot-view video sit in "the same feature space." Then what was learned from human video works from the robot's viewpoint too.

Viewpoint alignment (concept)

human-view features ─┐

├──▶ shared feature space ◀── representation with viewpoint erased

robot-view features ─┘ │

▼

knowledge learned here works for both

When this alignment goes well, knowledge learned from a cooking video shot in third person becomes useful on the robot's first-person camera too. But if the viewpoint difference is extreme (e.g. an overhead video vs a fingertip camera), alignment gets hard and performance drops.
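One common way to pull two views into a shared space is a contrastive objective. The sketch below uses an InfoNCE-style loss, which is our choice for illustration and not a method the article attributes to any specific system. It assumes paired features, where `human_feats[i]` and `robot_feats[i]` describe the same moment from two viewpoints. Such pairing is itself a strong assumption.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def alignment_loss(human_feats, robot_feats, temperature=0.1):
    """InfoNCE-style loss over paired features.

    The matching robot-view feature is the positive for each human-view
    feature; every other robot-view feature acts as a negative.
    """
    total = 0.0
    for i, h in enumerate(human_feats):
        logits = [cosine(h, r) / temperature for r in robot_feats]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(human_feats)

human_view = [[1.0, 0.0], [0.0, 1.0]]
aligned = alignment_loss(human_view, [[0.9, 0.1], [0.1, 0.9]])
misaligned = alignment_loss(human_view, [[0.1, 0.9], [0.9, 0.1]])
print(aligned, misaligned)
```

The loss is small when paired views already sit close together and large when they are swapped, so minimizing it pushes an encoder toward a representation with the viewpoint factored out.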

Representation Learning and Pre-Training

The most widely used strategy is to first learn a good visual representation from human video. Instead of imitating actions directly, you pre-train the ability to understand video, then fine-tune actual manipulation on top of it with a small amount of robot data.

Two-stage learning strategy

Stage 1: pre-train representation on large-scale human video

┌────────────────────────────┐

│ web video ──▶ train encoder │ "understand the world"

│ (features of objects/hands/ │

│ motion) │

└────────────────────────────┘

│ transfer learned representation

▼

Stage 2: fine-tune with a small amount of robot data

┌────────────────────────────┐

│ robot demo ──▶ learn policy │ "actual manipulation"

│ (connect to joint commands) │

└────────────────────────────┘

The advantage of this approach is clear. Use little of the expensive robot data and a lot of the cheap human video. Since the pre-trained representation already knows "what an object is and how a hand moves," learning actual manipulation is far faster. It is the language-model pre-train-then-fine-tune paradigm carried over to robots.

Relatedly, there is research that learns visual-language features from human video and uses them as the backbone of a robot policy. That said, which representation is truly useful for manipulation is still an actively explored question, and there is no cure-all.
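The two-stage strategy can be sketched in miniature as a frozen encoder plus a small trainable head. Everything here is a stand-in: `frozen_encoder` is a hand-written feature map playing the role of a network pre-trained on human video, the policy head is linear, and the four robot demos are invented for illustration.

```python
def frozen_encoder(observation):
    """Stand-in for a visual encoder pre-trained on human video.

    It is never updated in stage 2. Here it is a fixed feature map.
    """
    x, y = observation
    return [x, y, 1.0]

def fit_policy_head(demos, lr=0.1, epochs=500):
    """Stage 2: fit a linear head from frozen features to a 1-D action,
    using plain stochastic gradient descent on squared error."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for obs, action in demos:
            feats = frozen_encoder(obs)
            pred = sum(w * f for w, f in zip(weights, feats))
            err = pred - action
            weights = [w - lr * err * f for w, f in zip(weights, feats)]
    return weights

def policy(observation, weights):
    """Run the frozen encoder, then the learned head."""
    feats = frozen_encoder(observation)
    return sum(w * f for w, f in zip(weights, feats))

# Four hypothetical robot demos whose action follows 2*x - y + 0.5.
robot_demos = [((0.0, 0.0), 0.5), ((1.0, 0.0), 2.5),
               ((0.0, 1.0), -0.5), ((1.0, 1.0), 1.5)]
head = fit_policy_head(robot_demos)
predicted = policy((0.5, 0.5), head)
print(round(predicted, 3))
```

Only three numbers are learned from robot data here because the encoder already supplies usable features. On a much larger scale, that is the argument for pre-training.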

One-Shot / Few-Shot Imitation

Another fascinating direction is one-shot or few-shot imitation. When a person shows a new task once (or a few times), the goal is for the robot to imitate it right away.

The ideal of one-shot imitation

person demos once ──▶ robot reproduces immediately

┌──────────┐ ┌──────────┐

│ "fold it │ │ robot does│

│ like this"│ ─────▶ │ similar │

└──────────┘ │ task │

└──────────┘

(has learned "how to learn" over many tasks in advance)

For this to work, the robot must have learned in advance, across many tasks, "how to watch a demonstration and imitate it" (the idea of meta-learning). Then a single demonstration of a new task suffices for generalization. In reality it still works well only within a limited range of tasks, and performance drops as the embodiment and viewpoint gaps grow. Even so, the direction of "show once, imitate" has the potential to greatly broaden robot use, so it is steadily researched.

Goal-Conditioned Learning — Using Human Video as a Goal

Another useful view is to see human video as a source of goals. Give the robot a goal, "make this state," and the robot finds its own way to reach it. Human video provides exactly that: a rich set of examples of the "desired state."

Goal-conditioned policy

goal frame from human video ──▶ "make this state"

│

▼

current state + goal state ──▶ policy ──▶ action

│

▼

closer to the goal? ── no ──▶ repeat

└─ yes ──▶ done

The advantage of this approach is that no action labels are needed. Even if you do not know "what command was issued" from human video, you can read "what state was desired (the goal)" from the frame. The robot separately learns, from robot data or trial and error, how to achieve that goal with its own body.
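The loop in the diagram can be sketched as follows. To stay runnable, the state and goal are 2D points and the policy is a hand-written rule that steps toward the goal. In a real system the goal would be something like an embedding of a frame taken from human video, and the policy would be learned. All names and constants here are hypothetical.

```python
import math

def goal_conditioned_policy(state, goal, max_step=0.1):
    """Toy policy: move toward the goal, at most max_step per action."""
    dx, dy = goal[0] - state[0], goal[1] - state[1]
    dist = math.hypot(dx, dy)
    if dist <= max_step:
        return (dx, dy)
    scale = max_step / dist
    return (dx * scale, dy * scale)

def run_until_goal(state, goal, tolerance=1e-3, max_iters=100):
    """Repeat: observe, act, check 'closer to the goal?' until done."""
    steps = 0
    while math.dist(state, goal) > tolerance and steps < max_iters:
        action = goal_conditioned_policy(state, goal)
        state = (state[0] + action[0], state[1] + action[1])
        steps += 1
    return state, steps

final_state, steps_taken = run_until_goal(state=(0.0, 0.0), goal=(0.3, 0.4))
print(final_state, steps_taken)
```

Note that no action from the demonstration is used anywhere: the goal alone drives the loop, which is why missing action labels stop being a blocker.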

Human video as a reward signal

Going one step further, there is an approach that uses human video to define the reward for reinforcement learning. It measures how much the robot's current state resembles the progress of the human demonstration, giving higher reward the more it resembles. The robot is thereby pushed toward doing as the human did.

Human-video-based reward (concept)

human demo progress: [start]──[middle]──[done]

│ │ │

compare robot state: similarity measure

│

▼

reward = similarity to human progress ──▶ higher = "following well"

This approach is appealing because a person need not hand-design a detailed reward function. But choosing what to measure similarity on is tricky, and a poor choice risks the robot finding a shortcut that is only superficially similar to the demonstration. This touches the specification gaming covered in the earlier safety and alignment article.
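One simple way to turn "similarity to human progress" into a number is sketched below. It finds the demo frame whose features best match the robot's current features and rewards the robot by how far along the demo that frame is. The feature vectors are invented, and the scoring rule is one assumption among many possible designs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def progress_reward(robot_feat, demo_feats):
    """Reward = progress fraction of the best-matching demo frame,
    scaled by how good the match is.

    demo_feats: features of demo frames ordered from start to done
    (at least two frames are required).
    """
    sims = [cosine(robot_feat, d) for d in demo_feats]
    best = max(range(len(sims)), key=lambda i: sims[i])
    progress = best / (len(demo_feats) - 1)
    return progress * max(sims[best], 0.0)

# Hypothetical demo features for [start], [middle], [done].
demo = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
near_start = progress_reward([1.0, 0.1], demo)
near_done = progress_reward([0.1, 1.0], demo)
print(near_start, round(near_done, 3))
```

The weakness discussed above is visible even here: any robot state whose features happen to resemble the final frame earns high reward, whether or not the task was really completed.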

Web-Video Scale-Up

Just as language and image models leaped forward with data scale, robots chase the dream of web-video scale-up. The hope is that pulling the internet's vast human-activity video into training will yield broad world knowledge beyond narrow robot datasets.

Data-scale comparison (conceptual)

robot demos ▓▓ small but accurate (has action labels)

human demo video ▓▓▓▓▓▓ medium (direct manipulation, no labels)

all web video ▓▓▓▓▓▓▓▓▓▓▓▓▓▓ vast (diverse, very noisy)

└── large in scale, but hard to use directly, so needs curation/connection

But scale is not the same as usefulness. Web video is diverse yet noisy, and most of it is irrelevant to robot tasks. Selecting relevant manipulation scenes, extracting useful signals (hand pose, object interaction), and transferring across the domain gap are all hard problems. So in practice, rather than training a robot on web video alone, the mainstream is a combination that gains broad representations from web video and connects accurate actions from robot data.
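A first curation pass can be as simple as filtering clips on signals that earlier extraction steps already produced. The sketch below assumes each clip carries a hand-visibility ratio and a count of detected contact events. The `Clip` fields, thresholds, and clip names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    clip_id: str
    hand_visible_ratio: float  # fraction of frames with a detected hand
    contact_events: int        # grasp/release events found upstream

def select_manipulation_clips(clips, min_hand_ratio=0.5, min_contacts=1):
    """Keep clips that plausibly show a hand manipulating an object."""
    return [
        clip for clip in clips
        if clip.hand_visible_ratio >= min_hand_ratio
        and clip.contact_events >= min_contacts
    ]

web_clips = [
    Clip("cooking_01", hand_visible_ratio=0.9, contact_events=6),
    Clip("landscape_07", hand_visible_ratio=0.0, contact_events=0),
    Clip("talking_head_03", hand_visible_ratio=0.6, contact_events=0),
]
kept_ids = [clip.clip_id for clip in select_manipulation_clips(web_clips)]
print(kept_ids)
```

Filters like this only remove the obviously irrelevant material. Copyright, privacy, and quality review still need separate handling.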

Combining with Robot Data

The most practical trend is co-training that uses human video and robot data together. It combines the strengths of broad but imprecise human data and narrow but accurate robot data.

Structure of combined learning

web/human video ─────┐

(broad world knowledge)│

▼

┌───────────┐

│ co-training │──▶ robot policy that generalizes well

└───────────┘

▲

robot demos ─────────┘

(accurate action commands)

This combination stands out in the recent vision-language-action (VLA) model trend. Co-fine-tuning, which trains a model pre-trained on the web's visual-language data together with robot trajectories, is an attempt to gain broad world understanding and concrete manipulation ability at once. Representative examples in this direction include RT-2, which connects web-scale visual-language knowledge to robot actions; Open X-Embodiment, which pools data across many robots; and OpenVLA, an open VLA model. These show a large trend of sharing vast prior knowledge instead of learning from scratch on every robot. (Concrete performance and features may vary by version and configuration.)
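At the data-loading level, co-training often comes down to mixing the two sources in each batch at a chosen ratio. The sketch below shows only that mixing step. The function name, the ratio, and the tagged placeholder samples are hypothetical, and it does not reproduce how RT-2, OpenVLA, or any other system actually builds its batches.

```python
import random

def mixed_batches(human_samples, robot_samples, robot_fraction,
                  batch_size, num_batches, seed=0):
    """Yield batches with a fixed share of robot samples.

    Human-video samples carry no action labels, so a training loop would
    typically apply a different loss to them (an assumption, not shown here).
    """
    rng = random.Random(seed)
    n_robot = round(batch_size * robot_fraction)
    n_human = batch_size - n_robot
    for _ in range(num_batches):
        batch = (rng.choices(robot_samples, k=n_robot)
                 + rng.choices(human_samples, k=n_human))
        rng.shuffle(batch)
        yield batch

# Placeholder samples tagged by source.
human = [("human", i) for i in range(100)]
robot = [("robot", i) for i in range(10)]
batches = list(mixed_batches(human, robot, robot_fraction=0.25,
                             batch_size=8, num_batches=3))
robot_counts = [sum(1 for source, _ in batch if source == "robot")
                for batch in batches]
print(robot_counts)
```

Sampling with replacement lets the small robot set appear in every batch even though it is far outnumbered. `robot_fraction` is the knob a team would tune by experiment.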

The layers of data

The data used in human-video learning splits into several layers by character. Each layer trades off scale and accuracy differently.

| --- | --- | --- | --- |

The key of this table is that "higher up is larger but less accurate, lower down is smaller but more accurate." A good system climbs these layers like a ladder. It understands the world at the broad layer and masters accurate actions at the narrow layer.

Stacking like a curriculum

Arranging these layers as a learning order forms a kind of curriculum.

Learning curriculum (broad and cheap first → narrow and expensive)

[web visual-language]: world and language understanding

──▶ [egocentric video]: hand and object interaction

──▶ [human demos]: affordances and trajectories

──▶ [robot demos]: accurate actions

each stage becomes the foundation for the next

There is an intuitive reason for this order. Learning directly from robot data while knowing nothing about the world and objects is inefficient, because you must learn too much from little data. But after building the basics on broad data, robot data only needs to handle the final "action connection," so far less of it suffices.

A Feel Through Examples

To make the concept concrete, a typical pipeline by which human-video signals flow into robot learning can be summarized as follows.

Typical pipeline

1) Collect: gather a large volume of human manipulation video

2) Extract: pull hand-pose, object, contact, trajectory signals

3) Represent: pre-train a visual encoder on these signals

4) Transfer: adapt the representation to the robot's camera view

5) Combine: learn actions with a small amount of robot demos

6) Deploy: verify and correct in the real environment

At each stage in this flow, a bit of the domain gap is bridged. No stage is perfect, so real systems stack several stages to cover each other's weaknesses.

How Do You Evaluate the Effect

To show whether human video really helped, you need a fair comparison. The most common method is a controlled experiment. You pit a robot that used human video against one that did not, on the same task, and compare success rate, learning efficiency, and generalization.

Fair-comparison design (ablation)

condition A: learn from robot data only

condition B: human-video pre-training + robot data

│

▼

evaluate with the same task and same amount of robot data

│

▼

measure: success rate / data required / generalization to new objects

if B beats A → evidence that human video contributed

Several metrics are used together. Task success rate is the most direct; data efficiency (the number of robot demos needed for the same performance) shows the practical value of human video; and generalization looks at success on objects and arrangements not in training. Generalization matters especially, because the real promise of human video is "broad world knowledge."

There is a pitfall to watch. If you show success with just a few well-chosen demo videos, it is hard to tell whether it is broad capability or narrow overfitting. So evaluation should be reported honestly: on conditions the robot has not seen, repeated many times, including failures.
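When comparing condition A against condition B, it helps to report an interval around each success rate and not just the rate itself. The sketch below computes a 95% Wilson score interval for each condition. The trial counts are invented purely for illustration and are not results from any experiment.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a success rate (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

# Hypothetical counts on held-out objects, with failures included.
a_low, a_high = wilson_interval(successes=20, trials=50)  # robot data only
b_low, b_high = wilson_interval(successes=38, trials=50)  # + human-video pre-training

# Non-overlapping intervals are a conservative sign of a real difference.
clearly_better = b_low > a_high
print((round(a_low, 3), round(a_high, 3)),
      (round(b_low, 3), round(b_high, 3)), clearly_better)
```

With only a handful of trials the intervals become very wide, which is a concrete reason to repeat evaluations many times before claiming that human video helped.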

Considerations in a Practical Workflow

There are real-world considerations when actually running human-video learning.

- Data-curation cost: web video cannot be used as is. Selecting relevant scenes, reviewing copyright and privacy, and extracting signals take considerable effort.

- Compute resources: large-scale video pre-training is heavy. For many teams, taking a published pre-trained model is more realistic.

- Balance with robot data: mixing in too much human video blurs the robot's accurate actions, and too little fails to gain broad knowledge. Tuning the ratio is subtle.

- Safety verification: actions learned from human video must still pass through the safety layer covered in the earlier safety article. "As a person did it" is not always safe.

The reality of a practical pipeline

ideal: web video ──▶ magic ──▶ capable robot

reality: web video ──▶ [curate] ──▶ [pre-train or existing model]

──▶ [combine robot data] ──▶ [safety layer] ──▶ [field verify]

│

each stage costs human judgment and money

Limits and Open Questions

- The embodiment gap is fundamental. There are physical limits to transferring the delicacy of a human hand onto a two-finger gripper.

- The absence of action labels is still a large wall. Inferring accurate robot commands from video is inherently ambiguous.

- The reliability of transfer is a problem. How well a representation carries over depends heavily on the task and environment.

- Evaluation is hard. Fairly measuring "did human video really help" is itself a research challenge.

- Beware of overinterpretation. An impressive demo does not mean general capability. Do not mistake success under narrow conditions for generalization.

When It Shines and When It Dims

Human-video learning is not a cure-all; it fits some problems and not others. Knowing when the effect is large and when it is hard is the crux of practice.

Fit of human-video learning

Well-suited problems              Hard problems

┌─────────────────────────────┐   ┌────────────────────────────────┐

│ things people commonly do   │   │ robot-specific precision tasks │

│ (cook, tidy, manipulate)    │   │ (things people do not do)      │

│ object-centric goals        │   │ fine force control             │

│ broad object generalization │   │ human-hand dexterity           │

└─────────────────────────────┘   └────────────────────────────────┘

For things people commonly do in daily life, and things whose goal can be expressed by an object's state change, human video shows great power, because examples overflow on the internet and the embodiment gap can be sidestepped object-centrically. Conversely, for robot-specific tasks people rarely do (certain motions of precision assembly, etc.) or dexterity that requires a human hand's five fingers, the help from human video is limited.

This distinction matters in practice. Rather than forcing human video into every task, applying it to tasks where it is truly advantageous saves resources. Good engineering distinguishes "can use" from "should use."

The Three Branches at a Glance

Summarizing the kinds of learning signal covered so far, ways of using human video split broadly into three branches.

| --- | --- | --- | --- |

The three approaches are not exclusive and are often mixed. You can combine them: build a foundation with representation, give direction with goals, and refine details with reward. Which combination is best depends on the task and available data, so rather than picking a "right answer," comparing several ways by experiment is the practical attitude.

Closing

A robot that learns from human video is an appealing dream. If the knowledge of human activity piled up on the internet can be conveyed to robots past the bottleneck of expensive robot data, robot learning may achieve a leap like the one language models saw.

At the same time, this path is laid with fundamental gaps in embodiment, viewpoint, and action. The current practical approach lies in a combination that gains broad representations from human video and connects accurate actions from robot data. The co-training trend of VLA models is its representative evidence. There is no finished solution yet, but the direction is clear. How well we weave together cheap and vast human knowledge with expensive but accurate robot experience — on this hinges the future of web-scale robot learning.

One thread runs through these three robot articles. Making a robot capable and making that capability something we can understand and trust are not separate matters. When the three axes of seeing the world well (perception), acting safely (safety and alignment), and learning from humanity's vast experience (human-video learning) grow together, a robot finally becomes a useful and trustworthy presence at our side. Technology moves fast, but keeping the balance of these three axes is the condition of lasting progress.

References

- [RT-2: Vision-Language-Action Models (arXiv: 2307.15818)](https://arxiv.org/abs/2307.15818)

- [Open X-Embodiment (arXiv: 2310.08864)](https://arxiv.org/abs/2310.08864)

- [OpenVLA (arXiv: 2406.09246)](https://arxiv.org/abs/2406.09246)

- [R3M: visual representations for robots (arXiv: 2203.12601)](https://arxiv.org/abs/2203.12601)

- [Ego4D: large-scale egocentric video dataset (arXiv: 2110.07058)](https://arxiv.org/abs/2110.07058)

- [DROID robot dataset (arXiv: 2403.12945)](https://arxiv.org/abs/2403.12945)

- [Physical Intelligence official site](https://www.physicalintelligence.company/)

- [ROS (Robot Operating System) official docs](https://docs.ros.org/)