Chaos and Order


Introduction

In language, a single large model handles translation, summarization, coding, and conversation. In image generation, too, one model produces widely varied images. The success of foundation models ("one large model that broadly does many jobs") naturally raised the same question for robots.

> Are foundation models possible for robots too? Can one policy do many tasks across many robots?

Traditional robot learning has mostly been specialized to "one robot, one task." A policy for picking up a cup, a policy for opening a door, a policy for closing a drawer — each was made separately, and when the robot changed, learning had to start over from scratch. **Robot foundation models** are an attempt to cross this wall and handle many situations with a single **generalist policy**.

This article covers what a generalist policy is; the large-scale robot data that makes it possible (especially Open X-Embodiment); cross-embodiment, which handles different robots as one; the relationship to VLA models, which handle vision, language, and action together; and scaling and the remaining challenges. The field moves fast, so specific performance figures and architectures may differ between announcements and model generations.

What Is a Generalist Policy

A **policy** is a function by which a robot takes observations as input and outputs actions. Seeing a camera image and an instruction, it decides where to move the arm next.

A conventional **specialist policy** is optimized for one specific task. It works well but scales poorly. If there are 100 tasks, you need 100 policies, and if the number of robot types grows, it multiplies further.

A **generalist policy** is different. One policy handles many tasks and (in some cases) many robots. What to do is mostly conveyed by **language instruction**.

┌──────── specialist (separate per task) ────────┐

│ policy A: pick up cup │

│ policy B: open door │

│ policy C: close drawer ... grows with tasks │

└─────────────────────────────────────────────────┘

┌──────── generalist (all in one) ────────────────┐

│ │

│ instruction: "pick up the red cup, put in │

│ the drawer" │

│ observation: camera image + robot state │

│ │ │

│ ▼ │

│ [ one large policy ] ──▶ action (arm/gripper) │

│ │

│ the same policy attempts dozens~hundreds tasks │

└──────────────────────────────────────────────────┘

The core idea is the same as for language models: with enough scale (data and model) and enough diversity, the hope is that one policy generalizes broadly without each task being programmed individually.
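To make the contrast between specialist and generalist policies concrete, here is a minimal sketch of the two interfaces. Everything in it is a hypothetical stand-in: the `Observation` fields, the four-number action layout, and the keyword rules inside `generalist_policy` merely take the place of a large learned model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Observation:
    image: Sequence[float]        # stand-in for camera pixels
    robot_state: Sequence[float]  # e.g. joint angles, gripper opening


Action = List[float]  # assumed layout: [dx, dy, dz, gripper]
SpecialistPolicy = Callable[[Observation], Action]


def make_specialists() -> Dict[str, SpecialistPolicy]:
    """Specialist setup: one separate policy per task, looked up by name."""
    return {
        "pick_up_cup": lambda obs: [0.1, 0.0, 0.0, 1.0],
        "open_door": lambda obs: [0.0, 0.1, 0.0, 0.0],
    }


def generalist_policy(obs: Observation, instruction: str) -> Action:
    """Generalist setup: one policy, told what to do by a language instruction.

    The keyword rules are placeholders for a learned model.
    """
    text = instruction.lower()
    gripper = 1.0 if "pick" in text else 0.0
    dx = 0.1 if "cup" in text else 0.0
    dy = 0.1 if "door" in text else 0.0
    return [dx, dy, 0.0, gripper]


obs = Observation(image=[0.0] * 4, robot_state=[0.0] * 7)
specialists = make_specialists()

print(specialists["pick_up_cup"](obs))                 # needs a policy per task
print(generalist_policy(obs, "pick up the red cup"))  # same function for any task
```

The point of the sketch is the signature: the specialist table needs a new entry for every task, while the generalist takes the task as an input.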

Large-Scale Robot Data: Open X-Embodiment

The fuel of foundation models is data. Language models grew fed on the vast text of the internet. But robot data is not lying around on the internet. Demonstration data of real robots picking up and moving objects must be collected by people one by one, and the format differs from robot to robot.

An important attempt at this problem is **Open X-Embodiment** (arXiv 2310.08864). Multiple research institutions gathered the robot datasets they each held into one common format, creating a large-scale data collection spanning numerous robots and tasks.

datasets from many institutions and many robots

┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐

│robot A │ │robot B │ │robot C │ │ ... │

│ data │ │ data │ │ data │ │ │

└───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘

└──────────┴──────────┴──────────┘

│ unified into a common format

▼

┌──────────────────────────────┐

│ Open X-Embodiment data │

│ (trajectory collection over │

│ varied robots and tasks) │

└──────────────┬───────────────┘

│ train one policy on top

▼

a generalist policy handling many robots

Policies trained on this data (e.g., the RT-X family) tended to generalize better when trained jointly on multiple robots' data than when trained on a single robot alone. This is a signal that different robots' experiences can help each other.
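The idea of unifying per-robot logs into one format can be sketched as follows. The `Step` schema, the field names in the raw records, and the unit conventions are all hypothetical choices for illustration; they are not the actual Open X-Embodiment schema.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Step:
    """Hypothetical common record: one timestep from any robot."""
    robot: str
    instruction: str
    image_key: str
    action: List[float]  # assumed layout: [dx, dy, dz, gripper_closed], metres


def from_robot_a(rec: Dict[str, Any]) -> Step:
    # Assumed: robot A logs metres and a boolean gripper flag.
    gripper = 1.0 if rec["grip_closed"] else 0.0
    return Step("robot_a", rec["task"], rec["cam"], [*rec["delta_xyz_m"], gripper])


def from_robot_b(rec: Dict[str, Any]) -> Step:
    # Assumed: robot B logs millimetres and gripper opening in [0, 1] (1 = open).
    dx, dy, dz = (v / 1000.0 for v in rec["move_mm"])
    return Step("robot_b", rec["instruction"], rec["image"],
                [dx, dy, dz, 1.0 - rec["gripper_open"]])


CONVERTERS: Dict[str, Callable[[Dict[str, Any]], Step]] = {
    "robot_a": from_robot_a,
    "robot_b": from_robot_b,
}


def unify(raw: List[Tuple[str, Dict[str, Any]]]) -> List[Step]:
    """Convert heterogeneous logs into one list of common-format steps."""
    return [CONVERTERS[name](rec) for name, rec in raw]


raw_logs = [
    ("robot_a", {"task": "pick up cup", "cam": "a/0001.png",
                 "delta_xyz_m": [0.01, 0.0, -0.02], "grip_closed": True}),
    ("robot_b", {"instruction": "pick up cup", "image": "b/0001.png",
                 "move_mm": [10.0, 0.0, -20.0], "gripper_open": 0.0}),
]
dataset = unify(raw_logs)
for step in dataset:
    print(step.robot, step.action)
```

Once both robots' records share one schema, a single training loop can consume them without knowing which robot produced which step.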

Three Kinds of Data

Data for training robot policies differs greatly in character depending on where it comes from.

| Source | Advantage | Disadvantage |
| --- | --- | --- |
| Real teleoperation | Real physics, real contact | Expensive and slow to collect |
| Simulation | Large volume, cheap, safe | sim2real gap |
| Human video | Vast and diverse | Mismatch with the robot body |

In practice, these three are combined. You learn basics in bulk from simulation, fill real-world gaps with real data, and gain broad common sense from human video. Because none alone is sufficient, how to mix data is a key design problem in practice.
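One simple way to express a data mix is as sampling weights over the three sources. The sketch below assumes a made-up 60/30/10 split; the right ratios are exactly the open design question the paragraph describes, so treat the numbers as placeholders.

```python
import random
from collections import Counter
from typing import Dict, List

# Assumed mixing ratios, purely illustrative.
MIX_WEIGHTS: Dict[str, float] = {
    "simulation": 0.6,
    "teleoperation": 0.3,
    "human_video": 0.1,
}


def sample_sources(weights: Dict[str, float], n: int, seed: int = 0) -> List[str]:
    """Draw the data source for each of n training examples."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


batch = sample_sources(MIX_WEIGHTS, 1000)
counts = Counter(batch)
print(counts)
```

Changing `MIX_WEIGHTS` is then a one-line experiment, which is convenient when the mix itself is something to be tuned.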

The Importance of Data Diversity

The generalization of a foundation model depends on **diversity** as much as on the **amount** of data. Data of picking varied objects in varied environments makes a far broader policy than data of picking the same object in the same environment a million times.

narrow data                     broad data
┌───────────────┐               ┌────────────────┐
│ cup · bright  │               │ cup·bottle·    │
│ room          │               │ tool·cloth     │
│ one robot     │               │ many robots·   │
└───────────────┘               │ lighting·      │
        │                       │ varied backdrop│
        ▼                       └────────────────┘
good only at seen                       │
(weak at unfamiliar)                    ▼
                        potential to generalize to unfamiliar ↑

This is why efforts like Open X-Embodiment matter. One lab's data alone lacks diversity, but gathering many institutions' data greatly increases the diversity of objects, environments, and robots. Diversity is the fuel of generalization.

Why Joint Training Helps

Intuitively, you might expect the opposite: robots have different bodies, so mixing their data should confuse the policy. In practice, joint training often helps. The reason is the **common structure** that different robots share.

robot A's experience robot B's experience robot C's experience

"pick up an object" "pick up an object" "pick up an object"

│ │ │

└────────┬────────┴────────┬────────┘

▼ ▼

shared essence of "picking" (approach, grasp, lift)

│

▼

even if one robot's data is scarce,

the shared concept is reinforced by others' experience

The essence of picking up, moving, and placing an object is largely shared across different bodies. So a situation that one robot has seen rarely can be covered by a shared concept learned from another robot that has seen it often. This is the main reason the foundation approach of joint training on large-scale, diverse data often outperforms training each robot individually.

Cross-Embodiment: Different Bodies as One

**Cross-embodiment** means handling robots with different "bodies (embodiments)" with a single policy. A one-armed robot, a two-armed robot, a robot with a different gripper, a robot with a different joint count — these are physically different, but they share the essence of the task "pick up and move an object."

The difficulty of cross-embodiment is obvious.

- **Action spaces differ**: Each robot has a different joint count and control scheme, so even the same "action" is expressed differently.

- **Observations differ**: Camera position, count, and field of view all vary.

- **Physics differ**: Arm length, force, and speed differ.

One way to handle this is to align actions and observations into a **common abstract representation** as much as possible. For example, if you define the policy in a representation common to many robots — like the target position and orientation of the gripper tip — each robot's concrete joint commands can be converted behind the scenes.

common policy (abstract action: gripper target pos/pose)

│

┌────┴──────────────┬──────────────────┐

▼ ▼ ▼

convert for robot A convert for robot B convert for robot C

(joint commands) (joint commands) (joint commands)

│ │ │

▼ ▼ ▼

real robot A real robot B real robot C

This way, the "conceptual skills" one policy learned can be shared across many robots, and data obtained on one robot can also boost the performance of another.
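The "common abstract action, per-robot conversion" idea can be sketched with two toy arms. Here each robot is assumed to be a two-link planar arm with different link lengths, and the common policy is a placeholder that always returns the same gripper target. Real robots have more joints and need full 3D inverse kinematics.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class PlanarArm:
    """Hypothetical two-link planar arm; link lengths in metres."""
    l1: float
    l2: float

    def joint_command(self, x: float, y: float) -> Tuple[float, float]:
        """Convert a gripper target (x, y) into two joint angles in radians."""
        c2 = (x * x + y * y - self.l1 ** 2 - self.l2 ** 2) / (2 * self.l1 * self.l2)
        if not -1.0 <= c2 <= 1.0:
            raise ValueError("target is outside this arm's reach")
        q2 = math.acos(c2)
        q1 = math.atan2(y, x) - math.atan2(self.l2 * math.sin(q2),
                                           self.l1 + self.l2 * math.cos(q2))
        return q1, q2

    def gripper_position(self, q1: float, q2: float) -> Tuple[float, float]:
        """Forward kinematics, used here only to check the conversion."""
        x = self.l1 * math.cos(q1) + self.l2 * math.cos(q1 + q2)
        y = self.l1 * math.sin(q1) + self.l2 * math.sin(q1 + q2)
        return x, y


def common_policy(instruction: str) -> Tuple[float, float]:
    # Placeholder for a learned policy: emits a robot-independent gripper target.
    return (0.30, 0.20)


robots: Dict[str, PlanarArm] = {
    "robot_a": PlanarArm(l1=0.30, l2=0.25),
    "robot_b": PlanarArm(l1=0.20, l2=0.20),
}
target = common_policy("pick up the cup")
commands = {name: arm.joint_command(*target) for name, arm in robots.items()}
reached = {name: robots[name].gripper_position(*q) for name, q in commands.items()}
print(commands)
print(reached)
```

The same abstract target yields different joint angles on each arm, yet both grippers end up at the same point, which is the property that lets one policy serve several bodies.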

The Relationship to VLA

No discussion of robot foundation models is complete without the **VLA (Vision-Language-Action)** model. A VLA is a policy that takes vision (camera) and language (instruction) as input and outputs action, connecting the achievements of language and vision models to robot action.

- **RT-2** (Google DeepMind, arXiv 2307.15818): fine-tunes a vision-language model (VLM) already trained on web data with robot data. It treats actions as discretized action tokens, integrating them with a language model's output style. Impressively, the web's visual and linguistic knowledge can transfer to robot action.

- **OpenVLA** (arXiv 2406.09246): a 7B-scale open VLA model trained on roughly 970k real robot demonstrations. It combines DINOv2 and SigLIP vision encoders with a Llama 2 language model, and being an open model it greatly helps research and reproduction.

- **π0** (Physical Intelligence): generates continuous high-frequency actions in a flow-matching / diffusion manner, aiming for precise manipulation from a direction different from discretized tokens.

- **GR00T N1** (NVIDIA): claims a dual structure combining System 1 (diffusion family) for fast reaction with System 2 for planning.

- **Helix** (Figure AI): cited as an example of the generalist VLA trend aimed at humanoids.

A technique often used here is **co-fine-tuning**. By training jointly on web vision-language data and robot trajectory data, the aim is to contain in one model both the broad common sense obtained from the web and the concrete manipulation skill of the robot. Also, efficient fine-tuning techniques such as **LoRA** are used to adapt a large model to a specific robot or task at low cost.
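The reason LoRA is cheap can be seen in a few lines. The sketch below follows the low-rank update from the LoRA paper, `W + (alpha / r) * B @ A`, in plain Python. The matrix sizes, `alpha`, and the random values are illustrative assumptions, and no real model is involved.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Effective weight: frozen W plus a scaled low-rank update B @ A."""
    r = len(a)                      # rank = number of rows of A
    delta = matmul(b, a)            # (d_out x r) @ (r x d_in)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def param_counts(d_out: int, d_in: int, r: int) -> Tuple[int, int]:
    """Trainable parameters: full fine-tuning vs. the LoRA factors A and B."""
    return d_out * d_in, r * (d_out + d_in)


rng = random.Random(0)
d_out, d_in, r = 4, 6, 2
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]     # frozen
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]      # trainable
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init

# With B initialised to zero, the adapted weight starts out equal to W.
print(lora_weight(W, A, B, alpha=8) == W)

full, low_rank = param_counts(d_out=4096, d_in=4096, r=8)
print(full, low_rank, full // low_rank)
```

For a 4096 by 4096 layer at rank 8, only the two thin factors are trained, a small fraction of the full matrix. That is what makes per-robot or per-task adaptation affordable.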

How Action Is Represented

One core design decision of a VLA model is "in what form to output action." There are two broad branches.

- **Discretized action**: Split the action space into bins and pick action tokens one by one, as a language model picks words. RT-2 is representative. It has the big advantage of reusing a language model's structure as-is.

- **Continuous action**: Generate action directly as real values. With flow matching / diffusion like π0, you can make smooth, high-frequency, precise motion.

discretized (RT-2 style)

action = [token1][token2][token3]... ← pick one by one like a language model

│ simple, easy to reuse, but resolution limited by bin count

continuous (π0 style)

action = generate a real-valued vector by diffusion/flow

│ smooth and precise, but training/inference is more complex

Which is better depends on the task. Discretization suffices for coarse pick-and-place, but continuous action tends to be favorable for precise assembly or flexible motion. This is still an actively explored design space.
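The resolution limit of discretized actions is easy to see in code. This sketch bins each action dimension uniformly and decodes a token back to its bin centre. The bin count of 256 and the range of -1 to 1 are assumptions for illustration.

```python
from typing import List


def discretize(value: float, low: float, high: float, bins: int = 256) -> int:
    """Map a continuous action value to a bin index (an "action token")."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * bins)
    return min(index, bins - 1)  # value == high would otherwise fall past the last bin


def undiscretize(token: int, low: float, high: float, bins: int = 256) -> float:
    """Map a bin index back to the centre of its bin."""
    width = (high - low) / bins
    return low + (token + 0.5) * width


LOW, HIGH, BINS = -1.0, 1.0, 256
action: List[float] = [0.013, -0.42, 0.999]

tokens = [discretize(a, LOW, HIGH, BINS) for a in action]
decoded = [undiscretize(t, LOW, HIGH, BINS) for t in tokens]
errors = [abs(a - d) for a, d in zip(action, decoded)]
half_width = (HIGH - LOW) / BINS / 2

print(tokens)
print(max(errors) <= half_width)
```

The round-trip error is bounded by half a bin width, so more bins give finer motion at the cost of a larger token vocabulary. A continuous action head avoids this bound but needs a generative procedure such as diffusion or flow matching to produce the values.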

The Internals of a VLA

A typical VLA is made roughly of the following parts.

camera image ──▶ [vision encoder] ──┐

├──▶ [language/fusion backbone] ──▶ [action head] ──▶ action

language instruction ──▶ [text enc] ─┘

(robot state) ──────────────────────┘

· vision encoder: DINOv2, SigLIP, etc. (image to features)

· backbone: Llama family, etc., large language model (vision-language fusion, reasoning)

· action head: generate discrete tokens or continuous action

This is exactly the structure we described earlier when we said OpenVLA combined DINOv2 and SigLIP vision encoders with Llama 2. It takes strong vision and language parts pretrained on the web, attaches a head that emits robot action, and trains it with robot data. Thanks to the pretrained parts, even with relatively little robot data, it can leverage the web's knowledge.

The Scaling Perspective

One lesson of language models was that "growing the scale improves capability more broadly than expected." Robot foundation models carry the same expectation.

data scale/diversity ↑ model scale ↑

│ │

└───────────┬───────────┘

▼

generalization to broader tasks/robots (expected)

│

┌───────────┴───────────┐

▼ ▼

cope with unseen objects do new tasks by changing

(expected, not guaranteed) only the instruction

(expected, not guaranteed)

However, there is a decisive difference between language and robots. Text exists on the internet in effectively unlimited amounts, but robot data must be collected by actually moving a body in the physical world, which makes collection far more expensive and far slower. Simulation data, learning from human video, and data-efficient methods are therefore actively researched as complements to scaling. Whether scaling works as smoothly for robots as for language is still an open question.

The Lesson of Language Models, and Its Limits

The scaling laws observed in language models (grow data, model, and compute and loss decreases predictably) also inspire robots. But for a few reasons they may not transfer as-is.

- **Data bottleneck**: Robot data must be gathered physically, so it is hard to grow infinitely like text.

- **Ambiguity of evaluation**: Language has a clean objective of next-token prediction, but a robot's "success" is complex to define.

- **Physical constraints**: However good a policy is, it cannot exceed the hardware's physical limits.

Even so, the direction holds. Various studies observe that more data, more varied data, larger models, and better learning methods broaden the generalization of robot policies. But how much each factor helps, and in what way, is not settled as cleanly as for language.

Challenges: Data, Safety, Evaluation

Robot foundation models have challenges as large as their promise.

Data

As said, robot data is expensive to gather. Data must be built up by a person teleoperating a real robot or by demonstration, and securing diversity (many objects, environments, robots) is especially hard. Simulation helps, but at the cost of the sim2real gap.

Safety

A robot that actually moves in the physical world can break objects or hurt people if it errs. A language model producing a wrong sentence and a robot moving with the wrong force are different kinds of risk. So force/speed limits, emergency stops, and contact-safety design must be considered together with the policy itself.

Evaluation

It is also hard to measure fairly how good a policy is. Unlike a language model, a robot policy cannot easily be scored against a fixed benchmark, and real-robot experiments are difficult to reproduce. Standardized task sets, success-rate definitions, and reproducible evaluation protocols are therefore very important for the maturity of this field.

┌──────── three challenges of robot foundation models ────────┐

│ │

│ data ──▶ expensive/slow collection, hard diversity │

│ safety ──▶ physical risk, force/contact/e-stop design │

│ evaluation ──▶ hard to reproduce, immature benchmarks │

│ │

│ all three must be solved to reach practical generalization│

└──────────────────────────────────────────────────────────────┘

Deployment and Inference: Running on Real Hardware

Running a large foundation model on an actual robot comes with practical constraints. A robot must move in real time, so however smart the model is, it is useless if inference is slow.

┌──────────── constraints of real deployment ────────────┐

│ │

│ real-time ──▶ the model must respond within the cycle │

│ compute ──▶ limits of the computer on the robot │

│ latency ──▶ observation→action latency must be small │

│ safety ──▶ defense against abnormal outputs needed │

│ │

└──────────────────────────────────────────────────────────┘

So in practice several techniques are used to trade capability for speed: distilling a large model into a small, fast one; separating slow high-level planning from fast low-level execution; or quantizing the model to make it lighter. The System 1 / System 2 structure that GR00T N1 claims can be understood in this context too: it separates the parts that need fast reaction from the parts that need slow planning.
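Of the options above, quantization is the easiest to illustrate in isolation. The sketch below applies symmetric per-tensor int8 quantization to a short list of made-up weights. Production toolchains use more elaborate schemes, so this is only a picture of the basic trade: fewer bits in exchange for a bounded rounding error.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers."""
    return [q * scale for q in quantized]


weights = [0.82, -0.31, 0.05, -1.27, 0.0]  # illustrative values only
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]

print(q)
print(max(errors) <= scale / 2 + 1e-12)
```

Each weight now fits in one byte instead of four (for float32), and the per-weight error stays within half a quantization step. Whether that error is acceptable for a given manipulation task has to be checked empirically.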

Revisiting Safety

The safety of robot foundation models is qualitatively different from that of language models. A language model's wrong output is text, but a robot's wrong output is physical motion.

- **Output validation**: Check before execution whether the action the policy produced is within a physically safe range (speed, force, joint limits).

- **Emergency stop**: A person must be able to stop the robot immediately at any time.

- **Contact safety**: Hardware and software together prevent excessive force in contact with people or objects.

- **Out-of-distribution awareness**: On meeting an unfamiliar situation never seen in training, it is safer to stop or ask for help than to force an action.

The key is that safety cannot be guaranteed by the policy's intelligence alone. You need a double safety net that places separate safety layers above and below the policy so that even if the policy errs, it does not lead to physical harm.
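A safety layer that sits below the policy can be sketched as a filter on its output. In this example the limits, the `ood_score` input (a hypothetical novelty score in [0, 1] from some upstream detector), and its threshold are all assumptions. A real system would also rely on hardware interlocks, not software alone.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SafetyLimits:
    max_speed: float        # per-joint speed limit in rad/s (assumed value)
    joint_min: List[float]  # lower joint position limits in rad
    joint_max: List[float]  # upper joint position limits in rad


def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def safe_command(joint_pos: List[float], joint_vel_cmd: List[float], dt: float,
                 limits: SafetyLimits, e_stop: bool, ood_score: float,
                 ood_threshold: float = 0.8) -> List[float]:
    """Filter the policy's velocity command before it reaches the motors."""
    # Emergency stop or an unfamiliar scene: command zero velocity.
    if e_stop or ood_score > ood_threshold:
        return [0.0] * len(joint_pos)

    filtered = []
    for q, v, lo, hi in zip(joint_pos, joint_vel_cmd, limits.joint_min, limits.joint_max):
        v = clamp(v, -limits.max_speed, limits.max_speed)   # speed limit
        predicted = q + v * dt
        if predicted > hi:                                   # joint position limit
            v = (hi - q) / dt
        elif predicted < lo:
            v = (lo - q) / dt
        filtered.append(clamp(v, -limits.max_speed, limits.max_speed))
    return filtered


limits = SafetyLimits(max_speed=2.0, joint_min=[-1.57, -1.57], joint_max=[1.57, 1.57])
cmd = safe_command([0.0, 1.5], [5.0, 1.0], dt=0.1, limits=limits,
                   e_stop=False, ood_score=0.1)
print(cmd)
```

The filter does not care how the command was produced, so it keeps working even when the policy is wrong. That is the point of keeping it separate from the policy.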

Seeing It in a Small Case: "Clear the Table"

Let us narrow the abstract talk into one picture. Imagine how a generalist policy would handle the instruction "clear the things off the table."

instruction: "clear the table"

│

▼

[foundation policy understands the scene]

│ recognize cup, plate, spoon; infer each destination

▼

sequential execution:

pick up the cup ──▶ to the sink (hold grasp by touch)

pick up the plate ──▶ to a set location (watch for slip)

gather the spoons ──▶ into the holder (precise manipulation)

│

▼

re-check the scene ──▶ if items remain, repeat

Here the power of the foundation model shows. The common sense to unfold the abstract language instruction "clear" into concrete objects and destinations comes from web data. And the manipulation skill to actually pick up and move each object comes from robot data. The combination of the two is the picture the generalist policy aims at. But this is an ideal scenario, and reaching this level of reliability in reality is still a hard challenge.
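The control flow of this scenario (recognize, assign destinations, execute, re-check) can be written as a small loop. The destination table and the `try_move` callback are hypothetical stand-ins: in a real system the first would come from the model's scene understanding and the second from the robot actually attempting the motion.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Assumed mapping from object type to destination.
DESTINATIONS: Dict[str, str] = {"cup": "sink", "plate": "shelf", "spoon": "holder"}


def clear_table(table_items: List[str],
                try_move: Callable[[str, str], bool],
                max_passes: int = 3) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Move items to their destinations, re-checking the table after each pass."""
    placed: List[Tuple[str, str]] = []
    remaining = list(table_items)
    for _ in range(max_passes):
        if not remaining:
            break
        still_there = []
        for item in remaining:
            dest = DESTINATIONS.get(item)
            if dest is not None and try_move(item, dest):
                placed.append((item, dest))
            else:
                still_there.append(item)   # unknown item or failed attempt
        remaining = still_there
    return placed, remaining


attempts: Counter = Counter()


def flaky_move(item: str, dest: str) -> bool:
    """Simulated execution: the plate slips on the first attempt only."""
    attempts[item] += 1
    return not (item == "plate" and attempts[item] == 1)


placed, remaining = clear_table(["cup", "plate", "spoon"], flaky_move)
print(placed)
print(remaining)
```

The bounded retry loop reflects the "re-check the scene" step: a failed grasp is picked up on the next pass, and an item with no known destination is left in place rather than forced.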

Outlook

Robot foundation models are still early. But the direction is clear. Gathering large-scale, diverse data into a common format, sharing many robots' experiences via cross-embodiment, and connecting the achievements of language and vision to action via VLA — these trends keep getting stronger.

The realistic picture of the near future is probably not a "fully omnipotent robot" but a policy that **generalizes broadly yet adapts quickly to a specific task with a small amount of data**: equipped with broad common sense from web knowledge, it learns manipulation from robot data and is fine-tuned at a new site with a few demonstrations. That combination is closer to the realistic balance point between practicality and generalization.

Learning from Human Video

If robot data is scarce, could we not use the **human video** overflowing on the internet? Videos of people cooking, assembling, and handling objects are vast. If we could extract manipulation knowledge from these videos, the data bottleneck could be greatly eased.

human video (vast) ──▶ knowledge of "what to do and how"

│

▼ a bridge to the robot body is needed

│

robot execution data (small) ──▶ correction that fills the body difference

│

▼

broad human knowledge + concrete robot execution

The difficulty is clear. A human hand and a robot gripper differ, video has no force information, and the viewpoint differs from the robot's. So human video cannot be used directly as a policy; approaches are researched that extract the high-level knowledge of "what to do" and combine it with robot data. Human video is a promising breakthrough for the data bottleneck, but filling the body difference is still an open problem.

Efforts Toward Evaluation

We said evaluation is hard. Let us look a bit more concretely at directions to improve it.

- **Standard task sets**: Define a collection of tasks usable in common across many robots and environments.

- **Clarifying success criteria**: Define unambiguously what "success" is (e.g., whether the object was placed at the destination).

- **Reproducible protocols**: Record lighting, objects, and initial placement so another team can reproduce the same conditions.

- **Simulation benchmarks**: To ease the reproduction burden of real experiments, also evaluate in standardized simulation environments.

Once such a shared evaluation foundation is in place, claims that "our model is better" can be fairly verified. Just as the vision and language fields advanced fast thanks to benchmarks, standardizing evaluation is the key to maturity for robots too.
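Because real-robot evaluations usually involve few trials, a success rate is more informative with an uncertainty range attached. The sketch below computes a Wilson score interval for a binary success criterion. The trial counts are invented for illustration, and the choice of a Wilson interval is one common option rather than a standard of the field.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% confidence interval for a success rate (z = 1.96)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Illustrative counts: the same 85% success rate at two sample sizes.
low_20, high_20 = wilson_interval(17, 20)
low_200, high_200 = wilson_interval(170, 200)

print(round(low_20, 3), round(high_20, 3))
print(round(low_200, 3), round(high_200, 3))
```

With 20 trials the interval is wide enough that two policies with quite different true success rates could look the same, which is one reason shared protocols and larger trial counts matter for fair comparison.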

Language Models and Robot Policies: What Is Same and Different

Robot foundation models borrowed the ideas of language models, but there are important differences. Let us summarize in a table.

| Aspect | Language model | Robot foundation model |
| --- | --- | --- |
| Data | Effectively unlimited on the web | Painstakingly collected in the physical world |
| Output | Text (harmless) | Physical motion (possibly dangerous) |
| Evaluation | Relatively clear by benchmark | Hard to standardize/reproduce |
| Cost of error | A wrong sentence | Object damage, safety risk |
| Feedback | Next-token prediction | Physical success/failure |

Because of these differences, you cannot copy the success formula of language models onto robots as-is. Even so, the core philosophy of "train one large model on large-scale, diverse data to generalize broadly" seems a valid direction for robots too. Understanding the differences and adjusting for robots is the challenge of this field.

Misconceptions Around Robot Foundation Models

In a fast-advancing field, exaggeration and misconception are common. A few are noted here for balance.

- **"An omnipotent robot is coming soon"**: Demos are impressive, but reliable generalization to unfamiliar environments and objects is still hard.

- **"Just grow the data"**: Not only amount but diversity and quality matter, and data collection itself is a large bottleneck.

- **"Simulation is enough"**: Simulation is powerful, but the sim2real gap remains and real data is still needed.

- **"One model replaces all robots"**: Realistically, a form that generalizes broadly yet adapts on-site with a small amount of data is more likely.

To understand the technology accurately, you need a cool-eyed view of the distance between an impressive demo and robust practicality.

The Meaning of Open Models

That OpenVLA is an open model has special significance for this field. Just as open models in the language field greatly accelerated research, reproduction, and application, open policies and open data are catalysts for progress in robots too.

open data (Open X-Embodiment) + open model (OpenVLA)

│

▼

anyone can reproduce, verify, improve

│

▼

accelerate the progress of the whole research community

Precisely because collecting robot data is expensive, a culture of sharing data and models is especially valuable. Rather than one institution doing everything alone, institutions pool data, share models, and progress together, and this has helped the field mature quickly. Commercial and closed efforts are also strong, and the two trends push each other forward.

Closing

The goal of "one policy for many jobs" brought language-model-style thinking into robotics. Generalist policies, large-scale data, cross-embodiment, and VLA are the core pieces of that trend. The three challenges of data, safety, and evaluation remain large, and gradually lowering that wall is where the field stands today.

Research is steadily moving toward the day when one robot policy, like a single language model, broadly does the things we ask of it in words.

References

- Open X-Embodiment (arXiv): [https://arxiv.org/abs/2310.08864](https://arxiv.org/abs/2310.08864)

- RT-2: Vision-Language-Action Models (arXiv): [https://arxiv.org/abs/2307.15818](https://arxiv.org/abs/2307.15818)

- OpenVLA: An Open-Source Vision-Language-Action Model (arXiv): [https://arxiv.org/abs/2406.09246](https://arxiv.org/abs/2406.09246)

- LoRA: Low-Rank Adaptation (arXiv): [https://arxiv.org/abs/2106.09685](https://arxiv.org/abs/2106.09685)

- Physical Intelligence (π0): [https://www.physicalintelligence.company/](https://www.physicalintelligence.company/)

- Google DeepMind Robotics: [https://deepmind.google/discover/blog/](https://deepmind.google/discover/blog/)

- NVIDIA Isaac / GR00T context: [https://developer.nvidia.com/isaac](https://developer.nvidia.com/isaac)

- Hacker News (robot foundation models discussion): [https://news.ycombinator.com/](https://news.ycombinator.com/)