Introduction: A Bridge Between Perception and Action

For a long time, robot learning meant stitching together three separate pipelines. A perception module interpreted what the camera saw, a planning module decided what to do, and a control module actually moved the joints. Each was designed and trained separately. Information was lost at the boundaries between stages, and the moment a new object or a new instruction appeared, the whole pipeline tended to fall apart.

Vision-Language-Action (VLA) models attempt to unify these three stages into a single neural network. They take a camera image plus a natural-language instruction such as "pick up the red cup and put it on the plate" and directly output the action the robot should take next. The core insight is simple: a vision-language model (VLM) pretrained on web-scale image-text data already holds rich common sense and generalization ability about the world, so we only need to teach it one additional output modality — "action."

This post walks through RT-2, which opened the VLA paradigm; Open X-Embodiment / RT-X, which laid the data foundation; and OpenVLA, which reproduced and extended these ideas in open source. We stay grounded in confirmed facts and, since versions and detailed numbers can differ by source, cover only what is well established.

The VLA Paradigm

Treating Actions as Tokens

A language model views text as a sequence of tokens and predicts the next token. The core idea of VLA is that a robot's action can likewise be expressed as a sequence of tokens. A robot arm's action is usually a continuous-valued vector:

action = [Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper]

└─── end-effector position/orientation delta ───┘ └ grip ┘

If we split each dimension of this 7-D continuous vector into bins and convert it into integer tokens, an action becomes a sequence of integer tokens. Then the problem "look at the image and instruction and predict the next action token" has exactly the same shape as the "next-token prediction" problem language models already solve.
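The binning step can be sketched in a few lines. This is a minimal illustration assuming uniform bins and a made-up value range of [-1, 1] for every dimension; the function names and the range are hypothetical, and real systems derive ranges from the robot and its data.

```python
NUM_BINS = 256  # bins per action dimension


def tokenize(value, v_min, v_max, num_bins=NUM_BINS):
    """Map a continuous value to an integer bin index in [0, num_bins - 1]."""
    clipped = min(max(value, v_min), v_max)
    fraction = (clipped - v_min) / (v_max - v_min)
    return min(int(fraction * num_bins), num_bins - 1)


def detokenize(index, v_min, v_max, num_bins=NUM_BINS):
    """Map a bin index back to the center of its bin."""
    width = (v_max - v_min) / num_bins
    return v_min + (index + 0.5) * width


# Hypothetical range shared by all dimensions (an assumption for this sketch).
V_MIN, V_MAX = -1.0, 1.0

# [dx, dy, dz, droll, dpitch, dyaw, gripper]
action = [0.10, -0.25, 0.00, 0.50, -1.00, 1.00, 0.75]
tokens = [tokenize(v, V_MIN, V_MAX) for v in action]
recovered = [detokenize(t, V_MIN, V_MAX) for t in tokens]

print(tokens)  # [140, 96, 128, 192, 0, 255, 224]
```

The round trip is lossy: each recovered value can differ from the original by up to half a bin width, which is the quantization error discussed later in this post.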

┌─────────────────────────────────────────────────────────┐

│ Basic VLA Inference Flow │

└─────────────────────────────────────────────────────────┘

   Camera image         Language instruction
(observation o_t)       "pick up the red cup"
        │                         │
        ▼                         ▼
   ┌──────────┐              ┌──────────┐
   │  Vision  │              │   Text   │
   │ encoder  │              │ tokenizer│
   └────┬─────┘              └────┬─────┘
        │ vision tokens           │ language tokens
        └────────────┬────────────┘
                     ▼
            ┌──────────────────┐
            │ Large Transformer│
            │  (VLM backbone)  │
            └────────┬─────────┘
                     ▼
            ┌──────────────────┐
            │  action tokens   │ ──▶ [126, 14, 200, ...]
            └────────┬─────────┘
                     ▼
            ┌──────────────────┐
            │   de-tokenize    │ ──▶ continuous action vector
            └────────┬─────────┘
                     ▼
            robot executes a_t

Why Start From a VLM

The reason to start from a VLM is transfer. A model trained on web image-text pairs already knows concepts like "cup," "left," "stack," and "red." Robot data alone struggles to provide such broad semantic knowledge, because robot demonstration data is extremely expensive to collect and therefore limited in quantity. By layering robot action on top of a VLM's prior knowledge, the model can generalize to objects and instructions it never saw during training.

RT-2: Fine-Tuning a VLM Into a Robot Policy

RT-2 (Robotic Transformer 2, Google DeepMind, arXiv 2307.15818) is the work that put this paradigm forward in earnest. Its core ideas are:

1. Use a large vision-language model (the PaLI-X / PaLM-E family) as the backbone.

2. Represent robot actions as discrete tokens. Each of the 7 action dimensions is quantized into 256 bins and turned into tokens.

3. Perform co-fine-tuning on web data (VQA, captioning, etc.) together with robot demonstration data.

Emitting Actions Like Text

What is striking about RT-2 is that action tokens are integrated into the model's vocabulary. The model generates actions as if generating a sentence. The output string is conceptually like this:

input: [image] "pick up the bag about to fall off the table"

output (conceptual):

"terminate Δpos_x Δpos_y Δpos_z Δrot_x Δrot_y Δrot_z gripper"

→ "1 135 149 125 124 135 134 141"

└ each number is a quantized action token ┘

Each integer is a predefined bin index, recovered into a real continuous action value via de-tokenization.
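A minimal sketch of that de-tokenization, using the conceptual output string above. The parser name, the value range of [-1, 1], and the use of bin centers are assumptions for illustration and are not taken from the RT-2 implementation.

```python
def parse_action_string(text, v_min=-1.0, v_max=1.0, num_bins=256):
    """Split an RT-2-style output string into a terminate flag and 7 continuous values.

    The value range [v_min, v_max] is a hypothetical placeholder.
    """
    fields = [int(tok) for tok in text.split()]
    if len(fields) != 8:
        raise ValueError(f"expected 8 integer tokens, got {len(fields)}")
    terminate, bins = fields[0], fields[1:]
    if any(not 0 <= b < num_bins for b in bins):
        raise ValueError("bin index out of range")
    width = (v_max - v_min) / num_bins
    action = [v_min + (b + 0.5) * width for b in bins]  # bin centers
    return bool(terminate), action


terminate, action = parse_action_string("1 135 149 125 124 135 134 141")
print(terminate, [round(v, 4) for v in action])
```

The first field is read as the terminate flag and the remaining seven as position delta, rotation delta, and gripper, in the order shown in the conceptual string.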

Co-Fine-Tuning and Generalization

The most important result RT-2 demonstrated was generalization to novel objects and novel instructions. Because web data was part of training, the model could often interpret and act reasonably on objects it had never manipulated with a robot ("pick up the endangered-animal plush") and on instructions requiring symbolic reasoning. In other words, internet-scale knowledge appeared to carry over into robot control as an emergent capability.

That said, RT-2's weights were not released, it relied on a huge proprietary VLM, and inference was heavy. Two follow-up directions emerged: data-side consolidation (Open X-Embodiment) and open-source reproduction (OpenVLA).

Open X-Embodiment / RT-X: A Cross-Robot Dataset

If RT-2 proposed the model architecture, Open X-Embodiment (arXiv 2310.08864) laid the data foundation. The motivating problem: each lab collected only its own data with its own robot, and a policy trained on that data could not be used on another robot. Data was fragmented by embodiment.

Open X-Embodiment is a large collection that consolidates demonstration data from many institutions and diverse robots into a single standard format — from single-arm manipulators to bimanual robots to mobile platforms.

┌──────────────────────────────────────────────────────────┐

│ Open X-Embodiment: Cross-Robot Data Pooling │

└──────────────────────────────────────────────────────────┘

 Lab A robot   Lab B robot   Lab C robot    ...  Lab N robot
 (7-DoF arm)   (bimanual)   (mobile base)      (other gripper)
      │             │             │                   │
      └─────────────┴──────┬──────┴───────────────────┘
                           ▼
               ┌───────────────────────┐
               │  Standardized format  │
               │ (obs · action · text) │
               └───────────┬───────────┘
                           ▼
              ┌─────────────────────────┐
              │  RT-X policy training   │
              │(joint multi-embodiment) │
              └────────────┬────────────┘
                           ▼
        one policy runs across many embodiments
             (positive transfer observed)

Policies of the RT-X family trained on this data showed positive transfer: training across many embodiments at once improved performance over training on a single embodiment's data alone. What was learned on one robot lifted the performance of another. This mirrors the intuition from NLP that larger and more diverse datasets lead to better generalization. Open X-Embodiment went on to become a widely used training source for open VLA research, including OpenVLA.

OpenVLA: An Open-Source 7B VLA

OpenVLA (arXiv 2406.09246) reproduces the vision RT-2 showed and makes it usable by anyone. It is roughly 7 billion (7B) parameters, with weights and training code released. Training used about 970k real robot demonstrations drawn from Open X-Embodiment.

Dual Vision Encoders + Llama 2

A notable choice in OpenVLA's architecture is the use of two vision encoders together.

┌────────────────────────────────────────────────────────────┐

│ OpenVLA Architecture (sketch) │

└────────────────────────────────────────────────────────────┘

                input image
                     │
            ┌────────┴─────────┐
            ▼                  ▼
        ┌────────┐         ┌─────────┐
        │ DINOv2 │         │ SigLIP  │
        │(spatial│         │(semantic│
        │/geom.) │         │/lang.)  │
        └───┬────┘         └───┬─────┘
            │  feature concat  │
            └────────┬─────────┘
                     ▼
              ┌─────────────┐       "pick up the red cup"
              │  projector  │                 │
              │    (MLP)    │                 ▼
              └──────┬──────┘          ┌─────────────┐
                     │                 │ text tokens │
                     │                 └──────┬──────┘
                     └───────────┬────────────┘
                                 ▼
                  ┌──────────────────────────────┐
                  │    Llama 2 (7B) backbone     │
                  │ (autoregressive Transformer) │
                  └──────────────┬───────────────┘
                                 ▼
                      action token prediction
                      [126, 14, 200, 51, ...]
                                 ▼
                  de-tokenize → continuous action

- **DINOv2**: Strong on spatial and geometric features from self-supervised learning. It captures object location, boundaries, and structure well.

- **SigLIP**: Strong on language-aligned semantic features. It links meaning like "red" and "cup" to pixels.

Combining both encoders lets the model represent "what (semantics) is where (space)," which helps tasks like manipulation that demand precise spatial understanding. The fused visual features pass through a projector (MLP) into the language model's token space and enter the Llama 2 decoder alongside text tokens.
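The fusion path can be sketched with NumPy as a shape-level illustration. All dimensions are toy values, the weights are random, and ReLU stands in for whatever activation the real projector uses; none of these are OpenVLA's actual sizes or parameters.

```python
import numpy as np


def fuse_and_project(dino_feats, siglip_feats, w1, b1, w2, b2):
    """Concatenate per-patch features from two encoders, then apply a 2-layer MLP."""
    fused = np.concatenate([dino_feats, siglip_feats], axis=-1)  # (patches, d_dino + d_siglip)
    hidden = np.maximum(fused @ w1 + b1, 0.0)  # ReLU as a simple stand-in activation
    return hidden @ w2 + b2  # (patches, d_model)


rng = np.random.default_rng(0)
num_patches, d_dino, d_siglip, d_hidden, d_model = 16, 8, 6, 32, 24  # toy sizes

dino_feats = rng.normal(size=(num_patches, d_dino))      # stand-in for DINOv2 patch features
siglip_feats = rng.normal(size=(num_patches, d_siglip))  # stand-in for SigLIP patch features
w1 = rng.normal(size=(d_dino + d_siglip, d_hidden)) * 0.1
b1 = np.zeros(d_hidden)
w2 = rng.normal(size=(d_hidden, d_model)) * 0.1
b2 = np.zeros(d_model)

vision_tokens = fuse_and_project(dino_feats, siglip_feats, w1, b1, w2, b2)
text_tokens = rng.normal(size=(5, d_model))  # stand-in for embedded instruction tokens
sequence = np.concatenate([vision_tokens, text_tokens], axis=0)  # what the decoder would see

print(vision_tokens.shape, sequence.shape)  # (16, 24) (21, 24)
```

The point of the sketch is the data flow: features are joined per patch along the channel axis, projected to the language model's embedding width, and then placed in one sequence with the text tokens.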

Details of Action Tokenization

Like RT-2, OpenVLA turns continuous actions into discrete tokens. A common scheme quantizes each of the 7 action dimensions into 256 bins. A frequent trick is to repurpose rarely used tokens in the language model's existing vocabulary as action tokens.

Tokenizing action dimension d

  value range: [v_min, v_max] split into 256 bins

  v_min ├─┬─┬─┬─ ... ─┬─┤ v_max
         0 1 2 3      255     ← bin index = action token

  Choosing the boundaries from data quantiles fits the bins to the
  data distribution, which reduces quantization error.

  inference: model predicts an index ──▶ map back to bin center
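The scheme above can be sketched as follows. Two assumptions are made for illustration: the range is taken from the 1st and 99th percentiles of the data (one way to use quantiles), and action bins are mapped onto the last 256 ids of a hypothetical 32,000-token vocabulary as a stand-in for "rarely used tokens". The function names are likewise hypothetical.

```python
import random
import statistics

NUM_BINS = 256
VOCAB_SIZE = 32000  # hypothetical vocabulary size


def quantile_bounds(values):
    """Return the 1st and 99th percentiles, so outliers do not stretch the bins."""
    cuts = statistics.quantiles(values, n=100)  # 99 cut points
    return cuts[0], cuts[-1]


def to_bin(value, v_min, v_max):
    """Clip to the range, then map to a bin index in [0, NUM_BINS - 1]."""
    clipped = min(max(value, v_min), v_max)
    return min(int((clipped - v_min) / (v_max - v_min) * NUM_BINS), NUM_BINS - 1)


def bin_to_token_id(bin_index):
    """Reuse the last NUM_BINS ids of the vocabulary as action tokens (an assumption)."""
    return VOCAB_SIZE - NUM_BINS + bin_index


def token_id_to_value(token_id, v_min, v_max):
    """Invert the mapping: token id -> bin index -> bin center."""
    bin_index = token_id - (VOCAB_SIZE - NUM_BINS)
    return v_min + (bin_index + 0.5) * (v_max - v_min) / NUM_BINS


rng = random.Random(0)
# Toy data for one action dimension: mostly small deltas plus two outliers.
samples = [rng.gauss(0.0, 0.05) for _ in range(5000)] + [3.0, -3.0]
v_min, v_max = quantile_bounds(samples)

token_id = bin_to_token_id(to_bin(0.02, v_min, v_max))
value = token_id_to_value(token_id, v_min, v_max)
```

With the raw min and max, the two outliers at ±3.0 would stretch the range so that most of the 256 bins cover values that almost never occur. The percentile range keeps the bins concentrated where the data is.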

Discretization has the big advantage of reusing the language-model architecture as is, but it has two costs. It incurs quantization error, and it forces the model to decode an action one dimension at a time, so each action costs several sequential decoding steps. These limitations motivate the continuous-action models (Diffusion Policy, π0) covered in a follow-up post.

Efficient Adaptation: LoRA and Quantization

Full fine-tuning of an entire 7B model for every new robot or task is expensive. OpenVLA showed that parameter-efficient fine-tuning such as LoRA (low-rank adaptation) can adapt to new settings with modest resources. The idea of LoRA can be summarized as:

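# Conceptual sketch: freeze the original weights, train only the low-rank matrices
# (real implementations use libraries such as PEFT)
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base_linear: nn.Linear, rank=16, alpha=32):
        super().__init__()
        self.base = base_linear  # original weights W
        for p in self.base.parameters():
            p.requires_grad_(False)  # frozen
        d_in = base_linear.in_features
        d_out = base_linear.out_features
        # A starts small and random, B starts at zero, so B A = 0 at initialization.
        # If both were zero, neither matrix would receive a gradient.
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # trainable
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # trainable
        self.scale = alpha / rank

    def forward(self, x):
        # W x + (B A) x * scale
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale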

Because the original weight matrix is left untouched and only the correction term, expressed as the product of two small matrices A and B, is trained, the number of trainable parameters and memory drop sharply. Combined with 4-bit/8-bit quantization, inference and fine-tuning become feasible even on a single GPU (exact memory requirements can vary by configuration).
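The size of the saving follows from simple arithmetic. The sketch below counts parameters for a single linear layer; the layer width of 4096 is a hypothetical example, and rank 16 matches the default in the class above.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in A (rank x d_in) plus B (d_out x rank)."""
    return rank * d_in + d_out * rank


def full_params(d_in, d_out):
    """Parameters in the dense weight matrix W (d_out x d_in), ignoring bias."""
    return d_in * d_out


d_in = d_out = 4096  # hypothetical layer width
rank = 16
lora = lora_trainable_params(d_in, d_out, rank)
full = full_params(d_in, d_out)
print(f"LoRA: {lora:,} trainable vs full: {full:,} ({lora / full:.2%})")
# LoRA: 131,072 trainable vs full: 16,777,216 (0.78%)
```

For this layer, the low-rank correction trains under 1% of the parameters that full fine-tuning would update.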

Comparing the Three Models

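| | RT-2 | Open X-Embodiment / RT-X | OpenVLA |
| --- | --- | --- | --- |
| Role | Architectural paradigm | Data foundation | Open reproduction with efficient adaptation |
| Model | Large VLM backbone (PaLI-X / PaLM-E family) | RT-X policies trained on the pooled data | Llama 2 (7B) with DINOv2 + SigLIP vision encoders |
| Training data | Web data plus robot demonstrations (co-fine-tuning) | Demonstrations pooled from many institutions and robots | About 970k demonstrations from Open X-Embodiment |
| Openness | Weights not released | Pooled dataset in a standard format | Weights and training code released |
| Reference | arXiv 2307.15818 | arXiv 2310.08864 | arXiv 2406.09246 |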

Detailed specs in the table can vary by source and version. The key distinction is that RT-2 represents the architectural paradigm, Open X-Embodiment the data foundation, and OpenVLA open reproduction with efficient adaptation.

Data Collection and the Training Pipeline

A VLA's performance hinges on data as much as on architecture. Here we organize the whole flow of how robot demonstration data is collected and channeled into training.

How Demonstration Data Is Collected

The most common method is teleoperation, where a person directly operates the robot to demonstrate a task. The operator moves the robot via a space mouse, VR controller, or a separate master arm, and the observations (images) and actions (end-effector commands) along the way are recorded in time order.

┌──────────────────────────────────────────────────────────┐

│ Demonstration Data Collection Pipeline (sketch) │

└──────────────────────────────────────────────────────────┘

  human operator
         │  (space mouse / VR / master arm)
         ▼
   ┌───────────┐   command    ┌──────────┐
   │  teleop   │ ───────────▶ │  robot   │
   │ interface │              │ (execute)│
   └───────────┘              └────┬─────┘
                                   │  obs (cameras) · state (joints)
                                   ▼
                         ┌───────────────────┐
                         │  timestamp sync   │
                         │ (align obs ↔ act) │
                         └─────────┬─────────┘
                                   ▼
                        ┌─────────────────────┐
                        │    store episode    │ ──▶ training dataset
                        │(success/fail label) │
                        └─────────────────────┘

An episode usually consists of a sequence of "observation, action, next observation," with a language instruction attached (e.g., "fold the towel"). Tens of thousands to hundreds of thousands of such episodes become the model's training material.
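The timestamp-sync step in the pipeline can be sketched as a nearest-neighbor match between two clocks. Pairing each action with the closest camera frame and dropping pairs that are too far apart is an assumption for illustration; real pipelines may instead pair each action with the most recent preceding frame. Function names and the sample timestamps are made up.

```python
from bisect import bisect_left


def nearest_index(sorted_times, t):
    """Index of the timestamp in sorted_times closest to t."""
    i = bisect_left(sorted_times, t)
    if i == 0:
        return 0
    if i == len(sorted_times):
        return len(sorted_times) - 1
    before, after = sorted_times[i - 1], sorted_times[i]
    return i if (after - t) < (t - before) else i - 1


def align(obs_times, act_times, max_gap):
    """Pair each action with the nearest observation; drop pairs further apart than max_gap."""
    pairs = []
    for a_idx, t in enumerate(act_times):
        o_idx = nearest_index(obs_times, t)
        if abs(obs_times[o_idx] - t) <= max_gap:
            pairs.append((o_idx, a_idx))
    return pairs


obs_times = [0.00, 0.10, 0.20, 0.30]  # camera frames, seconds
act_times = [0.02, 0.11, 0.26, 0.55]  # logged commands, seconds
pairs = align(obs_times, act_times, max_gap=0.05)
print(pairs)  # [(0, 0), (1, 1), (3, 2)]
```

The last command at 0.55 s has no frame within 50 ms, so it is dropped rather than paired with a stale image.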

Representing Observations and Actions

Before training, observations and actions must be organized into a consistent format. Observations comprise camera images (sometimes multiple views) and robot state (joint angles, gripper open/close), and actions are often expressed as the 7-D end-effector command seen earlier.

Data structure of one timestep (concept)

observation o_t:

- images: front camera, wrist camera ...

- state: joint angles, gripper open/close

instruction g:

- natural-language sentence ("pick up the red cup")

action a_t:

- [Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper]

→ a sequence of (o_t, g, a_t) triples is one episode

Since camera layout, joint count, and gripper type differ by embodiment, a pooled dataset like Open X-Embodiment absorbs these differences into a standard format. That is the foundation that made cross-robot training possible.
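One way to picture such a standard format is a small shared schema plus a converter per source. The field names, the record layout, and `from_lab_record` are all hypothetical; this is not the actual Open X-Embodiment schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Timestep:
    images: Dict[str, str]  # camera name -> image path or identifier
    state: List[float]      # joint angles, gripper open/close
    action: List[float]     # [dx, dy, dz, droll, dpitch, dyaw, gripper]


@dataclass
class Episode:
    instruction: str
    steps: List[Timestep] = field(default_factory=list)


def from_lab_record(record):
    """Convert one hypothetical lab-specific record into the shared Episode format."""
    episode = Episode(instruction=record["task"])
    for frame in record["frames"]:
        action = list(frame["ee_delta"]) + [frame["gripper_cmd"]]
        if len(action) != 7:
            raise ValueError("expected a 6-D end-effector delta plus one gripper value")
        episode.steps.append(
            Timestep(images=dict(frame["cams"]), state=list(frame["joints"]), action=action)
        )
    return episode


# A made-up record in one lab's own layout.
record = {
    "task": "pick up the red cup",
    "frames": [
        {
            "cams": {"front": "front_000.png", "wrist": "wrist_000.png"},
            "joints": [0.0, 0.1, -0.2, 0.3, 0.0, 0.5, 0.0],
            "ee_delta": [0.01, 0.0, -0.02, 0.0, 0.0, 0.1],
            "gripper_cmd": 1.0,
        }
    ],
}
episode = from_lab_record(record)
```

Each lab would need its own converter, but everything downstream of `Episode` can then treat all robots the same way.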

The Co-Fine-Tuning Procedure

Co-fine-tuning, the recipe RT-2 emphasized, mixes web vision-language data with robot data during training. The key is to show the two kinds of data together at an appropriate ratio so the model learns action without losing semantic generalization.

┌──────────────────────────────────────────────────────────┐

│ Data Flow of Co-Fine-Tuning │

└──────────────────────────────────────────────────────────┘

      web VL data              robot demo data
  (VQA · captioning)    (obs · instruction · action)
           │                          │
           └────────────┬─────────────┘
                        ▼
               ┌──────────────────┐
               │   mixed batch    │  ratio control (some web, some robot)
               └────────┬─────────┘
                        ▼
               ┌──────────────────┐
               │ VLM fine-tuning  │  learn text/action tokens with same loss
               └────────┬─────────┘
                        ▼
               VLA policy complete

Fine-tuning on robot data alone tends to make the model forget the broad knowledge it learned from the web (catastrophic forgetting). Mixing in web data mitigates this forgetting and preserves generalization to "never-before-seen objects."
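The mixing step can be sketched as a batch sampler with a fixed robot share. The 50/50 ratio, the function name, and the string stand-ins for examples are assumptions for illustration; the ratio actually used is a tuning choice this post does not specify.

```python
import random


def mixed_batch(web_data, robot_data, batch_size, robot_fraction, rng):
    """Draw one batch with a fixed share of robot examples and the rest from web data."""
    n_robot = round(batch_size * robot_fraction)
    n_web = batch_size - n_robot
    batch = [("robot", rng.choice(robot_data)) for _ in range(n_robot)]
    batch += [("web", rng.choice(web_data)) for _ in range(n_web)]
    rng.shuffle(batch)
    return batch


rng = random.Random(0)
web_data = [f"vqa_{i}" for i in range(100)]    # stand-ins for VQA / captioning examples
robot_data = [f"demo_{i}" for i in range(20)]  # stand-ins for (obs, instruction, action) steps

batch = mixed_batch(web_data, robot_data, batch_size=8, robot_fraction=0.5, rng=rng)
n_robot = sum(1 for source, _ in batch if source == "robot")
```

Because action tokens live in the same vocabulary as text tokens, both kinds of example can be trained with the same next-token loss once they are in one batch.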

Evaluation and Real-World Deployment

How to Evaluate a VLA

Evaluating a VLA is inherently hard. Language models are scored on fixed benchmarks, but a robot policy must perform a task in a real physical environment. Common evaluation items are:

- **Success rate**: The fraction of tasks completed to the end. The most direct metric.

- **Generalization evaluation**: Success rate on objects, backgrounds, and instructions not seen during training. It measures VLA's core value.

- **Robustness**: Performance under lighting changes, object-position perturbations, and distractors.

- **Language understanding**: Whether, in the same scene, changing only the instruction branches to the correct task.

Classification of evaluation scenarios (concept)

┌─────────────────────────────────────────────┐

│ in-distribution (within training dist.) │

│ - seen object, seen layout → base success │

├─────────────────────────────────────────────┤

│ out-of-distribution (outside training) │

│ - new object → semantic generalization │

│ - new background → visual robustness │

│ - new instruction → language understanding │

└─────────────────────────────────────────────┘

Real-robot evaluation is hard to reproduce due to environment setup, hardware wear, and operator differences. So simulation benchmarks are used as a supplement, but the sim-to-real gap must always be kept in mind.
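Reporting success rate per evaluation split, rather than one pooled number, keeps in-distribution and out-of-distribution results separate. A minimal sketch, with a made-up trial log and hypothetical split names:

```python
from collections import defaultdict


def success_rates(trials):
    """Return {split: success fraction} from (split, succeeded) pairs."""
    counts = defaultdict(lambda: [0, 0])  # split -> [successes, total]
    for split, succeeded in trials:
        counts[split][0] += int(succeeded)
        counts[split][1] += 1
    return {split: s / n for split, (s, n) in counts.items()}


# Made-up trial log for illustration only.
trials = [
    ("in_distribution", True), ("in_distribution", True),
    ("in_distribution", True), ("in_distribution", False),
    ("new_object", True), ("new_object", False),
    ("new_instruction", False), ("new_instruction", True),
]
rates = success_rates(trials)
print(rates)
# {'in_distribution': 0.75, 'new_object': 0.5, 'new_instruction': 0.5}
```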

Considerations in Real-World Deployment

Deploying a VLA for real beyond a research setting requires extra consideration.

- **Inference latency**: A 7B-scale model is heavy to run. Reducing latency via quantization, caching, and action chunking (predicting several future actions per inference call) is key to practicality.

- **Safety guardrails**: Model output can produce dangerous actions (excessive force, collision paths), so the low-level controller imposes velocity/torque limits and collision avoidance.

- **Human in the loop**: In early deployment, a person should supervise and be able to stop immediately when risk arises.

- **Distribution monitoring**: When input deviates greatly from the training distribution, reliability drops, so a mechanism is needed to detect this and act conservatively or hand off to a human.

Safety layers of the deployment stack (concept)

VLA policy output (proposed action)
             │
             ▼
    ┌──────────────────┐
    │  safety filter   │  velocity/torque limits, workspace bounds, collision avoidance
    └────────┬─────────┘
             ▼
    ┌──────────────────┐
    │ low-level control│  convert to actual joint commands
    └────────┬─────────┘
             ▼
      robot executes ──▶ (human supervision · emergency stop available)

These layers reflect the reality that, however clever the model becomes, the cost of failure in the physical world is high. VLA progress comes together with the maturing of such operational infrastructure, not just model performance.
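A minimal sketch of the safety-filter layer, covering only two of the checks named above: a per-step motion limit and workspace bounds. The function names, limits, and coordinates are hypothetical, and collision avoidance and torque limits are left out.

```python
def clamp(value, low, high):
    return min(max(value, low), high)


def safety_filter(position, delta, max_step, workspace):
    """Limit the per-step translation and keep the target inside workspace bounds.

    position, delta: [x, y, z]; workspace: list of (low, high) per axis.
    """
    safe_delta = []
    for p, d, (low, high) in zip(position, delta, workspace):
        d = clamp(d, -max_step, max_step)  # per-step motion limit
        target = clamp(p + d, low, high)   # workspace bound
        safe_delta.append(target - p)
    return safe_delta


position = [0.40, 0.00, 0.07]   # current end-effector position, meters
proposed = [0.20, -0.01, -0.30]  # translation proposed by the policy
workspace = [(0.20, 0.60), (-0.30, 0.30), (0.05, 0.50)]

safe = safety_filter(position, proposed, max_step=0.05, workspace=workspace)
print([round(v, 3) for v in safe])  # [0.05, -0.01, -0.02]
```

The x move is cut to the step limit, the y move passes through unchanged, and the z move is cut further so the target stays above the workspace floor. The filter sits outside the model, so it applies regardless of what the policy outputs.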

Strengths and Limitations

Strengths

- **Semantic generalization**: Thanks to web pretraining, the model copes to some degree with unseen objects and instructions.

- **Single-model integration**: Reduces boundary losses across perception, planning, and control, and takes natural-language instructions directly.

- **Scalability**: Performance tends to improve with more data (Open X-Embodiment) and larger backbones.

- **Reproducibility**: When weights and code are open (as with OpenVLA), the barrier to research and application drops.

Limitations

- **Discretization error**: Action tokenization carries quantization error and can be disadvantageous for smooth, precise continuous control.

- **Control frequency**: Generating tokens one at a time autoregressively slows inference, making high-frequency (fast-reacting) control hard.

- **Data cost**: Real robot demonstration data remains very expensive to collect.

- **Safety and reliability**: Failures in the physical world are costly, and behavior on out-of-distribution situations is hard to guarantee.

- **Difficulty of evaluation**: Real-robot evaluation is hard to reproduce and sensitive to environmental differences.

Of these, discretization error and control frequency lead to approaches that generate actions as continuous values (Diffusion Policy, flow-matching-based π0), covered in a follow-up post.
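The control-frequency limit is easy to see with back-of-the-envelope arithmetic. The per-token latency below is a hypothetical figure, not a measurement of any model; real values depend on hardware, model size, and quantization.

```python
def control_frequency_hz(tokens_per_action, seconds_per_token):
    """Upper bound on action rate when tokens are decoded one at a time."""
    return 1.0 / (tokens_per_action * seconds_per_token)


# Hypothetical latency of 20 ms per decoded token.
hz = control_frequency_hz(tokens_per_action=7, seconds_per_token=0.02)
print(f"{hz:.1f} Hz")  # 7.1 Hz
```

Under that assumption, a 7-token action caps the policy at roughly 7 actions per second before any image encoding or communication overhead is counted.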

Conclusion

VLA models try to merge "semantic knowledge learned from the web" and "motor skills learned from robot demonstrations" within a single network. RT-2 showed that fine-tuning a VLM into a robot policy yields generalization to novel objects and instructions; Open X-Embodiment demonstrated positive transfer across robots; and OpenVLA reproduced these ideas in open source so anyone can experiment.

Many problems remain — the limits of discrete action tokenization, control frequency, data cost, and safety. Yet the direction of bridging perception and action with a single model seems clear. The next steps are smoother and faster action generation, and extension to complex embodiments like humanoids.

References

- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, arXiv: [2307.15818](https://arxiv.org/abs/2307.15818)

- OpenVLA: An Open-Source Vision-Language-Action Model, arXiv: [2406.09246](https://arxiv.org/abs/2406.09246)

- Open X-Embodiment: Robotic Learning Datasets and RT-X Models, arXiv: [2310.08864](https://arxiv.org/abs/2310.08864)

- LoRA: Low-Rank Adaptation of Large Language Models, arXiv: [2106.09685](https://arxiv.org/abs/2106.09685)

- DINOv2: Learning Robust Visual Features without Supervision, arXiv: [2304.07193](https://arxiv.org/abs/2304.07193)

- SigLIP: Sigmoid Loss for Language Image Pre-Training, arXiv: [2303.15343](https://arxiv.org/abs/2303.15343)

- Llama 2: Open Foundation and Fine-Tuned Chat Models, arXiv: [2307.09288](https://arxiv.org/abs/2307.09288)

- Open X-Embodiment project page: [robotics-transformer-x.github.io](https://robotics-transformer-x.github.io/)