Introduction: A Story That Begins Where Tokens End

In the previous post, we saw that VLA models turn robot actions into discrete tokens and predict the next token like a language model. This approach is elegant but carries two burdens. First, slicing continuous values into bins introduces quantization error. Second, pulling tokens one at a time autoregressively slows inference, making fast control hard.
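To put a number on the quantization error, here is a minimal sketch of uniform binning. The range of [-1, 1] and the 256 bins are assumptions chosen for illustration, and the helper names `quantize` and `dequantize` are hypothetical rather than taken from any particular VLA codebase.

```python
def quantize(value, low=-1.0, high=1.0, n_bins=256):
    """Map a continuous value to a bin index, as a discrete action tokenizer would."""
    clipped = min(max(value, low), high)
    width = (high - low) / n_bins
    # The top edge of the range belongs to the last bin.
    return min(int((clipped - low) / width), n_bins - 1)


def dequantize(index, low=-1.0, high=1.0, n_bins=256):
    """Map a bin index back to the center of its bin."""
    width = (high - low) / n_bins
    return low + (index + 0.5) * width


# Sweep the range and measure the worst round-trip error.
values = [-1.0 + 2.0 * i / 9999 for i in range(10000)]
worst = max(abs(v - dequantize(quantize(v))) for v in values)
print(f"worst round-trip error with 256 bins: {worst:.5f}")
```

The worst-case error is half a bin width, so it shrinks only as the number of bins (and hence the vocabulary) grows.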

For a robot hand to glide toward a cup and grasp it precisely, its actions must be smooth, consistent, and updated quickly enough. This has motivated a line of work that generates action not as "the next token" but as "the continuous trajectory itself." This post examines two flagship approaches: Diffusion Policy and π0 (pi-zero).

The core message is this: if you treat action as a generation problem (which continuous trajectory) rather than a classification problem (which token), you can naturally handle multimodality (several correct ways) and smoothness.

Diffusion Policy: Generating Action by Denoising

Core Intuition

Diffusion Policy brings to action generation the idea of diffusion models, which succeeded greatly in image generation. An image diffusion model starts from pure noise and gradually removes noise (denoising) to produce a plausible image. Diffusion Policy applies the same procedure to a "future action sequence" instead of an "image."

┌──────────────────────────────────────────────────────────┐

│ Diffusion Policy's Action Generation │

└──────────────────────────────────────────────────────────┘

observation (image/state) o_t ──┐ (condition)

│

pure noise ▼

A_K ──▶ A_{K-1} ──▶ ... ──▶ A_1 ──▶ A_0

(random) denoise denoise clean action sequence

step step

(a network conditioned on o_t predicts the noise)

A_0 = [a_t, a_{t+1}, ..., a_{t+H-1}] ← future H-step actions (chunk)

Two things matter here:

- **Conditional generation**: The denoising network takes the current observation o_t as a condition and refines the noise into "a plausible action trajectory for this situation."

- **Action chunk**: Instead of a single step's action, it generates a sequence of future H-step actions at once. This yields a temporally consistent, smooth trajectory and reduces the burden of replanning at every step.
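As a rough illustration of what an action chunk looks like in training data, the sketch below slices one demonstration into (observation, chunk) pairs. The function name `make_chunk_pairs` is hypothetical, and dropping the last few timesteps (rather than padding them) is an assumption made to keep the example short.

```python
def make_chunk_pairs(observations, actions, horizon):
    """Pair each observation o_t with the next `horizon` actions [a_t, ..., a_{t+H-1}]."""
    if len(observations) != len(actions):
        raise ValueError("observations and actions must have equal length")
    pairs = []
    # Stop early enough that every chunk has exactly `horizon` actions.
    for t in range(len(actions) - horizon + 1):
        pairs.append((observations[t], actions[t:t + horizon]))
    return pairs


# Toy demonstration: 6 timesteps, scalar actions, string placeholders for observations.
observations = ["o0", "o1", "o2", "o3", "o4", "o5"]
actions = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
pairs = make_chunk_pairs(observations, actions, horizon=3)
print(len(pairs), pairs[0])
```

Each pair is one training example: the observation is the condition, and the chunk is the clean target the generative model learns to produce.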

Why Diffusion Helps: Multimodality

A chronic difficulty in imitation learning is multimodality. In the same situation, a person might grasp the cup by going left or by going right. Both demonstrations are correct. Plain regression (outputting the mean) tends to produce the wrong action of "going straight through the middle," the average of the two modes.

Comparison of outputs in a multimodal situation

possible correct trajectories: ╲ ╱

╲ (left) ╱ (right)

╲ ╱

●─────● goal

regression (mean) output: ↓ (straight through → collision/fail)

diffusion output: sample from the distribution → consistently

follow either left or right (no averaging)

A diffusion model learns the probability distribution of actions and samples from it, so it can consistently pick one of several modes. This is the core reason Diffusion Policy behaves stably in precise manipulation.
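The averaging problem is easy to reproduce with a toy dataset. The sketch below uses made-up 1-D demonstrations (a lateral offset of roughly -1 for "left" and +1 for "right") and stands in for a trained generative policy by sampling from the demonstrations themselves; it is not a diffusion model, only an illustration of mean versus sample.

```python
import random
import statistics

# Hypothetical 1-D demos: lateral offset when passing an obstacle located at 0.0.
# Half pass on the left (about -1), half on the right (about +1).
rng = random.Random(0)
demos = [rng.gauss(-1.0, 0.05) for _ in range(500)]
demos += [rng.gauss(1.0, 0.05) for _ in range(500)]

# A mean-squared-error regressor with no extra context converges to the mean.
regression_output = statistics.fmean(demos)

# A policy that has learned the distribution samples from it instead.
# Drawing from the empirical demos is a stand-in for that sampling step.
sampled_output = rng.choice(demos)

print(f"regression (mean): {regression_output:+.3f}")
print(f"sampled action:    {sampled_output:+.3f}")
```

The mean lands near 0.0, straight into the obstacle, while any single sample stays close to one of the two valid modes.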

Training Objective

Training is usually done as noise prediction. Known noise is added to a clean action sequence, and the network is trained to predict that noise.
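Written out under the standard DDPM parameterization, the objective looks as follows. The symbols \bar{\alpha}_k (cumulative signal level at step k) and \epsilon_\theta (the denoising network) are notation introduced here for illustration; A_0, A_k, k, and o_t follow the diagram above.

```latex
A_k = \sqrt{\bar{\alpha}_k}\, A_0 + \sqrt{1 - \bar{\alpha}_k}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

\mathcal{L} = \mathbb{E}_{A_0,\, k,\, \epsilon}
\left[ \left\| \epsilon - \epsilon_\theta(A_k, k, o_t) \right\|^2 \right]
```

The first line is the noising step, and the second is the mean-squared error between the injected noise and the network's prediction.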

    # Conceptual pseudocode: one Diffusion Policy training step
    # (real implementations use a diffusion scheduler and U-Net/Transformer backbone)
    def training_step(obs, action_chunk):
        # action_chunk: [H, action_dim] clean future H-step actions
        k = sample_diffusion_step()                 # pick a denoising step k at random
        noise = gaussian_like(action_chunk)         # standard normal noise
        noisy = add_noise(action_chunk, noise, k)   # inject noise per the schedule

        # the network, conditioned on obs and step k, predicts the noise
        pred_noise = denoiser(noisy, k, cond=obs)
        loss = mse(pred_noise, noise)               # gap between predicted and real noise
        return loss

At inference, you start from pure noise and denoise several times with this network to obtain an action chunk. The number of denoising steps balances quality and speed, and fast-with-few-steps techniques (e.g., DDIM-style samplers) are also used.
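The inference loop can be sketched with a deterministic DDIM-style sampler. To keep it runnable without a trained network, the example assumes every demonstration equals one fixed chunk (`TARGET_CHUNK`, hypothetical), in which case the ideal noise prediction has a closed form. The linear beta schedule with K = 1000 is also an assumption; `toy_denoiser` and `ddim_sample` are illustrative names, not a library API.

```python
import math
import random

K = 1000  # number of diffusion steps used in training (assumed)
betas = [1e-4 + (0.02 - 1e-4) * i / (K - 1) for i in range(K)]
alpha_bar = [1.0]  # alpha_bar[0] = 1 means "no noise"
for beta in betas:
    alpha_bar.append(alpha_bar[-1] * (1.0 - beta))

TARGET_CHUNK = [0.0, 0.1, 0.2, 0.3]  # hypothetical clean chunk A_0 (H = 4, scalar actions)


def toy_denoiser(noisy, k, cond=None):
    """Stand-in for a trained network: the exact noise when all data equals TARGET_CHUNK."""
    a, s = math.sqrt(alpha_bar[k]), math.sqrt(1.0 - alpha_bar[k])
    return [(x - a * c) / s for x, c in zip(noisy, TARGET_CHUNK)]


def ddim_sample(denoiser, obs, horizon, num_steps, rng):
    """Deterministic DDIM-style sampling from A_K down to A_0 in `num_steps` jumps."""
    x = [rng.gauss(0.0, 1.0) for _ in range(horizon)]  # A_K: pure noise
    steps = [K * (num_steps - i) // num_steps for i in range(num_steps)] + [0]
    for k, k_prev in zip(steps[:-1], steps[1:]):
        eps = denoiser(x, k, cond=obs)
        a, s = math.sqrt(alpha_bar[k]), math.sqrt(1.0 - alpha_bar[k])
        # Estimate the clean chunk, then re-noise it to the lower noise level k_prev.
        x0_pred = [(xi - s * ei) / a for xi, ei in zip(x, eps)]
        a_prev = math.sqrt(alpha_bar[k_prev])
        s_prev = math.sqrt(1.0 - alpha_bar[k_prev])
        x = [a_prev * x0 + s_prev * ei for x0, ei in zip(x0_pred, eps)]
    return x


chunk = ddim_sample(toy_denoiser, obs=None, horizon=4, num_steps=10, rng=random.Random(0))
print([round(v, 4) for v in chunk])
```

With a perfect denoiser the sampler recovers the target chunk even though it visits only 10 of the 1000 steps; with a learned network, fewer steps generally means a coarser approximation.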

Control Loop and Action Chunks

Diffusion Policy is usually operated as follows.

┌──────────────────────────────────────────────────────┐

│ Action-Chunk-Based Control Loop (sketch) │

└──────────────────────────────────────────────────────┘

1) collect observation o_t

2) generate a future H-step action chunk A_0 via diffusion

3) execute a leading portion (e.g., the first few steps)

4) periodically re-observe and return to (1) to replan

receding horizon: plan long, execute short, refresh often

→ balance consistency (long plan) and responsiveness (frequent update)

A longer chunk gives a smoother, more consistent motion but reacts more slowly to changes in the environment; shorter chunks with more frequent replanning are more responsive but make the plan less consistent from one update to the next. This trade-off is tuned to the nature of the task.
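The loop can be sketched on a 1-D toy problem. Here `generate_chunk` is a hypothetical stand-in for the generative policy (it simply steps toward a goal), and the horizon of 8 with 3 executed steps is an arbitrary choice for illustration.

```python
def generate_chunk(position, goal, horizon, max_step=0.1):
    """Hypothetical stand-in for the policy: H bounded steps toward the goal."""
    chunk, p = [], position
    for _ in range(horizon):
        delta = max(-max_step, min(max_step, goal - p))
        chunk.append(delta)
        p += delta
    return chunk


def run_receding_horizon(start, goal, horizon=8, n_execute=3, max_cycles=50, tol=1e-6):
    """Plan `horizon` steps, execute only the first `n_execute`, then replan."""
    position, cycles = start, 0
    while abs(goal - position) > tol and cycles < max_cycles:
        chunk = generate_chunk(position, goal, horizon)  # observe and generate
        for delta in chunk[:n_execute]:                  # execute the leading part
            position += delta
        cycles += 1                                      # loop back and replan
    return position, cycles


final_position, cycles = run_receding_horizon(start=0.0, goal=1.0)
print(final_position, cycles)
```

Raising `n_execute` toward `horizon` means fewer replans per task (more consistency, less responsiveness); lowering it does the opposite.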

π0: High-Frequency Continuous Actions With Flow-Matching

Background and Positioning

π0 (Physical Intelligence) is a model that aims to combine VLA's generalization ability with the smoothness of continuous action generation. At a high level, it couples a web-pretrained vision-language backbone (semantic understanding) with an "action expert" that quickly generates continuous actions. The goal is to generate action not as discrete tokens but as continuous values, and at high frequency.

┌────────────────────────────────────────────────────────────┐

│ π0 Structure (high-level) │

└────────────────────────────────────────────────────────────┘

image · language instruction

│

▼

┌───────────────────┐

│ VLM backbone │ ← web-pretrained semantic understanding

│ (vision-language) │

└─────────┬─────────┘

│ condition (context) representation

▼

┌───────────────────┐

│ action expert │ ← continuous action via flow-matching

│ │ emits high-frequency continuous chunks

└─────────┬─────────┘

▼

continuous action chunk a_{t..t+H-1} ──▶ execute via high-frequency control

What Is Flow-Matching

Flow-matching is a generative technique closely related to diffusion. Intuitively, it learns a continuous vector field (velocity field) that "flows" from a simple distribution (e.g., Gaussian noise) to a target distribution (correct action trajectories). At inference, a noise sample is integrated along this vector field to move it to a sample of the target distribution.

Intuition of flow-matching

noise x_0 ──── integrate along learned velocity field v(x, τ) ────▶ x_1

(simple dist.) τ: 0 → 1 (action dist.)

v(x, τ): tells where point x should flow (velocity) at time τ

ODE integration: solving dx/dτ = v(x, τ) from τ=0 to 1 yields a sample
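One common way to obtain training targets for the velocity field is a straight-line path between a noise sample and a clean chunk. The sketch below assumes that linear path (a rectified-flow-style choice, not necessarily the exact recipe of any specific model); `flow_matching_pair` and `flow_matching_loss` are hypothetical helper names.

```python
import random


def flow_matching_pair(action_chunk, rng):
    """Build one training example on a linear path from noise (tau=0) to data (tau=1)."""
    tau = rng.random()                                    # flow time in [0, 1)
    noise = [rng.gauss(0.0, 1.0) for _ in action_chunk]   # x_0 ~ N(0, I)
    # Point on the straight line between the noise sample and the clean chunk.
    x_tau = [(1.0 - tau) * n + tau * a for n, a in zip(noise, action_chunk)]
    # Along a straight line the velocity is constant: d x_tau / d tau = x_1 - x_0.
    target_velocity = [a - n for n, a in zip(noise, action_chunk)]
    return tau, x_tau, target_velocity


def flow_matching_loss(pred_velocity, target_velocity):
    """Mean squared error between predicted and target velocity."""
    diffs = [(p - t) ** 2 for p, t in zip(pred_velocity, target_velocity)]
    return sum(diffs) / len(diffs)


clean_chunk = [0.0, 0.1, 0.2]
tau, x_tau, v_target = flow_matching_pair(clean_chunk, random.Random(0))
print(tau, x_tau, v_target)
```

A network v(x, τ), conditioned on the observation, would be trained to minimize `flow_matching_loss` against `v_target` at the point `x_tau`.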

If diffusion is a Markov chain that removes noise step by step, flow-matching does the same thing as a continuous-time ODE flow. Designed well, it can quickly generate smooth continuous outputs even with few integration steps, which favors high-frequency control.
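Sampling then amounts to numerically integrating the ODE. The sketch below uses forward Euler and, to stay runnable without a trained model, a closed-form velocity field that is exact when all data equals a single chunk (`TARGET_CHUNK`, hypothetical) on a linear path. Real policies replace `toy_velocity` with a learned network.

```python
import random

TARGET_CHUNK = [0.0, 0.1, 0.2, 0.3]  # hypothetical clean action chunk


def toy_velocity(x, tau):
    """Exact velocity field when all training data equals TARGET_CHUNK (linear path)."""
    return [(c - xi) / (1.0 - tau) for xi, c in zip(x, TARGET_CHUNK)]


def euler_sample(velocity_fn, horizon, num_steps, rng):
    """Integrate dx/dtau = v(x, tau) from tau=0 (noise) to tau=1 with forward Euler."""
    x = [rng.gauss(0.0, 1.0) for _ in range(horizon)]
    dt = 1.0 / num_steps
    for n in range(num_steps):
        tau = n * dt  # velocity is evaluated at the start of each step, so tau < 1
        v = velocity_fn(x, tau)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


sample = euler_sample(toy_velocity, horizon=4, num_steps=5, rng=random.Random(0))
print([round(v, 4) for v in sample])
```

Because the paths in this toy case are straight lines, five Euler steps already land on the target; learned fields are generally curved, so the step count becomes a real quality knob.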

High-Frequency Control and the Action Expert

What π0 emphasizes is "high-frequency continuous action." Precise, fast manipulation (e.g., folding clothes, gently inserting an object) benefits from smooth action updates on the order of tens of times per second. Generating discrete tokens one at a time autoregressively struggles to reach such frequencies. A flow-matching-based action expert generates a continuous action chunk at once, so it can run a fast control loop while preserving the semantic understanding of the VLM backbone.

Control-frequency intuition (conceptual)

discrete-token autoregression: token → token → token ... (one dim at a time, slow)

hard to react fast

continuous chunk generation: [a_t, a_{t+1}, ..., a_{t+H-1}] at once

smooth, fast updates possible

* Exact control frequency can vary by hardware, implementation, and task.

Discrete Tokens vs. Continuous Actions

| Item | Discrete action tokens (RT-2, OpenVLA) | Continuous action generation (Diffusion Policy, π0) |

| --- | --- | --- |

| Action repr. | Quantized integer tokens | Continuous-valued trajectory/chunk |

| Generation | Autoregressive next-token prediction | Denoising / flow-matching |

| Multimodality | Distribution via token probabilities | Sample directly from distribution |

| Smoothness | Quantization error possible | Favors smooth continuous output |

| Control frequency | Can slow due to sequential tokens | Chunk-at-once favors high frequency |

| LM reuse | Integrated into vocabulary, reused as is | Separate action head/expert designed |

| Key strength | Simplicity, semantic generalization | Precision, smoothness, responsiveness |

Details in the table can vary by implementation and version. The two approaches are not mutually exclusive; the compromise of layering continuous action generation on top of a VLM's semantic understanding (π0's direction) is being actively explored.

Deeper: Noise Schedules and Sampling

What a Noise Schedule Is

In a diffusion model, a "denoising step" is not a vague stage but is precisely defined by a schedule that determines how much noise is added and removed. The schedule sets the ratio of signal to noise at each step k. Early steps (large k) are almost pure noise, and late steps (k near 0) are almost clean action.

Intuition of a noise schedule

k: K (much noise) ───────────────▶ 0 (no noise)

│ │

signal ratio: low ─────────────────▶ high

noise ratio: high ────────────────▶ low

at each denoising step the network predicts "this step's noise" and removes it

→ over several steps, pure noise converges to clean action

The shape of the schedule (linear, cosine, etc.) affects training stability and sample quality. In action generation, the output is small compared with an image (for example, about 7 dimensions per step for a single arm with a gripper, over a short chunk), so fewer steps than in image generation are often enough.
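To see how schedule shape changes the signal-to-noise mix, the sketch below computes the cumulative signal level for a linear and a cosine schedule. The constants (betas from 1e-4 to 0.02, offset s = 0.008, 1000 steps) are commonly used defaults from the image-diffusion literature and are assumptions here, not values prescribed for robot policies; the cosine version also omits the usual clipping of per-step betas.

```python
import math


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level alpha_bar[k] for a linear beta schedule (alpha_bar[0] = 1)."""
    alpha_bar = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        alpha_bar.append(alpha_bar[-1] * (1.0 - beta))
    return alpha_bar


def cosine_alpha_bar(num_steps, s=0.008):
    """Cumulative signal level for a cosine schedule, normalized so alpha_bar[0] = 1."""
    f = [
        math.cos((k / num_steps + s) / (1 + s) * math.pi / 2) ** 2
        for k in range(num_steps + 1)
    ]
    return [value / f[0] for value in f]


schedules = [("linear", linear_alpha_bar(1000)), ("cosine", cosine_alpha_bar(1000))]
for name, schedule in schedules:
    for k in (0, 500, 1000):
        signal = math.sqrt(schedule[k])        # weight on the clean action
        noise = math.sqrt(1.0 - schedule[k])   # weight on the Gaussian noise
        print(f"{name:6s} k={k:4d} signal={signal:.3f} noise={noise:.3f}")
```

Both schedules start at pure signal and end near pure noise; they differ in how quickly the signal decays in between.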

Inference Steps and the Speed-Quality Trade-off

At inference, the number of denoising steps is the key knob for trading off speed against quality.

Trade-off by step count (concept)

many steps (e.g., tens to hundreds)

+ more accurate action distribution

- slow inference → lower control frequency

few steps (e.g., single digits)

+ fast inference → high-frequency control possible

- distribution approximation may be coarse

DDIM-style deterministic samplers or distillation are

actively studied to keep quality even with few steps

In robot control, fast updates matter, so whether a sampler can produce good actions in few steps largely determines practicality. This is the same concern flow-matching targets: designed well, it can produce smooth continuous output with few integration steps, which favors high-frequency control.
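The step-count trade-off can be illustrated with a toy ODE whose exact solution is known. The field dx/dτ = -x is chosen only because it is curved and solvable by hand; it is not a learned policy field, and the specific error values carry no meaning for real robots.

```python
import math


def euler_integrate(velocity_fn, x0, num_steps):
    """Forward-Euler integration of dx/dtau = v(x, tau) from tau=0 to tau=1."""
    x, dt = x0, 1.0 / num_steps
    for n in range(num_steps):
        x += dt * velocity_fn(x, n * dt)
    return x


def curved_velocity(x, tau):
    """Toy field with a known solution: dx/dtau = -x gives x(1) = x0 * exp(-1)."""
    return -x


exact = 1.0 * math.exp(-1.0)
errors = {
    n: abs(euler_integrate(curved_velocity, 1.0, n) - exact)
    for n in (2, 5, 10, 50)
}
for n, err in errors.items():
    print(f"{n:3d} steps -> error {err:.4f}")
```

Each network evaluation costs inference time, so the practical question is how few steps keep the error acceptable for the task.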

The Relationship Between Diffusion and Flow-Matching

Diffusion and flow-matching are connected at a deep level. Both learn a transformation from a simple distribution to a complex target distribution. Diffusion approaches the same goal with a discrete denoising chain, and flow-matching with the velocity field of a continuous-time ODE.

Correspondence of the two generative paradigms (concept)

diffusion (DDPM): discrete Markov chain, step-wise noise removal

deterministic (DDIM): viewing diffusion as an ODE, deterministic integration

flow-matching: learn the ODE velocity field directly from the start

in common: learn a "flow" from noise to the target distribution

difference: chain length, determinism, form of the training objective

In practice, the selection criterion is "which method produces smooth, accurate action fast with few steps." Diffusion Policy offers stability demonstrated in precise manipulation, while the π0 family emphasizes high-frequency continuous control and integration with a VLM.

A Practical View: When to Use What

Choosing by Task Character

Decision guide (concept, not an absolute rule)

┌─ precise, multimodal manipulation, data-rich

│ → Diffusion Policy family is stable

│

├─ need high-frequency continuous control (fast hands, smoothness)

│ → flow-matching-based (π0 direction)

│

├─ strong semantic generalization / new-instruction handling is top priority

│ → VLM-backbone VLA (discrete tokens or continuous combination)

│

└─ simple, repetitive tasks, fast prototyping

→ start with a light policy and scale up if needed

Treat this guide as a starting point. In practice, you decide by weighing data volume, hardware constraints, control-frequency demands, and safety requirements together.

Common Mistakes in Training

- **Insufficient data diversity**: Using only limited demos overfits the model to specific layouts and lighting. Include diverse initial conditions and distractors.

- **Mis-set action-chunk length**: Too long is insensitive to environment change; too short lacks consistency. Tune it to the task's nature.

- **Observation sync errors**: If observation and action timestamps drift apart, the model is trained on mismatched observation-action pairs and learns the wrong cause-effect relationship.

- **Untuned sampler**: Without adjusting inference step count and sampler, action comes out slow or coarse.
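For the synchronization issue in particular, a simple guard is to pair each action with the latest observation that precedes it and drop pairs whose gap is too large. The sketch below assumes sorted timestamps in seconds; the function name and the 50 ms tolerance are hypothetical.

```python
from bisect import bisect_right


def align_actions_to_observations(obs_times, action_times, max_gap):
    """Pair each action with the latest observation at or before it.

    Returns (obs_index, action_index) pairs. Actions with no observation within
    `max_gap` seconds before them are dropped. Both lists must be sorted.
    """
    pairs = []
    for a_idx, t_action in enumerate(action_times):
        o_idx = bisect_right(obs_times, t_action) - 1  # latest observation not after the action
        if o_idx >= 0 and t_action - obs_times[o_idx] <= max_gap:
            pairs.append((o_idx, a_idx))
    return pairs


obs_times = [0.00, 0.10, 0.20, 0.30]
action_times = [0.02, 0.12, 0.29, 0.55]
aligned = align_actions_to_observations(obs_times, action_times, max_gap=0.05)
print(aligned)
```

Counting how many actions get dropped is a quick health check on a dataset: a high drop rate usually points to clock drift or a logging gap.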

Training-inference checklist (concept)

[ ] Are the demos' initial conditions / objects / backgrounds diverse enough?

[ ] Is the observation ↔ action timestamp alignment accurate?

[ ] Does the action-chunk length match the task cycle?

[ ] Do the inference steps / sampler satisfy the speed-quality balance?

[ ] Are safety limits (velocity, torque, workspace) enforced at the low-level controller?

Such checks affect actual success rate as much as the choice of model architecture. The benefit of continuous action generation shows only when the data and pipeline are solid.
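As one example of a pipeline-level safeguard, the sketch below clamps a commanded 1-D position by a speed limit and a workspace bound before it is sent on. All limits are hypothetical placeholders, torque limits are omitted, and a real system would enforce such limits in the robot's own controller rather than only in Python.

```python
def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def apply_safety_limits(current_pos, target_pos, dt, max_speed, workspace):
    """Limit a commanded 1-D position by speed and workspace bounds.

    The limits are illustrative; real values come from the robot's specification.
    """
    max_delta = max_speed * dt
    delta = clamp(target_pos - current_pos, -max_delta, max_delta)   # velocity limit
    return clamp(current_pos + delta, workspace[0], workspace[1])    # workspace limit


# The policy asks for a large jump; the filter allows only 0.5 m/s * 0.02 s = 0.01 m.
safe_cmd = apply_safety_limits(0.0, 1.0, dt=0.02, max_speed=0.5, workspace=(-0.3, 0.3))
print(safe_cmd)
```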

Strengths and Limitations

Diffusion Policy

- Strengths: Naturally handles multimodal behavior and is stable in precise manipulation. Action chunks secure temporal consistency.

- Limitations: Many denoising steps can slow inference (mitigated by sampler choice), and performance hinges heavily on the quality and diversity of demonstration data.

π0 / flow-matching family

- Strengths: Aims to combine VLM generalization with the smoothness and high frequency of continuous action. A fitting direction for precise tasks that need fast control.

- Limitations: The system is complex, and training/tuning are demanding. Performance and exact specs can vary by implementation, data, and hardware, so reported numbers may reproduce differently across environments.

Common Caveats

- Real-robot evaluation is hard to reproduce and sensitive to environment.

- Safety (out-of-distribution situations, collision avoidance) remains a central challenge.

- Data collection cost is a common bottleneck for all approaches.

Viewing It as a Task Lifecycle

To make the abstract concrete, let us follow how the task "pick up the cup on the table and place it on the adjacent plate" flows through a continuous-action-generation policy.

┌──────────────────────────────────────────────────────────┐

│ Control loop of the cup-move task (timeline) │

└──────────────────────────────────────────────────────────┘

t0 observe: locate cup, plate, hand via camera

│

▼

t0 generate: future H-step action chunk A_0 = [a_t .. a_{t+H-1}]

│ (continuous trajectory via denoising or flow-matching)

▼

t0~ execute: send the leading part (a few steps) to the robot

│ hand approaches the cup smoothly

▼

t1 re-observe: detect the cup slipping slightly

│

▼

t1 re-generate: update the chunk with the new observation → correct path

│

▼

... repeat: grasp → lift → move → place down

│

▼

end: open gripper and judge task complete

Here the benefit of continuous action generation appears. Even if an unexpected change like the cup slipping occurs, the policy rebuilds the chunk at the next observation and corrects smoothly. Updates can be faster and trajectories smoother than when generating discrete tokens one at a time.

The Robot Policy as a Real-Time System

The Latency Budget

Robot control is a real-time system. The total latency from observation to action execution governs task stability. Breaking latency down by item:

Decomposing the latency budget (concept)

sensor capture ──▶ preprocessing (image resize, etc.)

│ │

▼ ▼

model inference (gen) ──▶ postprocessing (un-normalize or de-tokenize, safety filter)

│ │

▼ ▼

send to controller ──▶ joint actuation

total latency = sum of the above stages

→ if any one stage is slow, the whole control frequency drops

Generating a continuous action chunk at once lets you prepare the next chunk while consuming the current one, so inference latency is amortized over several control steps instead of being paid at every step. This is one of the practical techniques that make high-frequency control possible.
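A back-of-the-envelope check makes the budget concrete: sum the stage latencies and compare the total with the wall-clock time one chunk covers. The stage names and millisecond values below are invented for illustration; measure your own pipeline before drawing conclusions.

```python
def control_budget(stage_ms, control_hz, steps_per_chunk):
    """Sum per-stage latencies and compare them with the time one chunk covers."""
    total_ms = sum(stage_ms.values())
    chunk_ms = 1000.0 * steps_per_chunk / control_hz  # time spent executing one chunk
    return {
        "total_latency_ms": total_ms,
        "chunk_duration_ms": chunk_ms,
        # True if the next chunk can be ready before the current one runs out.
        "keeps_up": total_ms <= chunk_ms,
    }


# Hypothetical numbers for illustration only.
stages = {"capture": 10.0, "preprocess": 5.0, "inference": 80.0, "postprocess": 2.0, "send": 3.0}
budget = control_budget(stages, control_hz=50, steps_per_chunk=10)
print(budget)
```

With these made-up numbers, 100 ms of pipeline latency fits inside the 200 ms that ten steps at 50 Hz take to execute; executing only four steps per chunk would not leave enough time.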

Asynchronous Inference and Chunk Consumption

Asynchronous pipeline (concept)

generation thread: [generate chunk A] [generate chunk B]

│ │

▼ ▼

execution thread: ▮▮▮▮▮▮▮▮ (consume A) ▮▮▮▮▮▮▮▮ (consume B)

→ generate B ahead while executing A → seamless control

* Exact timing can vary by hardware and implementation.

Such an asynchronous structure helps keep the control loop smooth even while using a heavy generative model. But if generation is too slow, execution finishes consuming the chunk and waits for the next one, causing stutter, so balancing generation speed and chunk length is important.
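The producer-consumer structure can be sketched with a bounded queue between two threads. This is a simplified illustration: `generation_worker` is a hypothetical stand-in that emits placeholder actions instead of running a model, and it ignores fresh observations, which a real generator would condition on.

```python
import queue
import threading


def generation_worker(chunk_queue, num_chunks, chunk_len):
    """Producer: stands in for the generative model, emitting numbered action chunks."""
    for c in range(num_chunks):
        chunk = [(c, i) for i in range(chunk_len)]  # placeholder "actions"
        chunk_queue.put(chunk)                       # blocks while the queue is full
    chunk_queue.put(None)                            # sentinel: no more chunks


def execution_loop(chunk_queue, executed):
    """Consumer: executes each chunk; time spent waiting in get() is where stutter appears."""
    while True:
        chunk = chunk_queue.get()
        if chunk is None:
            break
        executed.extend(chunk)


chunk_queue = queue.Queue(maxsize=1)  # at most one chunk prepared ahead
executed = []
producer = threading.Thread(target=generation_worker, args=(chunk_queue, 3, 4))
producer.start()
execution_loop(chunk_queue, executed)
producer.join()
print(len(executed), executed[0], executed[-1])
```

The `maxsize=1` bound keeps the generator from running far ahead on stale observations, at the cost of less slack if one generation call is slow.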

Sample Efficiency and Data Diversity

Continuous action generation models are highly expressive, but that means they need enough demonstrations to learn the distribution properly. There are several practical ways to raise sample efficiency.

Strategies to raise sample efficiency (concept)

┌─ leverage pretraining: reuse the VLM backbone's semantic representation

├─ data augmentation: extend demos via color/crop/viewpoint changes

├─ multi-task learning: train several tasks together to induce transfer

└─ efficient fine-tuning: adapt with modest resources via LoRA, etc.

→ directions that supplement the limit of expensive real demos

In particular, reusing the VLM backbone's pretrained representation lets the model learn action while preserving semantic generalization with little robot data. This aligns with the intuition of co-fine-tuning seen in the previous post. Ultimately, the success of continuous action generation depends not only on a good generative technique but on the combination of diverse, well-aligned data and an efficient adaptation strategy.

Conclusion

Discrete action tokenization is an elegant way to bring the power of language models directly to robots, but it pays a price in quantization error and control frequency. Diffusion Policy generates action by denoising to address multimodality and smoothness, while π0 makes continuous action generation fast with flow-matching and combines it with the VLM's semantic understanding.

The big trend is "treating action as generation, not classification." The more a real-world task needs precise, smooth, fast hand movement, the clearer the benefit of continuous action generation becomes. The next step is extending these action generators to larger backbones, more diverse data, and more complex embodiments (humanoids).

References

- Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, arXiv: [2303.04137](https://arxiv.org/abs/2303.04137)

- π0: A Vision-Language-Action Flow Model for General Robot Control, arXiv: [2410.24164](https://arxiv.org/abs/2410.24164)

- Flow Matching for Generative Modeling, arXiv: [2210.02747](https://arxiv.org/abs/2210.02747)

- Denoising Diffusion Probabilistic Models (DDPM), arXiv: [2006.11239](https://arxiv.org/abs/2006.11239)

- Denoising Diffusion Implicit Models (DDIM), arXiv: [2010.02502](https://arxiv.org/abs/2010.02502)

- OpenVLA: An Open-Source Vision-Language-Action Model, arXiv: [2406.09246](https://arxiv.org/abs/2406.09246)

- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, arXiv: [2307.15818](https://arxiv.org/abs/2307.15818)

- Physical Intelligence blog: [physicalintelligence.company](https://www.physicalintelligence.company/)