Chaos and Order


Introduction

People picture the outcome in their heads before acting. Will the cup tip over if I push it? Will the door open if I pull it? We hold an intuitive model of how the world will react. Thanks to this internal model, we can rule out dangerous actions by imagination alone, without actually performing them.

Endowing robots with this ability is the aim of world models. A world model is a predictive model that learns the dynamics of the environment, namely "if I take this action in the current state, what will the next state be." With such a model, instead of costly trial and error in the real environment, a robot can plan by imagining the future inside the learned model.

This article covers the concept of world models, model-based reinforcement learning, prediction in latent space, the role of video prediction and generative models, and planning via imagined rollouts and MPC. Finally, we look at applications to locomotion and manipulation and the current limits. To stay accurate, we name specific models and versions only where they are well established and speak in general terms where details are uncertain.

What Is a World Model

The core of a world model is prediction. It learns a function that takes a state and action and predicts the next state (and often the reward).

Basic Structure of a World Model

─────────────────────────────────

current state s ──┐
                  ├──▶ [ world model ] ──▶ predicted next state ŝ'
action a ─────────┘                    ──▶ predicted reward r̂

Applying this model repeatedly lets us roll the future forward

several steps by "imagination":

s ──a1──▶ ŝ1 ──a2──▶ ŝ2 ──a3──▶ ŝ3 ── ... (imagined rollout)
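
The rollout in the diagram can be sketched in a few lines. This is a minimal illustration, not a real learned model: `toy_world_model`, its 1-D dynamics, and its reward are hypothetical stand-ins chosen only so the loop has something to call.

```python
from typing import Callable, List, Tuple


def toy_world_model(state: float, action: float) -> Tuple[float, float]:
    """Hypothetical stand-in for a learned model: returns (next_state, reward)."""
    next_state = state + 0.1 * action   # assumed 1-D dynamics
    reward = -abs(next_state - 1.0)     # assumed reward: closer to 1.0 is better
    return next_state, reward


def imagined_rollout(
    model: Callable[[float, float], Tuple[float, float]],
    state: float,
    actions: List[float],
) -> Tuple[List[float], float]:
    """Apply the model repeatedly; return the predicted states and summed reward."""
    predicted_states: List[float] = []
    total_reward = 0.0
    for action in actions:
        # Each prediction is fed back in as the input of the next step.
        state, reward = model(state, action)
        predicted_states.append(state)
        total_reward += reward
    return predicted_states, total_reward


states, total = imagined_rollout(toy_world_model, 0.0, [1.0, 1.0, 1.0])
print(states, total)
```

Note that no real environment is touched inside the loop: every state after the first is a prediction, which is why errors in the model carry forward from step to step.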

Contrasting this with model-free reinforcement learning makes the difference clear. Model-free methods learn a policy or value function only from experience of direct interaction with the environment. Model-based methods, by contrast, first learn a dynamics model of the environment, then use that model to plan or to generate additional experience in imagination.

Model-free vs Model-based

──────────────────────────

[Model-free]

env interaction ──▶ experience ──▶ learn policy/value directly

(needs lots of experience, but simple)

[Model-based]

env interaction ──▶ experience ──▶ learn world model

│

▼

imagine/plan inside model ──▶ improve policy

(data-efficient, but vulnerable to model error)

The greatest appeal of the model-based approach is data efficiency. Once a world model is learned, countless virtual experiences can be generated cheaply inside it without moving the real robot.

The Lineage of the World Model Idea

The concept of a world model did not appear out of nowhere; it stands on a long research lineage. Tracing its roots along a few strands helps in understanding the field.

The first root is control theory. Building a dynamics model of a system and computing optimal control on top of it has long been a foundation of robotics. In classical control, however, a human builds the model by hand, whereas modern world models learn it from data.

The second root is the "internal model" concept from psychology and cognitive science. The view that humans and animals hold an internal representation of the world and predict the future provided motivation for making robots learn predictive representations.

The third root is neural-network sequence prediction. Recurrent neural networks and later various generative models provided tools for learning from data the ability to predict what comes next from a sequence of observations.

The Three Roots of the World Model Idea

─────────────────────────────────────────

control theory      cognitive science      NN sequence prediction
(dynamics model)    (internal model)    (learn prediction from data)
       │                    │                         │
       └────────────────────┼─────────────────────────┘
                            ▼
               modern learned world model
     (learn dynamics from data, plan by imagination)

Ha and Schmidhuber's 2018 paper "World Models" impressively demonstrated the idea of compressing an environment with a neural network and learning a policy inside the compressed model, which popularized this line of work. Since then, various works, including the Dreamer family, have refined imagination-based learning in latent space.

Prediction in Latent Space

Early world models tried to predict the next image at the pixel level. But predicting a high-resolution image pixel by pixel is computationally heavy, and forcing the model to reproduce every visual detail, including details irrelevant to the task, tends to crowd out the dynamics that actually matter.

The solution is latent-space prediction. First an encoder converts the high-dimensional observation (such as an image) into a low-dimensional compressed representation (the latent state). Then dynamics prediction happens within this compressed latent space. The future is rolled forward in a light abstract space rather than the heavy pixel space.

Latent-Space World Model

─────────────────────────

observation o ──[ encoder ]──▶ latent state z

│

│ action a

▼

[ latent dynamics ]──▶ next latent z'

│

▼

(if needed) [ decoder ]──▶ predicted observation ô'

Key: instead of heavy pixels, quickly imagine multiple future

steps in the light latent z space.
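
The encoder, latent dynamics, and decoder pipeline can be sketched as follows. In a real system all three parts are learned networks; here they are hand-written, hypothetical stand-ins on a 1-D "image" with one bright pixel, chosen only to show where each part sits and that the rollout never touches pixels.

```python
from typing import List


def encode(obs: List[float]) -> float:
    """Compress a 1-D 'image' to one latent number: the intensity-weighted centroid.
    Assumes the image contains at least one non-zero pixel."""
    total = sum(obs)
    return sum(i * v for i, v in enumerate(obs)) / total


def latent_dynamics(z: float, action: float) -> float:
    """Assumed latent dynamics: the action shifts the latent position."""
    return z + action


def decode(z: float, width: int) -> List[float]:
    """Render the latent back to an image (only needed for inspection)."""
    idx = min(max(round(z), 0), width - 1)
    return [1.0 if i == idx else 0.0 for i in range(width)]


def imagine_latent(obs: List[float], actions: List[float]) -> List[float]:
    """Encode once, then roll forward entirely in latent space."""
    z = encode(obs)
    trajectory = []
    for a in actions:
        z = latent_dynamics(z, a)
        trajectory.append(z)
    return trajectory


obs = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
latents = imagine_latent(obs, [1.0, 1.0, 2.0])
print(latents, decode(latents[-1], len(obs)))
```

The rollout costs one number per step instead of one image per step, which is the efficiency argument in miniature. The decoder is called only once at the end, matching the "(if needed)" in the diagram.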

This approach, represented by the Dreamer family, is known for greatly improving data efficiency by learning a policy through imagined rollouts in latent space. The detailed architecture may vary by version, so it is best to check the original paper for specifics.

Video Prediction and Generative Models

Recently, large-scale video prediction and generation models have drawn attention as world models for robots. The idea is that a model trained on vast video data to predict "what happens next" comes to embody rich physical common sense about how objects move and interact. Conditioning this predictive ability on robot actions turns it into a tool for imagining how future video unfolds depending on what the robot does.

Action-Conditioned Video Prediction

─────────────────────────────────────

current frames ──┐

├──▶ [ video prediction model ]──▶ predicted future frames

candidate action ┘

sequences

Predict the future for multiple candidate actions and

select the action sequence that best fits the goal.

The advantage of these generative world models is that complex interactions can be learned from data without a human hand-coding the physics. The limits, however, are that error accumulates as prediction lengthens, and physical laws are not always obeyed exactly.

Planning in Imagination — MPC and Rollouts

The true value of a world model shows in planning. A representative method is Model Predictive Control (MPC).

The MPC procedure is as follows. From the current state, simulate several candidate action sequences by looking ahead with the world model. Evaluate the future each sequence brings and its reward. Pick the best sequence and actually execute only its first action. After executing one step, repeat the whole process from scratch in the new state.

MPC Planning Loop (looking ahead by imagination)

─────────────────────────────────────────────────

[1] Generate several candidate action sequences from current state

candidate A: a1 a2 a3 ...

candidate B: a1'a2'a3'...

candidate C: ...

│

▼

[2] Imagined rollout of each candidate with the world model

s ──▶ ŝ1 ──▶ ŝ2 ──▶ ŝ3 (per candidate)

│

▼

[3] Select the candidate with the highest predicted reward

│

▼

[4] Execute only the first action of that candidate

│

└──▶ back to [1] in the new state (repeat)

Because MPC re-plans every step, even if a prediction is slightly off, the trajectory can be corrected with real observations at the next step. This re-planning property buffers the world model's prediction error to some degree.
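
The four-step loop can be sketched with random-shooting MPC, one simple way to generate candidates (other samplers or optimizers could be used instead). Everything here is a toy assumption: `world_model`, `real_env_step`, the goal at 1.0, and the numeric constants are hypothetical. The model is deliberately slightly wrong (0.10 vs 0.12) to show that re-planning from the real state still reaches the goal.

```python
import random
from typing import Tuple

GOAL = 1.0  # assumed target position for this toy task


def world_model(state: float, action: float) -> Tuple[float, float]:
    """Hypothetical learned model: predicts (next_state, reward)."""
    next_state = state + 0.10 * action
    return next_state, -(next_state - GOAL) ** 2


def real_env_step(state: float, action: float) -> float:
    """Toy 'real' dynamics, deliberately different from the model."""
    return state + 0.12 * action


def plan_first_action(state: float, rng: random.Random,
                      horizon: int = 5, n_candidates: int = 100) -> float:
    best_return, best_first = float("-inf"), 0.0
    for _ in range(n_candidates):
        # [1] sample one candidate action sequence
        candidate = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        # [2] imagined rollout with the world model
        s, imagined_return = state, 0.0
        for a in candidate:
            s, r = world_model(s, a)
            imagined_return += r
        # [3] keep the candidate with the highest predicted return
        if imagined_return > best_return:
            best_return, best_first = imagined_return, candidate[0]
    # [4] only the first action of the best candidate is executed
    return best_first


def run_mpc(steps: int = 60, seed: int = 0) -> float:
    rng = random.Random(seed)
    state = 0.0
    for _ in range(steps):
        action = plan_first_action(state, rng)
        # The real observation replaces the prediction before re-planning.
        state = real_env_step(state, action)
    return state


print(run_mpc())
```

Because each plan starts from the observed state rather than from the previous prediction, the mismatch between model and reality is not carried over from one planning step to the next.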

Another approach is to learn a policy directly through imagined rollouts. Instead of the real environment, generate countless virtual episodes inside the world model and run reinforcement learning within them to improve the policy. This secures a large amount of training experience without wearing out the real robot.
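
A minimal sketch of improving a policy purely inside the model is shown below. To keep it short, a simple random-search hill climb over a single policy parameter is swapped in for a full reinforcement learning algorithm; the toy `world_model`, the proportional policy, and all constants are hypothetical.

```python
import random
from typing import Tuple


def world_model(state: float, action: float) -> Tuple[float, float]:
    """Hypothetical learned model: predicts (next_state, reward), goal at 1.0."""
    next_state = state + 0.1 * action
    return next_state, -(next_state - 1.0) ** 2


def imagined_return(gain: float, start: float = 0.0, horizon: int = 20) -> float:
    """Run one virtual episode with the policy a = clip(gain * (goal - s))."""
    s, total = start, 0.0
    for _ in range(horizon):
        action = max(-1.0, min(1.0, gain * (1.0 - s)))
        s, r = world_model(s, action)
        total += r
    return total


def improve_policy(iterations: int = 200, seed: int = 0) -> Tuple[float, float]:
    """Hill-climb the policy gain using imagined episodes only."""
    rng = random.Random(seed)
    gain = 0.0
    best = imagined_return(gain)
    for _ in range(iterations):
        candidate = gain + rng.gauss(0.0, 0.5)   # perturb the policy parameter
        ret = imagined_return(candidate)         # evaluate inside the model
        if ret > best:                           # keep only improvements
            gain, best = candidate, ret
    return gain, best


print(improve_policy())
```

All 200 evaluation episodes run inside `world_model`; the real robot is never moved. The flip side is that the resulting policy is only as good as the model it was tuned against.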

How Is a World Model Trained

The world model itself is learned from data. The raw material is the trajectories the robot leaves behind as it interacts with the environment: records of which action was taken in which state, and which next state and reward followed.

The learning objective is generally split into three losses.

- Reconstruction/prediction loss: Reconstructs the observation from the encoded latent state, or predicts the next observation. This makes the latent representation embody the world's information.

- Dynamics prediction loss: Accurately predicts the next latent state from the current latent state and action. This is the core of the world model.

- Reward prediction loss: Predicts the reward from the latent state. This is used to evaluate which future is good during planning.

The Three Losses of World Model Training

─────────────────────────────────────────

obs o ──[encoder]──▶ z ──┬──[decoder]──▶ ô    (reconstruction/prediction loss)
                         │
                action a │
                         ▼
                    [dynamics]──▶ ẑ'    (dynamics prediction loss)
                         │
                         ▼
                   [reward pred]──▶ r̂    (reward prediction loss)

Minimize the three losses together to learn a latent

representation that is predictable and also embeds reward.
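
As a rough sketch, the combined objective can be written as a weighted sum of three terms. Plain squared error and equal weights are assumptions made here for brevity; published models often use likelihood-based or KL-based terms instead, and the function and argument names are hypothetical.

```python
from typing import Dict, List, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def world_model_loss(
    obs: List[float], reconstructed_obs: List[float],
    next_latent: List[float], predicted_next_latent: List[float],
    reward: float, predicted_reward: float,
    weights: Sequence[float] = (1.0, 1.0, 1.0),   # assumed equal weighting
) -> Dict[str, float]:
    reconstruction = mse(reconstructed_obs, obs)       # decoder output vs observation
    dynamics = mse(predicted_next_latent, next_latent) # predicted vs encoded next latent
    reward_loss = (predicted_reward - reward) ** 2     # predicted vs observed reward
    total = (weights[0] * reconstruction
             + weights[1] * dynamics
             + weights[2] * reward_loss)
    return {"reconstruction": reconstruction, "dynamics": dynamics,
            "reward": reward_loss, "total": total}


losses = world_model_loss(
    obs=[1.0, 0.0], reconstructed_obs=[0.5, 0.0],
    next_latent=[2.0], predicted_next_latent=[1.0],
    reward=1.0, predicted_reward=0.0,
)
print(losses)
```

In training, `total` would be minimized with respect to the encoder, decoder, dynamics, and reward predictor parameters together, so that the latent is shaped by all three signals at once.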

An important design element here is the recurrent structure. A robot's observations are often partial observations. From a single frame you cannot know an object's velocity or its occluded parts. So many world models maintain a recurrent state that summarizes the flow of the past, accumulating information over time. A structure that uses this recurrent state together with stochastic latent variables is widely employed.

Handling Prediction Uncertainty

A world model's predictions can be wrong, and are more often wrong in regions where training data is scarce. The problem is that the policy can blindly trust these wrong predictions and learn dangerous actions that look good only inside the model.

To mitigate this, there are approaches that explicitly handle prediction uncertainty.

- Ensembles: Train several world models together and judge situations where their predictions diverge greatly from one another as "uncertain."

- Conservative planning: In regions of high uncertainty, do not trust optimistic rewards and bias the plan toward the safe side.

Detecting Uncertainty with an Ensemble

────────────────────────────────────────

For the same (state, action):

model1 ──▶ prediction A

model2 ──▶ prediction A' predictions similar ──▶ trustworthy

model3 ──▶ prediction A''

model1 ──▶ prediction B

model2 ──▶ prediction X predictions diverge ──▶ uncertain (caution)

model3 ──▶ prediction Y

Handling uncertainty is key to making a world model actually trustworthy. A model that "knows what it does not know" will not make reckless plans in regions where it is not confident.
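
Both ideas, ensemble disagreement and conservative planning, can be sketched together. The ensemble members below are hypothetical hand-written functions that agree inside an assumed training region (|state| <= 1) and drift apart outside it; using the standard deviation as the disagreement measure and a linear penalty are also assumptions, not the only options.

```python
import statistics
from typing import Callable, List, Tuple

Model = Callable[[float, float], float]


def make_model(extrapolation_bias: float) -> Model:
    """Hypothetical ensemble member: identical to its peers for |state| <= 1,
    but drifting by its own bias outside that region."""
    def predict(state: float, action: float) -> float:
        outside = max(0.0, abs(state) - 1.0)
        return state + action + extrapolation_bias * outside
    return predict


def ensemble_predict(models: List[Model], state: float,
                     action: float) -> Tuple[float, float]:
    """Return the mean prediction and the disagreement (population std)."""
    predictions = [m(state, action) for m in models]
    return statistics.fmean(predictions), statistics.pstdev(predictions)


def conservative_reward(predicted_reward: float, disagreement: float,
                        penalty: float = 1.0) -> float:
    """Discount optimistic rewards where the ensemble disagrees."""
    return predicted_reward - penalty * disagreement


models = [make_model(b) for b in (-0.5, 0.0, 0.5)]
_, familiar = ensemble_predict(models, 0.5, 0.1)   # inside the assumed training region
_, novel = ensemble_predict(models, 3.0, 0.1)      # far outside it
print(familiar, novel)
```

A planner that scores candidates with `conservative_reward` instead of the raw predicted reward is steered away from regions where the members disagree, which is one way to counter the over-trust described above.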

World Model vs Explicit Simulator

A learned world model and the explicit simulator used in sim-to-real are alike in that both are "tools for predicting the future," but they differ fundamentally in how they are built.

| Item | Explicit simulator | Learned world model |
| --- | --- | --- |
| How it is built | Human codes the physics | Learned from data |
| Source of accuracy | Physical laws, parameters | Observed experience |
| New objects/phenomena | Human must model them | Auto-reflected if in the data |
| Main weakness | Reality gap, modeling effort | OOD fragility, prediction error |
| Use without data | Possible (pre-built) | Not possible (needs experience) |

The two approaches are complementary rather than opposed. An explicit simulator provides prior knowledge cheaply and in bulk, while a learned world model absorbs from data the real-world complexity a human could not model. In practice, a combination is possible: rough learning with the simulator, then correcting the world model with real experience.
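
One possible form of such a combination is residual correction: keep the hand-coded simulator and fit only the part it gets wrong from real transitions. This is a sketch of that specific technique, not necessarily the combination the text has in mind; the simulator, the single-parameter residual `k * action`, and the data are all hypothetical.

```python
from typing import List, Tuple

Transition = Tuple[float, float, float]  # (state, action, real_next_state)


def simulator(state: float, action: float) -> float:
    """Hand-coded physics (assumed): slightly underestimates the actuator gain."""
    return state + 0.10 * action


def fit_residual_gain(transitions: List[Transition]) -> float:
    """Least-squares fit of residual = k * action, where
    residual = real_next_state - simulator(state, action).
    Assumes at least one transition has a non-zero action."""
    num = sum(a * (s_next - simulator(s, a)) for s, a, s_next in transitions)
    den = sum(a * a for _, a, _ in transitions)
    return num / den


def corrected_model(state: float, action: float, k: float) -> float:
    """Simulator prediction plus the learned correction."""
    return simulator(state, action) + k * action


# Real experience generated by the 'true' dynamics s' = s + 0.12 * a.
real_data = [(0.0, -1.0, -0.12), (1.0, 0.5, 1.06), (2.0, 1.0, 2.12)]
k = fit_residual_gain(real_data)
print(k, corrected_model(0.0, 1.0, k))
```

The simulator supplies the bulk of the prediction for free, and only a small correction has to be learned from scarce real data.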

Application to Robots

World models apply to a variety of robot tasks.

- Legged locomotion: A model that predicts the dynamics of terrain and contact can help filter out dangerous footsteps by imagination before losing balance.

- Manipulation: Predicting the outcome of pushing or grasping an object lets you compare several grasp strategies by imagination before actually trying them.

- Navigation: Predicting future observations along a path lets you evaluate obstacle-avoiding routes in advance.

- Tool use: In tasks that transmit force to an object through a tool, you can preview the tool-tip interaction outcome by imagination.

The core benefit is common: much of the expensive and dangerous real trial and error can be replaced by cheap and safe imagination.

The World Model's Role Per Task

─────────────────────────────────

locomotion ──▶ predict terrain/contact ──▶ avoid dangerous footsteps

manipulation──▶ predict grasp outcome ──▶ choose good grasp strategy

navigation ──▶ predict per-route future──▶ choose safe route

tool use ──▶ predict tool interaction ──▶ adjust how force is applied

▶ Common principle: filter outcomes by imagination before actually trying.

From the Perspective of Data Efficiency

The greatest practical value of a world model is data efficiency. Whereas the sim-to-real discussed earlier relies on a human-built simulator, a world model differs in that the robot learns the simulator itself from data.

Data Efficiency Comparison (conceptual tendency)

──────────────────────────────────────────────

model-free RL : real experience ██████████████████ (much needed)

model-based RL : real experience ████ (little needed)

+ imagined exp ░░░░░░░░░░░░░░ (cheap, in bulk)

▶ Replace much of real experience with imagined experience

Of course this is a conceptual tendency; actual efficiency varies greatly with the task and model quality.

The Planning-Horizon Trade-off

A decision you inevitably face when planning with a world model is the length of the planning horizon: how many steps into the future to imagine and evaluate.

A short horizon accumulates little prediction error, so the imagined states stay close to reality, but the planner cannot see far ahead and may make myopic decisions. A long horizon can take distant outcomes into account, but error accumulates over many steps and the imagined future becomes less trustworthy.

The Planning-Horizon Trade-off

────────────────────────────────

Short horizon: s ─▶ ŝ1 ─▶ ŝ2 accurate, but myopic

(small error)

Long horizon: s ─▶ ŝ1 ─▶ ... ─▶ ŝ10 sees far, but error accumulates

(trust drops)

▶ Usually balanced with a medium horizon + frequent re-planning (MPC).

A practical way to handle this trade-off is to combine the world model with a learned value function. Imagine explicitly only up to a short horizon, and approximate the value of the more distant future with a separately learned value function. This limits error accumulation while still reflecting long-term outcomes to some degree.

Short Imagination + Value Function for Long-Term Approximation

──────────────────────────────────────────────────────────────

s ─▶ ŝ1 ─▶ ŝ2 ─▶ ŝ3

│

▼

[ value function V(ŝ3) ] ← approximates value of the far future beyond

total evaluation = (sum of imagined short-term reward) + (value function's long-term estimate)
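
The total evaluation above can be computed as follows. The discount factor `gamma` is an assumption not present in the formula; setting `gamma=1.0` reproduces the plain sum shown in the diagram. The function name is hypothetical.

```python
from typing import Sequence


def bootstrapped_return(imagined_rewards: Sequence[float],
                        terminal_value: float,
                        gamma: float = 0.99) -> float:
    """Sum of (discounted) imagined rewards plus the value estimate V(s_H)
    at the last imagined state, discounted by gamma**H."""
    total, discount = 0.0, 1.0
    for r in imagined_rewards:
        total += discount * r
        discount *= gamma
    return total + discount * terminal_value


# Three imagined steps with reward 1 each, then V(ŝ3) = 10.
print(bootstrapped_return([1.0, 1.0, 1.0], 10.0, gamma=1.0))  # 13.0
print(bootstrapped_return([1.0, 1.0, 1.0], 10.0, gamma=0.5))  # 3.0
```

The world model is queried only for the three explicit steps; everything beyond them is summarized by the single number `terminal_value`, so model error has fewer steps over which to compound.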

The Rise of Generative World Models

Entering the mid-2020s, a trend of using large generative models as world models became prominent. A video generation model trained on vast internet video comes to embody a good deal of the world's physical common sense: objects falling, colliding, flowing, and so on.

To use such a model as a robot's world model, two things are needed. First, conditioning it on the robot's actions so it predicts "how the future changes if I take this action." Second, attaching a mechanism that judges reward or goal achievement so the prediction can be used in planning.

Using a Generative World Model for Planning

─────────────────────────────────────────────

goal image/instruction ──┐

├──▶ [predict future video for action candidates]

current observation ─────┘

│

▼

select the action candidate that produces

the future closest to the goal ──▶ execute
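
The selection step can be sketched as picking the candidate whose predicted outcome is closest to a goal image. Here `predict_final_frame` is a hypothetical stand-in for a generative video model: it predicts only the final frame of a 1-D "image" with one bright pixel, and integer actions shift that pixel. Squared pixel distance as the goal measure is likewise an assumption.

```python
from typing import List, Sequence


def predict_final_frame(frame: List[float], actions: Sequence[int]) -> List[float]:
    """Hypothetical stand-in for a video prediction model: each action shifts
    the bright pixel left or right, clamped to the image borders."""
    pos = frame.index(max(frame))
    for a in actions:
        pos = min(max(pos + a, 0), len(frame) - 1)
    out = [0.0] * len(frame)
    out[pos] = 1.0
    return out


def frame_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared pixel distance (assumed goal-achievement measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_candidate(frame: List[float], goal: List[float],
                     candidates: List[List[int]]) -> List[int]:
    """Return the action sequence whose predicted future is closest to the goal."""
    return min(
        candidates,
        key=lambda seq: frame_distance(predict_final_frame(frame, seq), goal),
    )


current = [1.0, 0.0, 0.0, 0.0, 0.0]
goal = [0.0, 0.0, 0.0, 1.0, 0.0]
print(select_candidate(current, goal, [[1, 1], [1, 1, 1], [-1, 1]]))
```

The distance function plays the role of the "mechanism that judges goal achievement": without it, the predicted futures could be generated but not ranked.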

The appeal of this approach is inheriting rich common sense from vast video data without a human coding the physics. Still, a limit remains: even if the generative model's predictions look plausible, they can be physically inaccurate or lose consistency, and these problems grow as prediction lengthens. Detailed capabilities and performance may vary greatly by model and version, so it is best to consult the official sources for specifics.

How to Evaluate a World Model

Evaluating a world model's performance is yet another problem distinct from policy evaluation. We look at it from two perspectives.

The first is prediction accuracy: how well the model's predicted next state or observation matches reality. That said, single-step prediction being accurate does not guarantee that multi-step rollouts are accurate, so you must also look at prediction error across multiple horizons.

The second is downstream performance. Ultimately the world model is a means to build a better policy, so how well the policy trained or planned with that model actually works is the most important metric. Even with some prediction error, it is often enough if the model captures just the core dynamics needed for planning.

Two Perspectives of World Model Evaluation

────────────────────────────────────────────

[1] prediction accuracy ──▶ predicted state vs actual state

│ (across horizons)

▼

[2] downstream performance ──▶ actual success rate of policy from this model

▶ Even if prediction is not perfect, a model useful for planning is a good one.

This distinction matters because pursuing pixel-perfect prediction does not necessarily lead to a good policy. Capturing the information needed for planning is often more important than visually perfect reproduction.
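
For the prediction-accuracy side, a multi-horizon check on a logged trajectory might look like the sketch below. The function name, the scalar state, and the toy dynamics (true gain 0.10, model gain 0.11) are hypothetical; the point is only that the error is measured separately for each horizon rather than for a single step.

```python
from typing import Callable, List


def k_step_errors(model: Callable[[float, float], float],
                  states: List[float], actions: List[float],
                  max_horizon: int) -> List[float]:
    """Mean absolute open-loop error for horizons 1..max_horizon.
    `states` must hold one more entry than `actions` (a logged trajectory)."""
    errors = []
    for k in range(1, max_horizon + 1):
        per_start = []
        for t in range(len(actions) - k + 1):
            s = states[t]
            for a in actions[t:t + k]:
                s = model(s, a)            # roll forward without peeking at real states
            per_start.append(abs(s - states[t + k]))
        errors.append(sum(per_start) / len(per_start))
    return errors


def true_step(s: float, a: float) -> float:
    return s + 0.10 * a                    # toy ground-truth dynamics


def learned_model(s: float, a: float) -> float:
    return s + 0.11 * a                    # toy model with a small gain error


actions = [1.0] * 10
states = [0.0]
for a in actions:
    states.append(true_step(states[-1], a))

print(k_step_errors(learned_model, states, actions, max_horizon=3))
```

In this toy case the one-step error is small, yet it grows with every extra step of open-loop prediction, which is why a single-step metric alone can be misleading.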

Summary of Related Concepts

Let us briefly summarize the concepts that have appeared so far.

- Model-based RL: The reinforcement learning family that learns a world model and improves a policy with it.

- Latent state: A low-dimensional representation compressing high-dimensional observations, in which prediction takes place.

- Imagined rollout: Rolling the future forward several steps inside the world model instead of the real environment.

- MPC: Imagining a short horizon to pick the best action, then re-planning after executing one step.

- Planning horizon: The length of how many steps into the future to imagine, a trade-off between accuracy and myopia.

These concepts intertwine into a single picture. Roll imagined rollouts on top of a latent state, plan on top of that with MPC or policy learning, and carefully handle the planning horizon and uncertainty. That is the big picture of world-model-based robot learning.

Pitfalls and Limits

- Accumulation of model error: A world model is not perfect. The more prediction steps are chained, the more small errors pile up, and the imagined future can diverge greatly from reality.

- Model exploitation: A policy can exploit gaps in the world model's predictions, learning unrealistic actions that earn high reward only inside the model. Such actions fail when executed in reality.

- Out-of-distribution situations: In novel situations absent from the training data, the world model's predictions are hard to trust.

- Difficulty of long-horizon prediction: The further into the future, the greater the prediction uncertainty. So it is usually safer to re-plan frequently over a short horizon.

Because of these limits, in practice it is common to buffer the error by combining the world model with MPC re-planning or correction through real observations, rather than treating it as a cure-all.

Relationship to Other Learning Approaches

How does the world model blend with the imitation learning, reinforcement learning, and sim-to-real covered in the earlier articles?

With imitation learning, the connection is that demonstration data is good material for learning a world model. Learn the world's dynamics from trajectories of a human operating the robot, and planning on top of that model can, in principle, find behaviors that surpass the demonstrations.

It combines directly with reinforcement learning under the name model-based RL. The imagined experience the world model provides greatly alleviates RL's data-efficiency problem.

It is complementary to sim-to-real. If sim-to-real is the problem of fitting a human-built simulator to reality, a world model is the robot learning the simulator itself from data. Combining the two, you can also roughly learn with an explicit simulator and then refine the world model with real data.

The Four Axes of Robot Learning and the World Model

─────────────────────────────────────────────────────

imitation learning ─────┐
reinforcement learning ─┤
sim-to-real ────────────┤──▶ the world model combines with these to
world model ────────────┘    provide imagination, planning, data efficiency

▶ The four axes are not substitutes but tools used together.

In this way, the world model is less a standalone technique than a common infrastructure that combines with other learning approaches to strengthen a robot's learning and planning.

Conclusion

World models are an attempt to endow robots with the ability to "imagine before acting." With the environment's dynamics learned, a robot can replace much of costly real trial and error with cheap imagination and plan better by looking ahead. Latent-space prediction makes this efficient, large-scale video prediction models embody rich physical common sense from data, and MPC and imagined rollouts connect those predictions to actual behavior.

At the same time, the fundamental limits of accumulating model error and out-of-distribution situations are clear. A world model is not a finished solution but an active research field aiming to make robots understand and predict the world better. Together with imitation learning, reinforcement learning, and sim-to-real, world models form another important axis of how robots learn.

References

- World Models paper (Ha and Schmidhuber, arXiv): https://arxiv.org/abs/1803.10122

- DreamerV3 paper (arXiv): https://arxiv.org/abs/2301.04104

- OpenAI Spinning Up (intro to RL): https://spinningup.openai.com/

- NVIDIA Isaac / robotics: https://developer.nvidia.com/isaac

- Open X-Embodiment paper (arXiv): https://arxiv.org/abs/2310.08864

- RT-2 paper (arXiv): https://arxiv.org/abs/2307.15818

- MuJoCo physics engine: https://mujoco.org/

- Gymnasium (RL environments): https://gymnasium.farama.org/