World Model for Robot Learning: A Comprehensive Survey
A policy-centric survey of predictive world models for robot policy learning, planning, simulation, evaluation, data generation, and robotic video generation.
Abstract
World models are predictive representations of how environments evolve under actions. In robot learning, they support policy learning, planning, simulation, evaluation, and data generation. This survey reviews how world models are coupled with robot policies, how they serve as learned simulators, and how robotic video world models progress toward controllable, structured, and foundation-scale formulations.
World Model for Policy: Architectural paradigms for coupling predictive future modeling with action generation.
World Model as Simulator: Learned environments for reinforcement learning, validation, and decision-time evaluation.
Robotic Video Generation: Video-based future prediction for imagination, controllable rollout, and data amplification.
Timeline
The timeline is updated through March 31, 2026, and will be continuously maintained.
Recent world-model-based robot learning has progressed along two connected tracks: world models are becoming increasingly integrated with policy generation, while learned simulators are moving from validation and ranking toward reinforcement learning, post-training, and co-evolving optimization.
Overview
Taxonomy
World Model for Policy

The survey separates how predictive generation interacts with action production, ranging from modular predict-then-act pipelines to tightly integrated end-to-end and latent-space formulations.
A world model predicts future observations, then an inverse-dynamics-style policy recovers actions.
Decoupled · Predict then act
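To make the decoupled pattern concrete, here is a minimal PyTorch-style sketch, assuming a pretrained one-step predictor and an inverse-dynamics model; the module names, MLP sizes, and vector observations are illustrative stand-ins, not a specific method from the survey.

import torch
import torch.nn as nn

class VideoPredictor(nn.Module):
    """Stand-in for a pretrained world model that imagines the next observation."""
    def __init__(self, obs_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, obs_dim))

    def forward(self, obs):
        return self.net(obs)

class InverseDynamics(nn.Module):
    """Recovers the action that carries obs_t into the predicted obs_{t+1}."""
    def __init__(self, obs_dim=128, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))

    def forward(self, obs, next_obs):
        return self.net(torch.cat([obs, next_obs], dim=-1))

def predict_then_act(world_model, inv_dyn, obs):
    goal_obs = world_model(obs)    # stage 1: predict the future observation
    return inv_dyn(obs, goal_obs)  # stage 2: recover the action that realizes it

action = predict_then_act(VideoPredictor(), InverseDynamics(), torch.randn(1, 128))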
Future visual evolution and actions are modeled jointly inside one shared generative backbone.
Shared backbone · Video-action
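A minimal sketch of the shared-backbone idea, assuming a single transformer that attends jointly over tokenized past frames and learned frame/action queries; the token counts, attention heads, and 7-DoF action dimension are hypothetical choices.

import torch
import torch.nn as nn

class VideoActionBackbone(nn.Module):
    """One shared transformer decodes future frame latents and actions together."""
    def __init__(self, d=256, n_frame=16, n_act=4, act_dim=7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.frame_queries = nn.Parameter(torch.randn(n_frame, d))   # future-frame slots
        self.action_queries = nn.Parameter(torch.randn(n_act, d))    # action-chunk slots
        self.frame_head = nn.Linear(d, d)
        self.action_head = nn.Linear(d, act_dim)

    def forward(self, context):  # context: (B, T, d) tokenized past frames
        b, n_ctx = context.size(0), context.size(1)
        queries = torch.cat([self.frame_queries, self.action_queries], dim=0)
        h = self.encoder(torch.cat([context, queries.unsqueeze(0).expand(b, -1, -1)], dim=1))
        n_f = self.frame_queries.size(0)
        return self.frame_head(h[:, n_ctx:n_ctx + n_f]), self.action_head(h[:, n_ctx + n_f:])

future_latents, actions = VideoActionBackbone()(torch.randn(2, 32, 256))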
Specialized video, action, and language experts interact through fusion or shared attention.
Expert fusion · Joint attention
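One hedged way to realize expert fusion: an action expert cross-attends into concatenated video and language expert tokens. The expert encoders below are linear stand-ins, and the single fusion layer and dimensions are illustrative, not the survey's prescribed design.

import torch
import torch.nn as nn

class ExpertFusion(nn.Module):
    """Action expert queries video and language experts through shared attention."""
    def __init__(self, d=256, act_dim=7):
        super().__init__()
        self.video_proj = nn.Linear(512, d)    # stand-in for a video expert
        self.lang_proj = nn.Linear(768, d)     # stand-in for a language expert
        self.action_proj = nn.Linear(act_dim, d)
        self.cross_attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.action_head = nn.Linear(d, act_dim)

    def forward(self, video_tokens, lang_tokens, action_history):
        kv = torch.cat([self.video_proj(video_tokens), self.lang_proj(lang_tokens)], dim=1)
        q = self.action_proj(action_history)
        fused, _ = self.cross_attn(q, kv, kv)  # joint attention over both expert streams
        return self.action_head(fused)

actions = ExpertFusion()(torch.randn(2, 16, 512), torch.randn(2, 8, 768), torch.randn(2, 4, 7))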
Future-oriented prediction is internalized inside a multimodal vision-language-action (VLA) policy through visual foresight or structured world knowledge.
VLA · Foresight · World knowledge
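A rough sketch of internalized foresight, assuming a VLA-style policy trained with an auxiliary head that predicts a future observation embedding alongside the action; the loss weighting and embedding interface are assumptions rather than a method from the survey.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ForesightVLA(nn.Module):
    """Policy whose backbone also predicts a future observation embedding."""
    def __init__(self, d=256, act_dim=7):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(2 * d, 512), nn.ReLU(), nn.Linear(512, d))
        self.action_head = nn.Linear(d, act_dim)
        self.foresight_head = nn.Linear(d, d)  # auxiliary visual-foresight prediction

    def forward(self, obs_emb, lang_emb):
        h = self.backbone(torch.cat([obs_emb, lang_emb], dim=-1))
        return self.action_head(h), self.foresight_head(h)

def training_loss(model, obs_emb, lang_emb, action_gt, future_emb, w=0.5):
    action, foresight = model(obs_emb, lang_emb)
    # The foresight term internalizes future prediction inside the policy itself.
    return F.mse_loss(action, action_gt) + w * F.mse_loss(foresight, future_emb)

model = ForesightVLA()
loss = training_loss(model, torch.randn(8, 256), torch.randn(8, 256),
                     torch.randn(8, 7), torch.randn(8, 256))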
Future prediction is represented in compact latent space, reducing the need for explicit pixel-level decoding.
Latent dynamics · JEPA-style · Efficient control
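A minimal JEPA-style latent-dynamics sketch: the predictor maps the current latent and action to the next latent, supervised by a stop-gradient encoding of the next observation, so no pixels are ever decoded. The encoder sizes and MSE objective are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentDynamics(nn.Module):
    """Predicts the next latent state from the current latent and action."""
    def __init__(self, obs_dim=128, act_dim=7, z_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, z_dim))
        self.predictor = nn.Sequential(nn.Linear(z_dim + act_dim, 128), nn.ReLU(), nn.Linear(128, z_dim))

    def loss(self, obs, action, next_obs):
        z = self.encoder(obs)
        z_pred = self.predictor(torch.cat([z, action], dim=-1))
        with torch.no_grad():                 # stop-gradient target, no pixel decoding
            z_target = self.encoder(next_obs)
        return F.mse_loss(z_pred, z_target)

loss = LatentDynamics().loss(torch.randn(8, 128), torch.randn(8, 7), torch.randn(8, 128))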
World Model as Simulator

The survey also treats world models as learned environments. In this view, a world model can generate imagined transitions for policy improvement, or roll out candidate actions to validate, rank, and evaluate likely outcomes before execution.
The world model serves as a learned simulator that produces imagined observations, rewards, and termination signals for policy updates.
Imagined rollouts · Policy update · Post-training
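A compact sketch of policy improvement inside the learned simulator, loosely in the spirit of Dreamer-style imagination: the policy is updated by backpropagating an imagined return through the model. The (obs, action) -> (next_obs, reward, done) interface and all sizes are assumptions.

import torch
import torch.nn as nn

class DreamWorldModel(nn.Module):
    """Stand-in learned simulator mapping (obs, action) to (next_obs, reward, done)."""
    def __init__(self, obs_dim=32, act_dim=4):
        super().__init__()
        self.dyn = nn.Linear(obs_dim + act_dim, obs_dim)
        self.rew = nn.Linear(obs_dim + act_dim, 1)

    def forward(self, obs, action):
        x = torch.cat([obs, action], dim=-1)
        reward = self.rew(x).squeeze(-1)
        done = torch.zeros_like(reward, dtype=torch.bool)  # termination head omitted
        return self.dyn(x), reward, done

def policy_update(policy, world_model, optimizer, obs, horizon=15, gamma=0.99):
    """Update the policy purely on imagined transitions, no real environment."""
    imagined_return = 0.0
    for t in range(horizon):
        action = policy(obs)
        obs, reward, _ = world_model(obs, action)
        imagined_return = imagined_return + (gamma ** t) * reward
    loss = -imagined_return.mean()  # maximize return by backprop through the model
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

policy = nn.Sequential(nn.Linear(32, 64), nn.Tanh(), nn.Linear(64, 4))
world_model = DreamWorldModel()
policy_update(policy, world_model, torch.optim.Adam(policy.parameters(), lr=3e-4),
              torch.randn(16, 32))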
The world model evaluates candidate actions by predicting their consequences, enabling decision-time validation, ranking, and rollout-based scoring.
Action ranking · Validation · Rollout scoring
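A minimal sketch of decision-time rollout scoring under the same assumed (obs, action) -> (next_obs, reward, done) interface: each candidate action sequence is rolled out in the world model, ranked by predicted cumulative reward, and only the best first action is executed, MPC-style.

import torch

def rollout_score(world_model, obs, action_seq):
    """Predicted cumulative reward of one candidate action sequence."""
    total = torch.zeros(())
    for action in action_seq:
        obs, reward, done = world_model(obs, action)
        total = total + reward
        if done:
            break
    return total

def rank_and_select(world_model, obs, candidates):
    """Validate and rank candidates; return the best first action plus all scores."""
    scores = torch.stack([rollout_score(world_model, obs, seq) for seq in candidates])
    return candidates[int(torch.argmax(scores))][0], scores

def dummy_world_model(obs, action):
    """Toy simulator with the assumed interface: drive the state toward the origin."""
    next_obs = obs + 0.1 * action
    return next_obs, -next_obs.norm(), False

candidates = torch.randn(32, 8, 4)  # 32 candidate sequences, horizon 8, 4-D actions
best_action, scores = rank_and_select(dummy_world_model, torch.randn(4), candidates)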
Robotic Video Generation

Beyond policy coupling and simulator-style usage, the survey also reviews video-based world models by their modeling capability: from imagination-based generation for policy learning to action-controllable, structure-aware, and foundation-scale formulations.
Imagination: Video generation synthesizes future task executions and expands supervision for downstream control.
Action-controllable: Generated futures are conditioned on robot actions, making rollouts useful for planning and evaluation.
Structure-aware: Models incorporate object, geometry, contact, or physical structure to improve action-relevant prediction.
Foundation-scale: Large-scale video backbones are adapted into reusable predictive substrates for embodied agents.
Resources
For a compact homepage summary, we highlight two dataset/resource tables from the survey: core data attributes and relevance to embodied world-modeling capabilities.
X-Emb.: cross-embodiment coverage.
Act.: explicit action supervision or an aligned action proxy.
Obs./3D: strong observation support beyond basic monocular RGB, e.g., multi-view, depth, LiDAR, or 3D annotations.
Lang.: language/task conditioning.
M/C: multimodal or contact-rich signals such as force, tactile, audio, or dense proprioceptive/contact cues.
✓ denotes strong support, – denotes partial/moderate support, and ✗ denotes limited or no support.
Open Challenges
Predicted futures should reflect the consequences of candidate robot actions, not only visual plausibility.
Embodied tasks require stable rollouts and reliable planning under compounding prediction errors.
Video-action diffusion and rollout-based reasoning must become lightweight enough for practical control loops.
Object-centric, relational, symbolic, and latent states may be more useful than pixels for planning and control.
Benchmarks should measure functional utility for robot learning, not only future-frame realism.
Citation
@article{hou2026worldmodelrobotlearning,
  title  = {World Model for Robot Learning: A Comprehensive Survey},
  author = {Hou, Bohan and Li, Gen and Jia, Jindou and An, Tuo and Guo, Xinying and Leng, Sicong and Geng, Haoran and Ze, Yanjie and Harada, Tatsuya and Torr, Philip and Mees, Oier and Pollefeys, Marc and Liu, Zhuang and Wu, Jiajun and Abbeel, Pieter and Malik, Jitendra and Du, Yilun and Yang, Jianfei},
  year   = {2026},
}