
World Model for Robot Learning: A Comprehensive Survey

A policy-centric survey of predictive world models for robot policy learning, planning, simulation, evaluation, data generation, and robotic video generation.

Bohan Hou1,*,†, Gen Li1,*, Jindou Jia1,*, Tuo An1,*, Xinying Guo1,*, Sicong Leng1, Haoran Geng2, Yanjie Ze3, Tatsuya Harada4, Philip Torr5, Oier Mees6, Marc Pollefeys7, Zhuang Liu8, Jiajun Wu3, Pieter Abbeel2, Jitendra Malik2, Yilun Du9, Jianfei Yang1,†
1Nanyang Technological University, 2University of California, Berkeley, 3Stanford University, 4The University of Tokyo, 5University of Oxford, 6Microsoft, 7ETH Zurich, 8Princeton University, 9Harvard University
*Equal Contribution    †Corresponding Author

Abstract

Predictive models for embodied decision making

World models are predictive representations of how environments evolve under actions. In robot learning, they support policy learning, planning, simulation, evaluation, and data generation. This survey reviews how world models are coupled with robot policies, how they serve as learned simulators, and how robotic video world models progress toward controllable, structured, and foundation-scale formulations.

01

World Model for Policy

Architectural paradigms for coupling predictive future modeling with action generation.

02

World Model as Simulator

Learned environments for reinforcement learning, validation, and decision-time evaluation.

03

Robotic Video World Models

Video-based future prediction for imagination, controllable rollout, and data amplification.

Timeline

Policy and simulator evolve together

The timeline is updated through March 31, 2026, and will be continuously maintained.

Recent world-model-based robot learning has progressed along two connected tracks: world models become increasingly integrated with policy generation, while learned simulators move from validation and ranking toward reinforcement learning, post-training, and co-evolving optimization.

Combined policy and simulator timeline
2023.1–2025.1
Policy
  • Decoupled predict-then-act pipelines
  • Video generation plus inverse dynamics
2024.6–2025.11
Simulator
  • Rollout-based validation
  • Candidate action evaluation and ranking
2025.2–2025.12
Policy
  • Single-backbone and MoE/MoT policies
  • Unified VLA and latent world modeling
2025.11–2026.3
Simulator · Policy
  • Learned simulators for RL and post-training
  • Closed-loop world-model / policy optimization

Overview

Survey organization

Survey organization overview
The survey is organized around world models for policy, world models as simulators, and robotic video generation.

Taxonomy

Architectural paradigms for world-model-based policies

The survey separates how predictive generation interacts with action production, ranging from modular predict-then-act pipelines to tightly integrated end-to-end and latent-space formulations.

IDM-style policy architecture

IDM-style policies

A world model predicts future observations, then an inverse-dynamics-style policy recovers actions.

Decoupled · Predict-then-act
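As a toy illustration of the decoupled loop, the sketch below stubs both stages with hypothetical numpy functions: `predict_future` and `inverse_dynamics` are placeholders standing in for a learned video predictor and a learned IDM, not components of any surveyed system, and the environment dynamics are assumed additive.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_future(obs, goal):
    # Hypothetical world-model stub: imagines the next observation as a
    # step toward the goal (a real model would be a learned video predictor).
    return obs + 0.5 * (goal - obs)

def inverse_dynamics(obs, next_obs):
    # Hypothetical IDM stub: recovers the action that explains the imagined
    # transition (here, actions are observation deltas by assumption).
    return next_obs - obs

goal = np.ones(4)
obs = rng.normal(size=4)
start = obs.copy()

# Decoupled predict-then-act loop: imagine a future, recover the action, execute.
for _ in range(3):
    imagined = predict_future(obs, goal)
    action = inverse_dynamics(obs, imagined)
    obs = obs + action  # assumed environment dynamics: additive actions

print(np.linalg.norm(goal - obs) < np.linalg.norm(goal - start))  # True
```

Note that the world model and the policy never share parameters here; that separation is exactly what the single-backbone and MoE/MoT paradigms below remove.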
Single-backbone policy architecture

Single-backbone policies

Future visual evolution and actions are modeled jointly inside one shared generative backbone.

Shared backbone · Video-action
MoE or MoT-style policy architecture

MoE / MoT-style policies

Specialized video, action, and language experts interact through fusion or shared attention.

Expert fusion · Joint attention
Unified VLA architecture

Unified VLA models

Future-oriented prediction is internalized inside a multimodal VLA policy through visual foresight or structured world knowledge.

VLA · Foresight · World knowledge
Latent-space world modeling architecture

Latent-space world modeling

Future prediction is carried out in a compact latent space, reducing the need for explicit pixel-level decoding.

Latent dynamics · JEPA-style · Efficient control
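A minimal sketch of the latent-rollout pattern, with random weight matrices standing in for a trained encoder and latent dynamics model (no component here is from any surveyed work): once the observation is encoded, the entire rollout stays in latent space, which is what makes long imagined horizons cheap.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical latent-space world model (JEPA-style). Both weight matrices
# are random placeholders for trained encoder / dynamics networks.
W_enc = 0.1 * rng.normal(size=(8, 16))       # encoder: observation -> latent
W_dyn = 0.1 * rng.normal(size=(16 + 4, 16))  # dynamics: (latent, action) -> latent

def encode(obs):
    return np.tanh(obs @ W_enc)

def latent_step(z, action):
    return np.tanh(np.concatenate([z, action]) @ W_dyn)

# The rollout lives entirely in latent space: no pixel decoding in the loop.
z = encode(rng.normal(size=8))
for _ in range(10):
    z = latent_step(z, rng.normal(size=4))

print(z.shape)  # (16,)
```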

World Model as Simulator

From reinforcement learning to evaluation

The survey also treats world models as learned environments. In this view, a world model can generate imagined transitions for policy improvement, or roll out candidate actions to validate, rank, and evaluate likely outcomes before execution.

World model for reinforcement learning

For reinforcement learning

The world model serves as a learned simulator that produces imagined observations, rewards, and termination signals for policy updates.

Imagined rollouts · Policy update · Post-training
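The simulator-for-RL pattern can be sketched in the spirit of Dyna: tabular Q-learning trained entirely on imagined transitions. Here `wm_step` is a hypothetical learned model of a five-state chain (dynamics, reward, and termination all come from the model, never a real environment); the task and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5  # five-state chain; reward only on reaching the rightmost state

def wm_step(s, a):
    # Hypothetical learned simulator: imagined dynamics, reward, termination.
    s2 = min(max(s + (1 if a == 1 else -1), 0), N - 1)
    return s2, float(s2 == N - 1), s2 == N - 1

# Q-learning driven entirely by imagined rollouts from the world model.
Q = np.zeros((N, 2))
for _ in range(500):
    s = int(rng.integers(0, N - 1))
    for _ in range(20):
        # epsilon-greedy action selection over the current Q-values
        a = int(rng.integers(0, 2)) if rng.random() < 0.3 else int(Q[s].argmax())
        s2, r, done = wm_step(s, a)
        Q[s, a] += 0.5 * (r + 0.9 * Q[s2].max() * (not done) - Q[s, a])
        if done:
            break
        s = s2

print(Q.argmax(axis=1)[:4])  # greedy policy moves right in every start state
```

The same loop underlies imagination-based post-training: only the source of transitions changes, from a hand-written stub to a learned predictive model.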
World model for evaluation

For evaluation

The world model evaluates candidate actions by predicting their consequences, enabling decision-time validation, ranking, and rollout-based scoring.

Action ranking · Validation · Rollout scoring
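Decision-time evaluation reduces to sample-score-rank. The sketch below assumes a hypothetical world model of a 2-D point robot (`wm_rollout` is a stub, and the goal-distance score is an illustrative choice, not a method from the survey): candidate actions are rolled out in imagination and ranked by their predicted outcomes before anything is executed.

```python
import numpy as np

rng = np.random.default_rng(3)

def wm_rollout(state, action, horizon=5):
    # Hypothetical world model of a 2-D point robot: the candidate
    # velocity command is held fixed for `horizon` imagined steps.
    for _ in range(horizon):
        state = state + 0.1 * action
    return state

goal = np.array([1.0, 1.0])
state = np.zeros(2)

# Sample candidate actions, score each imagined outcome, pick the best.
candidates = rng.normal(size=(32, 2))
scores = np.array([-np.linalg.norm(goal - wm_rollout(state, a))
                   for a in candidates])
best = candidates[scores.argmax()]

print(np.linalg.norm(goal - wm_rollout(state, best)))
```

Replacing the random candidate set with iterative refinement (e.g. CEM-style resampling around the top-scoring actions) turns this ranking loop into rollout-based planning.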

Robotic Video Generation

World models for robotic video generation

Beyond policy coupling and simulator-style usage, the survey also reviews video-based world models by their modeling capability: from imagination-based generation for policy learning to action-controllable, structure-aware, and foundation-scale formulations.

Robotic video world model taxonomy
A compact visual summary of robotic video world models organized by world-modeling capability.
01

Imagination for policy learning

Video generation synthesizes future task executions and expands supervision for downstream control.

02

Action-controllable world models

Generated futures are conditioned on robot actions, making rollouts useful for planning and evaluation.

03

Structure-aware world models

Models incorporate object, geometry, contact, or physical structure to improve action-relevant prediction.

04

Foundation world models

Large-scale video backbones are adapted into reusable predictive substrates for embodied agents.

Resources

Benchmarks and datasets

For a compact homepage summary, we highlight two dataset/resource tables from the survey: core data attributes and relevance to embodied world-modeling capabilities.

Table 3 dataset and resource attributes
Table 3: core attributes of representative embodied world-model training datasets/resources.
Table 4 dataset and resource relevance to embodied world-modeling capabilities
Table 4: relevance of representative datasets/resources to embodied world-modeling capabilities.

X-Emb.: cross-embodiment coverage. Act.: explicit action supervision or aligned action proxy. Obs./3D: strong observation support beyond basic monocular RGB, e.g., multi-view, depth, LiDAR, or 3D annotations. Lang.: language/task conditioning. M/C: multimodal or contact-rich signals such as force, tactile, audio, or dense proprioceptive/contact cues. ✓ denotes strong support, – denotes partial/moderate support, and ✗ denotes limited or no support.

Open Challenges

Toward actionable world models

Action consistency

Predicted futures should reflect the consequences of candidate robot actions, not only visual plausibility.

Long-horizon stability

Embodied tasks require stable rollouts and reliable planning under compounding prediction errors.

Efficient inference

Video-action diffusion and rollout-based reasoning must become lightweight enough for practical control loops.

Structured abstraction

Object-centric, relational, symbolic, and latent states may be more useful than pixels for planning and control.

Evaluation protocols

Benchmarks should measure functional utility for robot learning, not only future-frame realism.

Citation

BibTeX

@article{hou2026worldmodelrobotlearning,
  title   = {World Model for Robot Learning: A Comprehensive Survey},
  author  = {Hou, Bohan and Li, Gen and Jia, Jindou and An, Tuo and Guo, Xinying and Leng, Sicong and Geng, Haoran and Ze, Yanjie and Harada, Tatsuya and Torr, Philip and Mees, Oier and Pollefeys, Marc and Liu, Zhuang and Wu, Jiajun and Abbeel, Pieter and Malik, Jitendra and Du, Yilun and Yang, Jianfei},
  year    = {2026},
}