TF-DP: Trace-Focused Diffusion Policy for Multi-Modal Action Disambiguation in Long-Horizon Robotic Manipulation

Yuxuan Hu*, Xiangyu Chen*, Chuhao Zhou, Yuxi Liu, Gen Li, Jindou Jia, Jianfei Yang†
* Equal contribution. † Corresponding author: jianfei.yang@ntu.edu.sg.
MARS Lab, Nanyang Technological University


We introduce TF-DP, a diffusion-based policy that conditions action generation on execution history to resolve multi-modal action ambiguity (MA2) in long-horizon manipulation. TF-DP represents historical motion as an explicit trace, projects it into the visual observation space, and induces a trace-focused field to highlight task-relevant regions, improving temporal consistency and robustness under visual disturbances.

TF-DP overview figure

Abstract

Generative model-based policies have shown strong performance in imitation-based robotic manipulation by learning action distributions from demonstrations. However, in long-horizon tasks, visually similar observations often recur across execution stages while requiring distinct actions, which leads to ambiguous predictions when policies are conditioned only on instantaneous observations, a failure mode we term multi-modal action ambiguity (MA2). To address this challenge, we propose the Trace-Focused Diffusion Policy (TF-DP), a diffusion-based framework that explicitly conditions action generation on the robot’s execution history. TF-DP represents historical motion as an explicit execution trace and projects it into the visual observation space, providing stage-aware context when current observations alone are insufficient. In addition, the induced trace-focused field emphasizes task-relevant regions associated with historical motion, improving robustness to background visual disturbances. We evaluate TF-DP on real-world robotic manipulation tasks exhibiting pronounced multi-modal action ambiguity and visually cluttered conditions. Experimental results show that TF-DP improves temporal consistency and robustness, outperforming the vanilla diffusion policy by 80.56% on tasks with multi-modal action ambiguity and by 86.11% under visual disturbances, while maintaining inference efficiency with only a 6.4% runtime increase.

Highlights

  • Identify multi-modal action ambiguity (MA2) as a key failure mode of diffusion policies in long-horizon manipulation.
  • Condition action generation on execution traces projected into the visual space, with a trace-focused field that emphasizes motion-relevant regions.
  • Demonstrate strong gains in temporal consistency and robustness to visual disturbances on real-world long-horizon tasks.

Motivation

Long-horizon manipulation requires a single policy to execute temporally consistent action sequences. Diffusion policies conditioned only on instantaneous observations can produce ambiguous actions when visually similar states reappear at different execution stages.
This one-to-many mapping creates multi-modal action ambiguity (MA2), causing unstable action sampling and incorrect stage ordering whose errors compound over time, especially under visual disturbances.
TF-DP resolves MA2 by integrating execution traces into the policy, enabling the model to distinguish stages and act consistently within a single reactive policy.

Method

We propose TF-DP, a trace-focused diffusion policy for long-horizon robotic manipulation under multi-modal action ambiguity.

TF-DP aggregates historical robot motions into a compact execution trace, projects it into the current visual observation, and computes a trace-focused field that emphasizes task-relevant regions associated with past motion. The diffusion policy then conditions action generation on both the current observation and the trace cues, producing stage-aware, temporally consistent actions.
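One simple way to realize this conditioning is to stack the trace-focused field as an extra channel on the visual observation before it enters the policy's encoder. The sketch below is illustrative only; the function name and the channel-stacking scheme are our assumptions, not necessarily the paper's exact design.

```python
import numpy as np

def execution_aware_observation(rgb, trace_field):
    """Append the trace-focused field as an extra channel so the policy's
    visual encoder sees execution history alongside the current frame.

    rgb:         (H, W, 3) image, values in [0, 1]
    trace_field: (H, W) dense field over the projected trace, in [0, 1]
    """
    assert rgb.shape[:2] == trace_field.shape
    # Result is an (H, W, 4) execution-aware observation.
    return np.concatenate([rgb, trace_field[..., None]], axis=-1)
```

Keeping the trace as an image-space channel means the base visual encoder needs no architectural change, consistent with TF-DP's single-reactive-policy design.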

Concretely, TF-DP takes multi-view observations (global, side, wrist) and end-effector state, projects historical 3D end-effector positions into the global image, and renders a dense trace-focused field over the projected trajectory. This execution-aware observation guides the denoising process to resolve ambiguity when visually similar states recur.
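The projection and field-rendering steps above can be sketched with a standard pinhole camera model and Gaussian bumps placed along the projected trace. This is a minimal sketch under assumed conventions (world-to-camera extrinsics as a 4x4 matrix, an isotropic Gaussian with width `sigma`, and a max-over-points reduction); the paper's exact field construction may differ.

```python
import numpy as np

def project_trace(points_3d, K, T_cam_world):
    """Project historical 3D end-effector positions (world frame) into the
    global camera image. K is the 3x3 intrinsic matrix; T_cam_world is the
    4x4 world-to-camera transform."""
    n = points_3d.shape[0]
    homog = np.hstack([points_3d, np.ones((n, 1))])   # (n, 4) homogeneous
    cam = (T_cam_world @ homog.T).T[:, :3]            # (n, 3) camera frame
    uv = (K @ cam.T).T                                # (n, 3) projective
    return uv[:, :2] / uv[:, 2:3]                     # (n, 2) pixel coords

def trace_focused_field(uv, h, w, sigma=8.0):
    """Render a dense field peaking along the projected trace: a max over
    isotropic Gaussians centred at each projected point."""
    ys, xs = np.mgrid[0:h, 0:w]
    field = np.zeros((h, w))
    for u, v in uv:
        g = np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2.0 * sigma ** 2))
        field = np.maximum(field, g)
    return field
```

A point one metre in front of the camera on its optical axis projects to the principal point, where the rendered field attains its maximum of 1.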

TF-DP framework diagram

Compared to diffusion policies conditioned only on instantaneous observations, TF-DP explicitly encodes execution history, resolving MA2 within a single reactive policy without relying on hierarchical planners or external high-level reasoning.

During training, trace-focused fields are rendered from full demonstration trajectories; during inference, the trace is updated online step-by-step, enabling closed-loop execution without changing the base policy architecture.
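The online update during inference can be as simple as a rolling buffer of recent end-effector positions, re-rendered into a trace-focused field at each control step. The class name and buffer length below are illustrative assumptions.

```python
from collections import deque
import numpy as np

class OnlineTrace:
    """Rolling buffer of 3D end-effector positions, updated step-by-step
    at inference time so the trace-focused field can be re-rendered in
    closed loop without touching the base policy architecture."""

    def __init__(self, maxlen=64):
        self.buf = deque(maxlen=maxlen)  # oldest entries drop automatically

    def update(self, ee_pos):
        """Record the latest end-effector position (length-3 iterable)."""
        self.buf.append(np.asarray(ee_pos, dtype=float))

    def as_array(self):
        """Return the trace as an (n, 3) array for projection."""
        return np.stack(self.buf) if self.buf else np.empty((0, 3))
```

A bounded buffer keeps per-step cost constant, which matches TF-DP's small reported runtime overhead.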

Experimental Results

TF-DP experimental settings

We compare TF-DP against diffusion policy baselines, including history-conditioned variants, to test whether execution traces resolve MA2 and improve robustness under visual disturbances. TF-DP consistently recovers the correct action order in long-horizon execution and remains stable under cluttered scenes.

Experimental setup: a Franka Research 3 arm is observed by three RGB-D cameras (global overhead, side view, wrist). We evaluate three long-horizon tasks—place cube, press keyboard, and pick & place cubes from drawers—designed to contain visually similar states that require different actions.

Qualitative Results

Long-Horizon Consistency

TF-DP trajectory visualization

TF-DP produces temporally consistent trajectories by conditioning on execution traces, while baselines often switch to incorrect action stages under visually similar observations.


Robustness under Visual Disturbances

TF-DP remains stable under cluttered and visually disturbed scenes by leveraging trace-focused fields that emphasize task-relevant regions.

Videos

Each row compares the same manipulation task under two settings: the original scenario and a visually disturbed background. TF-DP maintains stable execution despite clutter and appearance shifts, while preserving task-relevant action sequences.

Task 1 (Original)
Task 1 (Disturbed)
Task 2 (Original)
Task 2 (Disturbed)
Task 3 (Original)
Task 3 (Disturbed)

Quantitative Results

TF-DP is evaluated on real-world robotic manipulation tasks that exhibit pronounced MA2 and visually cluttered conditions. Compared to the vanilla diffusion policy, TF-DP improves success by 80.56% on tasks with MA2 and by 86.11% under visual disturbances, while adding only 6.4% runtime overhead.

TF-DP robustness under visual disturbances
Performance under original scenario.
TF-DP quantitative results
Performance under visual disturbance.

BibTeX

@misc{hu2026tracefocuseddiffusionpolicymultimodal,
      title={Trace-Focused Diffusion Policy for Multi-Modal Action Disambiguation in Long-Horizon Robotic Manipulation}, 
      author={Yuxuan Hu and Xiangyu Chen and Chuhao Zhou and Yuxi Liu and Gen Li and Jindou Jia and Jianfei Yang},
      year={2026},
      eprint={2602.07388},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2602.07388}, 
}