RM-RL

Role-Model Reinforcement Learning for Precise Robot Manipulation

ICRA 2026 · Real-World RL · Precision Pick-and-Place
Affiliations: NTU Mars Lab

RM-RL converts real-world trials into supervised learning signals by selecting a role-model action from repeated scenes, improving data efficiency and stability for millimeter-level manipulation.

53% translation error reduction
20% rotation error reduction
100% success w/ pretraining

“When three people walk together, there must be a role model whom I can learn from; I will select the good qualities and follow them.” — Confucius, The Analects

This page accompanies our ICRA 2026 paper “RM-RL: Role-Model Reinforcement Learning for Precise Robot Manipulation”. We study a real-world precision pick-and-place task where small translation and yaw errors can cause failure. Standard real-world RL is data-inefficient, and replay buffers can introduce distribution mismatch. RM-RL selects a role-model action (the highest-reward action among trials sharing similar initial states) and automatically labels peer samples, turning online interaction into supervised targets that replay stably. See the PDF below or use the header link for full details and results.



Project Video. End-to-end demonstration of role-model labeling and precise placement.

Abstract

Precise robot manipulation is critical in laboratory and industrial workflows where millimeter-level errors can invalidate outcomes. However, collecting expert demonstrations is costly, and purely online real-world RL can be slow and unstable. We propose RM-RL, a role-model reinforcement learning framework that periodically identifies a role-model (approximately optimal) action and uses it to label other samples collected under similar initial states, converting real-world RL trials into supervised data. Combined with a hybrid online–offline training recipe, RM-RL improves data efficiency and training stability. In real-world experiments, RM-RL improves translation accuracy by 53% and rotation accuracy by 20%, and reaches up to 100% success on the challenging “cell plate to shelf” task with pretraining.

RM-RL Teaser
Teaser — Precision pick-and-place with repeated trials under similar initial states.

Motivation

Real-world precision manipulation often requires millimeter-level alignment and is sensitive to small pose errors. While online reinforcement learning can in principle optimize for task rewards directly on hardware, it is frequently data-inefficient and unstable. In practice, replay buffers may further hurt learning due to distribution mismatch between newly collected data and older off-policy samples. RM-RL is motivated by a simple question: can we reuse real-world interaction data more effectively by extracting an approximate “best action” from repeated trials under similar initial states and using it as a stable supervised signal during replay?

Method

Paper Overview

RM-RL alternates between online real-world RL and offline supervised replay. Online interaction collects multiple action trials under similar initial states (a “scene”) and evaluates them with a task reward. RM-RL selects the highest-reward action as the role-model action and assigns it as a label to other samples from the same scene. These labeled samples are then replayed using a cross-entropy objective to improve sample efficiency and mitigate instability during learning.
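As a minimal sketch of the labeling step, assuming a scene is stored as a list of trials with obs, action, and reward fields (the function and field names here are illustrative, not taken from the released code):

def label_scene_with_role_model(scene):
    """Pick the highest-reward trial in a scene and use its action as the
    supervised label for every sample collected in that scene."""
    role_model = max(scene, key=lambda trial: trial["reward"])
    # Each observation in the scene is paired with the role-model action,
    # turning online RL trials into supervised (obs, label) pairs.
    return [(trial["obs"], role_model["action"]) for trial in scene]

# Example: three trials under similar initial states; the second trial
# (reward 0.9) supplies the label for all three observations.
scene = [
    {"obs": [0.0, 0.0], "action": [0.010, -0.020, 0.10], "reward": 0.3},
    {"obs": [0.0, 0.0], "action": [0.002, -0.001, 0.05], "reward": 0.9},
    {"obs": [0.0, 0.0], "action": [0.030, 0.020, -0.20], "reward": 0.1},
]
labeled_pairs = label_scene_with_role_model(scene)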

RM-RL pipeline
Fig. 2 — RM-RL pipeline: online RL, role-model labeling, and supervised replay.
Policy network
Fig. 3 — Policy network with image + pose inputs and probabilistic action heads.

Highlights

  • Role-model online labeling: select the highest-reward action under similar initial states as an approximate optimal label.
  • Hybrid online–offline training: combine on-policy updates with supervised replay over labeled real-world samples (see the training-step sketch after this list).
  • Real-world precision manipulation: strong gains in pose accuracy and success rate on a challenging shelf placement task.
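The sketch below illustrates one way the supervised-replay half of the hybrid recipe could look, assuming the probabilistic action heads discretize each action dimension so that role-model labels can be trained with a cross-entropy loss. The Policy class, supervised_replay_step, and all dimensions are illustrative placeholders, not the released implementation.

import torch
import torch.nn as nn

class Policy(nn.Module):
    def __init__(self, obs_dim=4, n_bins=21):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        # One categorical head per action dimension (dx, dy, dyaw).
        self.heads = nn.ModuleList([nn.Linear(64, n_bins) for _ in range(3)])

    def forward(self, obs):
        h = self.backbone(obs)
        return [head(h) for head in self.heads]  # logits per action dimension

def supervised_replay_step(policy, optimizer, obs_batch, label_bins):
    """Offline phase: fit the policy to role-model labels with cross-entropy."""
    logits = policy(obs_batch)
    loss = sum(nn.functional.cross_entropy(head_logits, label_bins[:, i])
               for i, head_logits in enumerate(logits))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

policy = Policy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
obs_batch = torch.randn(8, 4)              # stand-in for image + pose features
label_bins = torch.randint(0, 21, (8, 3))  # discretized role-model actions
supervised_replay_step(policy, optimizer, obs_batch, label_bins)

In the hybrid recipe, such a supervised step would be interleaved with ordinary on-policy updates on freshly collected scenes.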

Experimental Results

Experimental setup and tasks. We evaluate RM-RL on a real-world precision pick-and-place benchmark using an X-ARM 6 with an overhead camera. Each trial starts from a slightly perturbed pose, and the policy predicts small translation and yaw corrections to place a cell plate into a shelf slot. We report accuracy in \( \Delta x \), \( \Delta y \), and \( \Delta \psi \), along with success rates over repeated trials, comparing against online RL baselines and replay-buffer variants.
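For concreteness, the sketch below shows one plausible way to turn a predicted and a target placement pose (x, y in millimetres, yaw in degrees) into the reported translation and yaw errors and a success decision; the tolerance values are illustrative assumptions, not thresholds from the paper.

import math

def pose_errors(pred, target):
    """Return (translation error in mm, absolute yaw error in degrees)."""
    dx, dy = pred[0] - target[0], pred[1] - target[1]
    trans_err = math.hypot(dx, dy)
    # Wrap the yaw difference into [-180, 180) before taking its magnitude.
    dyaw = (pred[2] - target[2] + 180.0) % 360.0 - 180.0
    return trans_err, abs(dyaw)

def is_success(pred, target, trans_tol_mm=2.0, yaw_tol_deg=2.0):
    trans_err, yaw_err = pose_errors(pred, target)
    return trans_err <= trans_tol_mm and yaw_err <= yaw_tol_deg

# Example: about 1.1 mm translation error and 0.5 deg yaw error counts as success.
print(is_success((101.0, 49.5, 0.5), (100.0, 50.0, 0.0)))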

Experimental setting
Fig. 4 — Real-world experimental setting: X-ARM 6, cell plate shelf, and overhead camera. Red stars mark target positions.

The following videos summarize qualitative performance. Origin RL and origin RL with Replay Buffer show failure cases of the baseline algorithm, while RM-RL (ours) and Pretrain + RL (ours) demonstrate successful placement with the proposed method.

origin RL (baseline). Failure case of the original RL.

origin RL with Replay Buffer. Failure case of the original RL with a replay buffer.

RM-RL (ours). Successful demo of the proposed RM-RL method.

Pretrain + RL (ours). Successful demo of the proposed pretrained RM-RL method.

Resources

PDF: Use the "Paper (PDF)" link in the header or the embedded viewer above.
Code / Data: Implementation and real-world training utilities will be released after publication.
Contact: Please email the authors for early access or reproducibility questions.

BibTeX

@article{chen2025rm,
  title={RM-RL: Role-Model Reinforcement Learning for Precise Robot Manipulation},
  author={Chen, Xiangyu and Zhou, Chuhao and Liu, Yuxi and Yang, Jianfei},
  journal={arXiv preprint arXiv:2510.15189},
  year={2025}
}

@inproceedings{chen2026rm,
  title={RM-RL: Role-Model Reinforcement Learning for Precise Robot Manipulation},
  author={Chen, Xiangyu and Zhou, Chuhao and Liu, Yuxi and Yang, Jianfei},
  booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
  year={2026}
}