SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation

1Stanford University 2UC Berkeley 3xdof.ai

arXiv | Code (coming soon)


Abstract

Large-scale robot learning has recently shown promise in enabling robots to perform complex tasks by integrating perception, control, and optionally, language understanding into a unified framework. However, such systems continue to struggle with long-horizon, contact-rich manipulation tasks, such as the handling of deformable objects, where supervision from demonstrations is often inconsistent in quality. In such settings, reward modeling offers a natural solution: by providing grounded progress signals, it can transform noisy demonstrations into stable supervision that generalizes across diverse trajectories. In this work, we introduce a stage-aware, video-based reward modeling framework that jointly predicts the high-level task stage and fine-grained progress within each stage. Reward labels are automatically derived from natural language subtask annotations, enabling consistent progress estimation across variable-length and heterogeneous demonstrations. This design overcomes the limitations of frame-index-based labeling, which collapses in long, variable-duration tasks such as folding a T-shirt. Our reward model demonstrates robustness to demonstration variability, generalization to out-of-distribution scenarios, and strong utility for downstream policy training. Building upon this reward model, we propose the Reward-Aligned Behavior Cloning (RA-BC) framework, which selectively filters high-quality data and reweights training samples according to reward estimates. Extensive experiments demonstrate that the reward model outperforms baselines on out-of-distribution real robot policy rollouts and human demonstration validation. Our approach achieves 83% success on folding T-shirts from the flattened state and 67% from the crumpled state—dramatically surpassing vanilla behavior cloning, which attains only 8% and 0% success, respectively, on the same training dataset. Overall, our results highlight reward modeling as a key enabler for scalable, annotation-efficient, and robust imitation learning in long-horizon robotic manipulation.
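To make the label construction concrete, below is a minimal Python sketch of how per-frame reward labels could be derived from subtask annotations, assuming each annotation marks a stage's start and end frames and progress is normalized linearly within each stage; the exact annotation format and normalization used by SARM may differ.

# Hedged sketch: one plausible way to turn subtask annotations into per-frame
# reward labels. The annotation tuple format and the per-stage linear
# normalization are illustrative assumptions, not the paper's exact procedure.
def reward_labels(annotations, num_stages):
    """annotations: list of (start_frame, end_frame, stage_index) covering one demo."""
    labels = []
    for start, end, stage in annotations:
        length = max(end - start, 1)
        for t in range(start, end):
            within = (t - start) / length                 # progress inside the stage, 0 -> 1
            labels.append((stage + within) / num_stages)  # monotone overall task progress
    return labels

# e.g. a two-stage demo with stages of different lengths still maps to [0, 1]
labels = reward_labels([(0, 30, 0), (30, 120, 1)], num_stages=2)

Because the label depends only on the annotated stage boundaries and relative position within a stage, demonstrations of very different durations receive comparable progress values, unlike frame-index-based labels.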


Overview of our method


Overview of our method's framework for (a) data processing, (b) reward model training, and (c) policy training with reward signals.


SARM's structure


Overview of SARM, stage-aware reward modeling. Left: SARM overview, which includes both a stage estimator and a subtask estimator. First, the task stage is predicted from the observations. This prediction is additionally passed into the subtask estimator, which predicts a scalar value for the progress within the stage. Right: An overview of the estimator architecture, which is replicated for both the stage estimator and the subtask estimator.
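The caption above can be read as a two-headed predictor. The following is a rough Python/PyTorch sketch of that forward pass; the layer sizes, the use of a precomputed observation feature, and the way the stage distribution conditions the subtask estimator are illustrative assumptions, not the released architecture.

# Hedged sketch of the stage estimator + subtask estimator described above.
import torch
import torch.nn as nn

class SARMSketch(nn.Module):
    def __init__(self, obs_dim=512, num_stages=4, hidden=256):
        super().__init__()
        self.num_stages = num_stages
        # predicts which high-level task stage the observation belongs to
        self.stage_estimator = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_stages))
        # predicts fine-grained progress within the stage, conditioned on the stage prediction
        self.subtask_estimator = nn.Sequential(
            nn.Linear(obs_dim + num_stages, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, obs_feat):
        stage_logits = self.stage_estimator(obs_feat)
        stage_probs = stage_logits.softmax(dim=-1)
        progress = self.subtask_estimator(
            torch.cat([obs_feat, stage_probs], dim=-1)).squeeze(-1)  # scalar in [0, 1]
        stage = stage_probs.argmax(dim=-1)
        reward = (stage + progress) / self.num_stages                # overall task progress
        return stage_logits, progress, reward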


Results of SARM

Demo Data Estimation


Examples of SARM's prediction on demonstration data



Video samples of SARM's prediction on demonstration data, delivering accurate and semantically meaningful estimates.

Policy Rollout Estimation


Examples of SARM's prediction on policy rollouts. Compared with human demonstration data, policy rollouts are more challenging because they often include failure modes that are out-of-distribution (OOD), such as misgrasps, recovery attempts, and back-and-forth motions. In the first example, the trajectory corresponds to a successful rollout where the robot folds the T-shirt correctly, with only minor struggles and misgrasps in the first ten seconds. In this case, SARM remains stable, keeping the estimated progress near zero during these OOD motions. The second example highlights a failed rollout, with four key frames: (1) the T-shirt is flattened after struggling, (2) folding is nearly complete, (3) the robot suddenly fails and crumples the T-shirt on the table, and (4) the unfolded T-shirt is placed in the corner. SARM provides reasonable progress estimates across all four stages, reflecting the actual task status.

Leveraging SARM for Policy Training



Example rollout of a T-shirt folding policy trained with RA-BC.
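As described in the abstract, RA-BC filters high-quality data and reweights training samples according to reward estimates from SARM. Below is a minimal Python sketch of that weighting idea; the filtering threshold, the specific weighting function, and the per-sample loss interface are assumptions for illustration, not the exact RA-BC objective.

# Hedged sketch of reward-aligned weighting for behavior cloning.
import torch

def ra_bc_loss(bc_loss_per_sample, reward_delta, keep_threshold=0.0):
    """bc_loss_per_sample: (B,) behavior-cloning losses for each training chunk.
    reward_delta: (B,) change in SARM-estimated progress over each chunk."""
    keep = (reward_delta > keep_threshold).float()        # filter out low-quality chunks
    weights = keep * reward_delta.clamp(min=0.0)          # reweight by progress actually made
    weights = weights / (weights.sum() + 1e-8)            # normalize over the batch
    return (weights * bc_loss_per_sample).sum()

The intent is that demonstration segments which make little or no progress under the reward model contribute little to the policy gradient, while segments with clear progress dominate training.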


Citations

If you find our work useful in your research, please consider citing:


@misc{chen2025sarm,
      title={SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation}, 
      author={Qianzhong Chen and Justin Yu and Mac Schwager and Pieter Abbeel and Fred Shentu and Philipp Wu},
      year={2025},
      eprint={2509.25358},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2509.25358}, 
}