SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation

¹Stanford University, ²UC Berkeley, ³xdof.ai
Accepted to ICLR 2026

Abstract

Large-scale robot learning has made progress on complex manipulation tasks, yet long-horizon, contact-rich problems—especially those involving deformable objects—remain challenging due to inconsistent demonstration quality. We propose a stage-aware, video-based reward modeling framework that jointly predicts task stage and fine-grained progress, using natural-language subtask annotations to derive consistent labels across variable-length demonstrations. This avoids the brittleness of frame-index-based labeling and provides stable supervision even in tasks like T-shirt folding. Our reward model is robust to demonstration variability, generalizes to out-of-distribution scenarios, and improves downstream policy training. Building on it, we introduce Reward-Aligned Behavior Cloning (RA-BC), which filters and reweights demonstrations based on reward estimates. Experiments show that our method significantly outperforms baselines in both real-world rollouts and human validation. On T-shirt folding, we achieve 83% success from the flattened state and 67% from the crumpled state, compared to 8% and 0% with vanilla BC. Overall, our results highlight reward modeling as a scalable and annotation-efficient solution for long-horizon robotic manipulation.


Overview of our method

Overview of our method's framework for (a) data processing, (b) reward model training, and (c) policy training with reward signals.
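The data-processing step in (a) can be illustrated with a minimal sketch of how subtask annotations might yield progress labels that stay consistent across demonstrations of different lengths. This is an illustration under stated assumptions, not the authors' implementation: the function name `stage_aware_labels`, the `stage_starts` annotation format, and the fixed per-stage `stage_weights` are all hypothetical.

```python
from bisect import bisect_right


def stage_aware_labels(stage_starts, num_frames, stage_weights):
    """Return (stage, within_stage_progress, overall_progress) for every frame.

    stage_starts:  first frame index of each annotated subtask (ascending, starts at 0).
    stage_weights: ASSUMED share of the whole task covered by each subtask (sums to 1).
    Every stage is assumed to contain at least one frame.
    """
    assert stage_starts[0] == 0 and len(stage_starts) == len(stage_weights)
    stage_ends = list(stage_starts[1:]) + [num_frames]
    # Progress already accumulated when each stage begins.
    completed = [sum(stage_weights[:k]) for k in range(len(stage_weights))]
    labels = []
    for t in range(num_frames):
        k = bisect_right(stage_starts, t) - 1           # stage containing frame t
        length = stage_ends[k] - stage_starts[k]
        tau = (t - stage_starts[k]) / length            # progress inside stage k
        labels.append((k, tau, completed[k] + stage_weights[k] * tau))
    return labels


# Two demonstrations of the same task; the second one spends far longer in stage 0.
weights = [0.25, 0.75]
fast = stage_aware_labels([0, 10], 20, weights)
slow = stage_aware_labels([0, 40], 50, weights)
# The start of stage 1 gets overall progress 0.25 in both demonstrations,
# whereas a frame-index label t / T would give 10/20 = 0.5 and 40/50 = 0.8.
print(fast[10][2], slow[40][2])
```

The point of the sketch is only the contrast with frame-index labeling: the same semantic event receives the same label regardless of how long a demonstrator took to reach it.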


SARM's structure

Overview of SARM (stage-aware reward modeling). Left: SARM consists of a stage estimator and a subtask estimator. The task stage is first predicted from the observations. This prediction is then passed to the subtask estimator, which predicts a scalar value for the progress within the stage. Right: The estimator architecture, which is replicated for both the stage estimator and the subtask estimator.
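The two-step inference described in this caption can be sketched as follows. This is a hedged illustration of the data flow only: the estimators are passed in as plain callables, and the one-hot conditioning, the clamping, and the equal-weight combination `(stage + tau) / num_stages` are assumptions made for the example rather than details taken from the paper.

```python
def predict_progress(observation, stage_estimator, subtask_estimator, num_stages):
    """Predict the stage first, then the progress within that stage.

    stage_estimator(observation)            -> one score per stage
    subtask_estimator(observation, one_hot) -> scalar progress within the stage
    Both callables are hypothetical stand-ins for the learned estimators.
    """
    stage_scores = stage_estimator(observation)
    stage = max(range(num_stages), key=lambda k: stage_scores[k])
    # Condition the second estimator on the predicted stage (assumed one-hot encoding).
    one_hot = [1.0 if k == stage else 0.0 for k in range(num_stages)]
    tau = subtask_estimator(observation, one_hot)
    tau = min(1.0, max(0.0, tau))                  # keep within-stage progress in [0, 1]
    # ASSUMPTION: every stage contributes equally to overall progress.
    return stage, (stage + tau) / num_stages


# Usage with stub estimators standing in for trained networks.
stage, progress = predict_progress(
    observation=None,
    stage_estimator=lambda obs: [0.1, 0.7, 0.2],
    subtask_estimator=lambda obs, one_hot: 0.5,
    num_stages=3,
)
print(stage, progress)  # stage 1, halfway through it: 1 and 0.5
```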


Results of SARM

Demo Data Estimation

Examples of SARM's predictions on demonstration data.



Video samples of SARM's predictions on demonstration data, showing accurate and semantically meaningful estimates.

Policy Rollout Estimation

Examples of SARM's predictions on policy rollouts. Compared with human demonstration data, policy rollouts are more challenging because they often include failure modes that are out-of-distribution (OOD), such as misgrasps, recovery attempts, and back-and-forth motions. The first example is a successful rollout in which the robot folds the T-shirt correctly, with only minor struggles and misgrasps in the first ten seconds. Here SARM remains stable, keeping the estimated progress near zero during these OOD motions. The second example is a failed rollout, shown with four key frames: (1) the T-shirt is flattened after some struggling, (2) folding is nearly complete, (3) the robot suddenly fails and crumples the T-shirt on the table, and (4) the unfolded T-shirt is placed in the corner. SARM provides reasonable progress estimates at all four key frames, reflecting the actual task status.

Leveraging SARM for Policy Training


Example rollout of a T-shirt folding policy trained with RA-BC.
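The abstract describes RA-BC as filtering and reweighting demonstrations based on reward estimates. A minimal sketch of one way such a scheme could work is given below. It is not the authors' exact rule: the linear ramp between the hypothetical thresholds `low` and `high`, the use of a per-sample progress gain, and the normalized weighted loss are all assumptions chosen for illustration.

```python
def rabc_weights(progress_gains, low, high):
    """Map each sample's estimated progress gain to a weight in [0, 1].

    progress_gains: reward-model progress at the end of an action chunk minus
                    progress at its start (assumed definition).
    low, high:      HYPOTHETICAL thresholds; gains <= low are filtered out
                    (weight 0), gains >= high receive full weight (weight 1).
    """
    span = high - low
    return [min(1.0, max(0.0, (g - low) / span)) for g in progress_gains]


def weighted_bc_loss(per_sample_losses, weights, eps=1e-8):
    """Weighted mean of behavior-cloning losses; eps guards an all-zero batch."""
    total = sum(w * l for w, l in zip(weights, per_sample_losses))
    return total / (sum(weights) + eps)


# Three samples: one that regresses, one with modest gain, one with strong gain.
weights = rabc_weights([-0.1, 0.05, 0.2], low=0.0, high=0.1)
loss = weighted_bc_loss([4.0, 2.0, 1.0], weights)
print(weights)  # the regressing sample gets weight 0.0 and drops out of the loss
```

With weights like these, samples the reward model judges unproductive contribute nothing to the gradient, while partially useful samples are down-weighted rather than discarded outright.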


Citations

If you find our work useful in your research, please consider citing:


@misc{chen2025sarm,
      title={SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation},
      author={Qianzhong Chen and Justin Yu and Mac Schwager and Pieter Abbeel and Fred Shentu and Philipp Wu},
      year={2025},
      eprint={2509.25358},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2509.25358}
}