Large-scale robot learning has recently shown promise in enabling robots to perform
complex tasks by integrating perception, control, and, optionally, language
understanding into a unified framework. However, such systems continue to struggle with
long-horizon, contact-rich manipulation tasks, such as the handling of deformable
objects, where supervision from demonstrations is often inconsistent in quality. In
such settings, reward modeling offers a natural solution: by providing grounded
progress signals, it can transform noisy demonstrations into stable supervision that
generalizes across diverse trajectories. In this work, we introduce a stage-aware,
video-based reward modeling framework that jointly predicts the high-level task
stage and fine-grained progress within each stage. Reward labels are automatically
derived from natural language subtask annotations, enabling consistent progress
estimation across variable-length and heterogeneous demonstrations. This design
overcomes the limitations of frame-index-based labeling, which collapses in long,
variable-duration tasks such as folding a T-shirt. Our reward model demonstrates
robustness to demonstration variability, generalization to out-of-distribution
scenarios, and strong utility for downstream policy training. Building upon this
reward model, we propose the Reward-Aligned Behavior Cloning (RA-BC) framework,
which selects high-quality data and reweights training samples
according to reward estimates. Extensive experiments demonstrate that the reward
model outperforms baselines on out-of-distribution real-robot policy rollouts and
human demonstration validation. Our approach achieves 83% success on folding
T-shirts from the flattened state and 67% from the crumpled state, substantially
surpassing vanilla behavior cloning, which attains only 8% and 0% success,
respectively, on the same training dataset. Overall, our results highlight reward
modeling as a key enabler for scalable, annotation-efficient, and robust imitation
learning in long-horizon robotic manipulation.
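
As a rough illustration of the two ideas above, the following minimal Python sketch derives stage-aware progress labels from subtask boundaries and turns reward estimates into sample weights. It is a sketch under stated assumptions, not the formulation used in this work: the function names `stage_aware_labels` and `reward_aligned_weights` are hypothetical, every stage is assumed to contribute an equal share of overall progress, and the clip-and-normalize weighting rule is one simple possibility chosen for illustration.

```python
def stage_aware_labels(stage_starts, num_frames):
    """Return one (stage, within_stage_progress, overall_progress) tuple per frame.

    stage_starts: first frame index of each annotated subtask, ascending,
    beginning at 0. ASSUMPTION: each stage is worth an equal share of the
    overall progress; other splits are possible.
    """
    num_stages = len(stage_starts)
    bounds = list(stage_starts) + [num_frames]
    labels = []
    for k in range(num_stages):
        start, end = bounds[k], bounds[k + 1]
        length = end - start
        for t in range(start, end):
            tau = (t - start) / length  # progress inside stage k, in [0, 1)
            labels.append((k, tau, (k + tau) / num_stages))
    return labels


def reward_aligned_weights(progress_deltas, min_delta=0.0, cap=1.0):
    """Map per-sample progress gains to non-negative training weights.

    ASSUMPTION (illustrative rule only): samples whose estimated progress gain
    is at or below min_delta get weight 0 (filtered out); the rest are weighted
    in proportion to their clipped gain, rescaled so the mean weight is 1.
    """
    clipped = [min(max(d - min_delta, 0.0), cap) for d in progress_deltas]
    total = sum(clipped)
    if total == 0.0:
        return [0.0] * len(clipped)
    n = len(clipped)
    return [c * n / total for c in clipped]
```

Under these assumptions, a 40-frame demonstration whose second subtask starts at frame 30 receives an overall progress of 0.5 at frame 30, whereas a frame-index label would assign roughly 0.77 at the same point. This is the kind of misalignment across variable-duration demonstrations that stage-aware labeling is meant to avoid.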