Fine-tuning vision-language-action (VLA) policies for long-horizon
manipulation still largely depends on behavior cloning, which requires costly
high-quality demonstrations and limits policies to the demonstration distribution.
Reward models can reduce this dependence by reweighting demonstrations and
providing dense supervision for on-robot reinforcement learning (RL). However,
an effective reward model must be dense, accurate, and general. Existing
methods do not satisfy all three: task-specific stage-aware models are dense and
accurate but require per-task annotations, while general reward models based on
vision-language models (VLMs) are broadly applicable but too coarse and noisy to
track fine-grained long-horizon progress. To close this gap, we introduce SARM2,
a multi-task stage-aware reward model that pairs a general action-primitive-based
stage estimator with a multi-gate Mixture-of-Experts (MMoE) value head, producing
dense per-step rewards across multiple manipulation tasks. Building on this
reward model, we further design SPIRAL (Self-Policy Improvement via
Reward-Aligned Learning), an on-policy reward-guided framework that refines a VLA
policy from cheap autonomous rollouts using SARM2’s dense rewards. On a
10-task benchmark, SARM2 reduces value-estimation MSE by 80% relative to the
strongest baselines; plugged into SPIRAL, it raises task success from 58% to 100%
on Folding Shorts and from 50% to 90% on Cleaning Whiteboard, indicating that
high-quality dense rewards are a critical ingredient for a stable robot data
flywheel.