SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation

Fine-tuning vision-language-action (VLA) policies for long-horizon manipulation still largely depends on behavior cloning, which requires costly high-quality demonstrations and limits policies to the demonstration distribution. Reward models can reduce this dependence by reweighting demonstrations and providing dense supervision for on-robot reinforcement learning (RL). However, an effective reward model must be dense, accurate, and general. Existing methods do not satisfy all three: task-specific stage-aware models are dense and accurate but require per-task annotations, while general vision-language-model (VLM) based reward models are broadly applicable but too coarse and noisy to track fine-grained long-horizon progress. To close this gap, we introduce SARM2, a multi-task stage-aware reward model that pairs a general action-primitive based stage estimator with a multi-gate Mixture-of-Experts (MMoE) value head, producing dense per-step rewards across multiple manipulation tasks. Building on SARM2’s accurate rewards, we further design SPIRAL (Self-Policy Improvement via Reward-Aligned Learning), an on-policy reward-guided framework that refines a VLA policy from cheap autonomous rollouts. On a 10-task benchmark, SARM2 reduces value-estimation MSE by 80% over the strongest baselines; plugged into SPIRAL, it raises task success from roughly 50% to 90% or higher on both Folding Shorts (58% → 100%) and Cleaning Whiteboard (50% → 90%), indicating that high-quality dense rewards are a critical ingredient for a stable robot data flywheel.

Abstract

Overview of SARM2
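To make the "multi-gate Mixture-of-Experts (MMoE) value head" from the abstract more concrete, here is a minimal sketch of the generic MMoE pattern: shared experts, one softmax gate per context, and a small output tower per gate. It is not the SARM2 implementation. The class name `MMoEValueHead`, the layer sizes, the random initialization, the sigmoid output in (0, 1), and the idea that `gate_id` indexes a stage or task are all assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Dense layer: weights is a list of rows, each as long as x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class MMoEValueHead:
    """Hypothetical MMoE head: shared experts, one gate and tower per gate_id."""

    def __init__(self, in_dim, expert_dim, num_experts, num_gates, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        # Experts are shared across all gates.
        self.experts = [(mat(expert_dim, in_dim), [0.0] * expert_dim)
                        for _ in range(num_experts)]
        # Each gate produces one mixing weight per expert.
        self.gates = [(mat(num_experts, in_dim), [0.0] * num_experts)
                      for _ in range(num_gates)]
        # Each gate has its own scalar output tower.
        self.towers = [(mat(1, expert_dim), [0.0]) for _ in range(num_gates)]

    def gate_weights(self, features, gate_id):
        w, b = self.gates[gate_id]
        return softmax(linear(w, b, features))

    def value(self, features, gate_id):
        """Return a scalar in (0, 1), read here as estimated progress (assumption)."""
        expert_outs = [relu(linear(w, b, features)) for w, b in self.experts]
        mix = self.gate_weights(features, gate_id)
        # Gate-weighted sum of expert outputs, dimension by dimension.
        mixed = [sum(g * out[j] for g, out in zip(mix, expert_outs))
                 for j in range(len(expert_outs[0]))]
        w, b = self.towers[gate_id]
        return sigmoid(linear(w, b, mixed)[0])


head = MMoEValueHead(in_dim=4, expert_dim=3, num_experts=2, num_gates=3)
progress = head.value([0.2, -0.1, 0.5, 0.3], gate_id=1)
```

The design point this pattern illustrates is that the experts are shared while the gates are not, so different contexts can reuse the same features with different mixing weights. How SARM2 actually conditions its gates is not specified in the text above.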

Results of SARM2 Value Estimation
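As a reading aid for the "reduces value-estimation MSE by 80%" claim in the abstract, the sketch below shows how a mean squared error between predicted and reference values, and a relative reduction against a baseline, could be computed. The helper names `mse` and `relative_reduction` are hypothetical, and the numbers in the usage lines are made-up placeholders, not results from the benchmark.

```python
def mse(predicted, reference):
    """Mean squared error between two equal-length sequences of values."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted)


def relative_reduction(baseline_error, model_error):
    """Fraction by which model_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - model_error) / baseline_error


# Placeholder values for illustration only (not benchmark data).
reference = [0.0, 0.25, 0.5, 0.75, 1.0]
baseline_pred = [0.2, 0.1, 0.7, 0.5, 0.8]
model_pred = [0.05, 0.2, 0.55, 0.7, 0.95]

reduction = relative_reduction(mse(baseline_pred, reference),
                               mse(model_pred, reference))
```

Under this definition, an 80% reduction means the model's MSE is one fifth of the baseline's MSE.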

Stage Estimation

Demo Data Estimation

Policy Rollout Estimation

Folding Shorts

Cleaning Whiteboard

SPIRAL Workflow

Policy Improvement with SPIRAL: Cleaning Whiteboard

Base BC Policy: Troublesome Rollouts

SPIRAL Improvement

Policy Improvement with SPIRAL: Folding Shorts

Fold Flat Shorts

Fold Crumpled Shorts

Uninterrupted Continual Rollout of SPIRAL-Improved Policies

Citations