As models increasingly leverage multi-step reasoning strategies to solve
complex problems, supervising the logical validity of these intermediate steps
has become a critical research challenge. Process reward models address this by
providing step-by-step feedback, but current approaches have two major
drawbacks: they typically function as classifiers without providing
explanations, and their reliance on supervised fine-tuning with static datasets
limits generalization. Inspired by recent advances, we reframe stepwise reward
modeling from a classification task to a reasoning task itself. We thus propose
a generative judge that reasons about the policy model’s reasoning steps (i.e.,
meta-reasons), outputting thinking tokens before delivering a final verdict.
Our model, StepWiser, is trained by reinforcement learning using relative
outcomes of rollouts. We show it provides (i) better judgment accuracy on
intermediate steps than existing methods; (ii) can be used to improve the
policy model at training time; and (iii) improves inference-time search.