We introduce MetaStone-S1, a pioneering reflective generative model that enhances test-time scaling (TTS) through a new reflective generative form. This work makes three major contributions:
Reflective Generative Form: By sharing the backbone network between the policy and the process reward model (PRM), we develop a unified interface that efficiently integrates reasoning and evaluation, adding a PRM with only 53M extra parameters for efficient inference (see the first sketch after this list).
Self-supervised Process Reward Model: We introduce a novel self-supervised learning strategy that dynamically assigns outcome rewards to individual reasoning steps without the need for process-level annotations (a minimal illustration follows this list).
Scaling Law and Aha Moment: We empirically demonstrate a scaling law between reasoning computation and TTS performance, and observe an aha moment of the reflective generative form. Extensive evaluations on benchmarks such as AIME24, AIME25, LiveCodeBench, and C-EVAL show that MetaStone-S1 consistently achieves state-of-the-art performance compared with larger open-source and closed-source models.
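The sketch below illustrates how a lightweight PRM can reuse the policy model's hidden states: only a small scoring head is PRM-specific, which keeps the added parameter count small. This is a minimal PyTorch-style sketch under stated assumptions; the class name `SharedBackbonePRMHead`, the hidden size, and the two-layer head are illustrative and do not describe the released MetaStone-S1 architecture.

```python
import torch
import torch.nn as nn

class SharedBackbonePRMHead(nn.Module):
    """Hypothetical scoring head that reuses the policy model's hidden states.

    The shared policy backbone does the heavy lifting; only this small head is
    PRM-specific, which is how the extra parameter count can stay in the tens
    of millions rather than billions.
    """

    def __init__(self, hidden_size: int = 5120):
        super().__init__()
        # Illustrative two-layer head; the actual 53M-parameter PRM may differ.
        self.score = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.SiLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, hidden_states: torch.Tensor,
                step_end_positions: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden) from the shared policy backbone.
        # step_end_positions: (batch, num_steps) long indices of the last token
        # of each reasoning step; only those positions are scored.
        index = step_end_positions.unsqueeze(-1).expand(-1, -1, hidden_states.size(-1))
        step_states = hidden_states.gather(1, index)
        return torch.sigmoid(self.score(step_states)).squeeze(-1)  # (batch, num_steps)
```

In a design like this, the hidden states are already produced while the policy generates a trajectory, so scoring candidate reasoning paths at test time costs only the head's small forward pass rather than a second full model.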
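The second sketch illustrates one way outcome rewards can supervise step-level scores without process annotations: every step of a sampled trajectory inherits the verifiable correctness of its final answer as a training target. The labeling rule and the `step_bce_loss` helper are hypothetical simplifications for illustration; MetaStone-S1's dynamic assignment may weight or filter steps differently.

```python
import torch
import torch.nn.functional as F

def outcome_to_step_targets(outcome_correct: torch.Tensor, num_steps: int) -> torch.Tensor:
    # outcome_correct: (batch,) float tensor, 1.0 if the trajectory's final
    # answer is verifiably correct (e.g. exact match on a math benchmark).
    # Every reasoning step inherits the outcome as its target, so no human
    # process-level annotation is required.
    return outcome_correct.unsqueeze(1).expand(-1, num_steps)

def step_bce_loss(step_scores: torch.Tensor, outcome_correct: torch.Tensor) -> torch.Tensor:
    # step_scores: (batch, num_steps) PRM probabilities in [0, 1], e.g. from
    # the shared-backbone scoring head sketched above.
    targets = outcome_to_step_targets(outcome_correct, step_scores.size(1))
    return F.binary_cross_entropy(step_scores, targets)

# Example usage (names are illustrative):
# scores = prm_head(hidden_states, step_end_positions)
# loss = step_bce_loss(scores, outcome_correct)
```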
To foster community-driven research, we have open-sourced MetaStone-S1. Code, models, and resources are available at https://github.com/MetaStone-AI/MetaStone-S1.