Paper Page - ReasonFlux-PRM: Trajectory-Aware PRMs For Long Chain-of-Thought Reasoning In LLMs

ReasonFlux-PRM, a novel trajectory-aware Process Reward Model, evaluates reasoning traces with step-level and trajectory-level supervision, enhancing performance in model distillation, reinforcement learning, and test-time scaling.

Process Reward Models (PRMs) have recently emerged as a powerful framework
for supervising intermediate reasoning steps in large language models (LLMs).
Previous PRMs are primarily trained on model final output responses and
struggle to evaluate intermediate thinking trajectories robustly, especially in
the emerging setting of trajectory-response outputs generated by frontier
reasoning models like Deepseek-R1. In this work, we introduce ReasonFlux-PRM, a
novel trajectory-aware PRM explicitly designed to evaluate the
trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both
step-level and trajectory-level supervision, enabling fine-grained reward
assignment aligned with structured chain-of-thought data. We adapt
ReasonFlux-PRM to support reward supervision under both offline and online
settings, including (i) selecting high-quality model distillation data for
downstream supervised fine-tuning of smaller models, (ii) providing dense
process-level rewards for policy optimization during reinforcement learning,
and (iii) enabling reward-guided Best-of-N test-time scaling. Empirical results
on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond
demonstrate that ReasonFlux-PRM-7B selects higher quality data than strong PRMs
(e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, our
derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving
average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement
learning, and 6.3% in test-time scaling. We also release our efficient
ReasonFlux-PRM-1.5B for resource-constrained applications and edge deployment.
Projects: https://github.com/Gen-Verse/ReasonFlux

Source link

What's Hot

Tesla faces new blockade in Sweden as IF Metall escalates dispute

Eve – AI-Driven Client Intake – Artificial Lawyer

SpaceVista: All-Scale Visual Spatial Reasoning from mm to km – Takara TLDR

Paper page – ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs

SpaceVista: All-Scale Visual Spatial Reasoning from mm to km – Takara TLDR

StreamingVLM: Real-Time Understanding for Infinite Video Streams – Takara TLDR

NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents – Takara TLDR

Toledo Museum of Art Director on Digital Art, AI, and Future-Proofing

Smithsonian Closes Museums Amid Government Shutdown

The Rubin Names 2025 Art Prize, Research and Art Projects Grants

Kochi-Muziris Biennial Announces 66 Artists for December Exhibition

Tesla faces new blockade in Sweden as IF Metall escalates dispute

Eve – AI-Driven Client Intake – Artificial Lawyer

SpaceVista: All-Scale Visual Spatial Reasoning from mm to km – Takara TLDR

What's Hot

Paper page – ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs

Related Posts

Subscribe to Updates