Paper page - ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs

ReasonFlux-PRM, a novel trajectory-aware Process Reward Model, evaluates reasoning traces with step-level and trajectory-level supervision, enhancing performance in model distillation, reinforcement learning, and test-time scaling.

Process Reward Models (PRMs) have recently emerged as a powerful framework
for supervising intermediate reasoning steps in large language models (LLMs).
Previous PRMs are primarily trained on models' final output responses and
struggle to evaluate intermediate thinking trajectories robustly, especially in
the emerging setting of trajectory-response outputs generated by frontier
reasoning models like DeepSeek-R1. In this work, we introduce ReasonFlux-PRM, a
novel trajectory-aware PRM explicitly designed to evaluate the
trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both
step-level and trajectory-level supervision, enabling fine-grained reward
assignment aligned with structured chain-of-thought data. We adapt
ReasonFlux-PRM to support reward supervision under both offline and online
settings, including (i) selecting high-quality model distillation data for
downstream supervised fine-tuning of smaller models, (ii) providing dense
process-level rewards for policy optimization during reinforcement learning,
and (iii) enabling reward-guided Best-of-N test-time scaling. Empirical results
on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond
demonstrate that ReasonFlux-PRM-7B selects higher-quality data than strong PRMs
(e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, our
derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving
average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement
learning, and 6.3% in test-time scaling. We also release our efficient
ReasonFlux-PRM-1.5B for resource-constrained applications and edge deployment.
Project: https://github.com/Gen-Verse/ReasonFlux
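
To make the offline data-selection and Best-of-N use cases above concrete, here is a minimal Python sketch. It is not the ReasonFlux-PRM implementation: the `ScoredTrace` container, the function names, and the aggregation rule (mean step-level reward plus a trajectory-level reward weighted by `alpha`) are all illustrative assumptions, and the scores are taken as already produced by some PRM.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ScoredTrace:
    """Hypothetical container for PRM scores on one trajectory-response pair."""
    text: str
    step_rewards: List[float]   # one score per intermediate thinking step
    trajectory_reward: float    # one score for the trace as a whole


def aggregate_reward(trace: ScoredTrace, alpha: float = 0.5) -> float:
    """Assumed aggregation: mean step reward + alpha * trajectory reward."""
    if not trace.step_rewards:
        # No step scores available: fall back to the trajectory-level term only.
        return alpha * trace.trajectory_reward
    mean_step = sum(trace.step_rewards) / len(trace.step_rewards)
    return mean_step + alpha * trace.trajectory_reward


def best_of_n(candidates: Sequence[ScoredTrace], alpha: float = 0.5) -> ScoredTrace:
    """Reward-guided Best-of-N: return the candidate with the highest reward."""
    if not candidates:
        raise ValueError("best_of_n needs at least one candidate")
    return max(candidates, key=lambda t: aggregate_reward(t, alpha))


def select_top_k(traces: Sequence[ScoredTrace], k: int, alpha: float = 0.5) -> List[ScoredTrace]:
    """Offline data selection: keep the k highest-reward traces for fine-tuning."""
    ranked = sorted(traces, key=lambda t: aggregate_reward(t, alpha), reverse=True)
    return ranked[:k]
```

In this sketch, `best_of_n` would be called on N sampled responses to a single prompt at inference time, while `select_top_k` would be run once over a pool of distillation traces before supervised fine-tuning. The weight `alpha` is a free parameter here and would need tuning in practice.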
