Paper Page - Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models

Think-RM is a framework that enhances generative reward models with long-horizon reasoning and a novel pairwise RLHF pipeline to improve end-policy performance in aligning large language models with human preferences.

Reinforcement learning from human feedback (RLHF) has become a powerful
post-training paradigm for aligning large language models with human
preferences. A core challenge in RLHF is constructing accurate reward signals,
where the conventional Bradley-Terry reward models (BT RMs) often suffer from
sensitivity to data size and coverage, as well as vulnerability to reward
hacking. Generative reward models (GenRMs) offer a more robust alternative by
generating chain-of-thought (CoT) rationales followed by a final reward.
However, existing GenRMs rely on shallow, vertically scaled reasoning, limiting
their capacity to handle nuanced or complex (e.g., reasoning-intensive) tasks.
Moreover, their pairwise preference outputs are incompatible with standard RLHF
algorithms that require pointwise reward signals. In this work, we introduce
Think-RM, a training framework that enables long-horizon reasoning in GenRMs by
modeling an internal thinking process. Rather than producing structured,
externally provided rationales, Think-RM generates flexible, self-guided
reasoning traces that support advanced capabilities such as self-reflection,
hypothetical reasoning, and divergent reasoning. To elicit these reasoning
abilities, we first warm up the model with supervised fine-tuning (SFT) over
long CoT data. We then further improve the model’s long-horizon abilities by
rule-based reinforcement learning (RL). In addition, we propose a novel
pairwise RLHF pipeline that directly optimizes policies using pairwise
preference rewards, eliminating the need for pointwise reward conversion and
enabling more effective use of Think-RM outputs. Experiments show that Think-RM
achieves state-of-the-art results on RM-Bench, outperforming both BT RM and
vertically scaled GenRM by 8%. When combined with our pairwise RLHF pipeline,
it demonstrates superior end-policy performance compared to traditional
approaches.
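
To make the contrast between pointwise and pairwise reward signals concrete, the following is a minimal sketch, not the paper's implementation. The first two functions show the standard Bradley-Terry formulation, in which a scalar reward is assigned to each response and the preference probability is the sigmoid of the reward difference. The third function, `rule_based_reward`, is a hypothetical illustration of what a rule-based RL signal for a generative reward model could look like (a verdict-matching rule is assumed here; the abstract does not specify the rule). All function and parameter names are illustrative.

```python
import math


def bt_preference_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over response B.

    Each response carries a pointwise scalar reward; the preference
    probability is the logistic sigmoid of the reward difference.
    """
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def bt_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human-preferred response under BT."""
    return -math.log(bt_preference_prob(reward_chosen, reward_rejected))


def rule_based_reward(predicted_verdict: str, human_label: str) -> float:
    """Hypothetical rule-based reward for a generative reward model.

    Assumption: the model ends its reasoning trace with a verdict such as
    "A" or "B", and the reward is 1.0 when the verdict matches the human
    preference label and 0.0 otherwise.
    """
    return 1.0 if predicted_verdict == human_label else 0.0


if __name__ == "__main__":
    # Equal pointwise rewards give an even preference.
    print(bt_preference_prob(0.0, 0.0))
    # A higher reward for the chosen response lowers the BT loss.
    print(bt_loss(2.0, 1.0), bt_loss(1.0, 2.0))
    # A pairwise verdict is scored directly against the label.
    print(rule_based_reward("A", "A"), rule_based_reward("B", "A"))
```

The BT model needs a scalar reward per response, whereas a GenRM emits only a verdict over a pair, which is the mismatch with pointwise RLHF algorithms that the abstract describes.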
