Attention As A Compass: Efficient Exploration For Process-Supervised RL In Reasoning Models - Takara TLDR

Reinforcement Learning (RL) has shown remarkable success in enhancing the
reasoning capabilities of Large Language Models (LLMs). Process-Supervised RL
(PSRL) has emerged as a more effective paradigm than outcome-based RL.
However, existing PSRL approaches explore inefficiently, both in where they
branch and in how they sample. In this paper, we introduce AttnRL, a novel
PSRL framework that enables efficient exploration for reasoning models.
Motivated by preliminary observations that steps with high attention scores
correlate with reasoning behaviors, we propose to branch from positions with
high attention scores. Furthermore, we develop an adaptive sampling strategy
that accounts for problem difficulty and historical batch size, ensuring that
the whole training batch maintains non-zero advantage values. To further
improve sampling efficiency, we design a one-step off-policy training pipeline
for PSRL. Extensive experiments on multiple challenging mathematical reasoning
benchmarks demonstrate that our method consistently outperforms prior
approaches in performance as well as in sampling and training efficiency.
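
The two exploration ideas in the abstract can be made concrete with a small sketch. It is not the AttnRL implementation: the abstract does not say how step-level attention scores are computed or how advantages are normalized. The sketch assumes the scores arrive as a plain list of floats and that advantages are group-normalized (reward minus group mean, divided by group standard deviation), a common choice that is an assumption here. All function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def top_k_branch_steps(step_scores: Sequence[float], k: int) -> List[int]:
    """Pick the k steps with the highest scores as branching positions.

    step_scores is assumed to hold one attention-derived score per
    reasoning step; how that score is produced is outside this sketch.
    Indices are returned in ascending position order.
    """
    if k <= 0:
        return []
    ranked = sorted(range(len(step_scores)),
                    key=lambda i: step_scores[i], reverse=True)
    return sorted(ranked[:k])


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages (assumed form): (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_learning_signal(rewards: Sequence[float]) -> bool:
    """True when the rollouts of one problem do not all share the same reward.

    If every rollout gets the same reward, every group-normalized
    advantage is zero and the problem contributes no gradient.
    """
    return len(set(rewards)) > 1


if __name__ == "__main__":
    # Hypothetical per-step scores for a four-step response.
    print(top_k_branch_steps([0.1, 0.9, 0.3, 0.7], k=2))
    # A problem solved by every rollout yields only zero advantages.
    print(group_advantages([1.0, 1.0, 1.0]))
    print(has_learning_signal([1.0, 1.0, 1.0]))
```

Under these assumptions, a batch that keeps only problems passing `has_learning_signal` has non-zero advantage values throughout, which is the property the adaptive sampling strategy is described as maintaining.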
