Reinforcement learning with verifiable rewards (RLVR), which typically adopts
Pass@1 as the reward, struggles to balance exploration and exploitation,
causing policies to prefer conservative actions and converge to a local
optimum. Identifying an appropriate reward metric is therefore crucial.
Although Pass@k has been used as an evaluation metric in prior work, its
connection to the exploration ability of LLMs in RLVR has been largely overlooked. To
investigate this, we first use Pass@k as the reward to train the policy model
(i.e., $\textbf{Pass@k Training}$) and observe an improvement in its
exploration ability. Next, we derive an analytical solution for the advantage
of Pass@k Training, leading to an efficient and effective training process.
Building on this, our analysis reveals that exploration and exploitation are
not inherently conflicting objectives; rather, they can mutually enhance each
other. Moreover, Pass@k Training with an analytically derived advantage
essentially amounts to directly designing the advantage function. Inspired by
this, we conduct a preliminary exploration of advantage design for RLVR,
showing promising results and highlighting a potential future direction.
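
For concreteness, a minimal sketch of the quantity involved, assuming $n \ge k$ rollouts are sampled per prompt and $c$ of them are verified correct (this is the standard unbiased estimator of Pass@k, not necessarily the exact reward formulation used in Pass@k Training):
% Assumption: standard unbiased Pass@k estimator over n sampled rollouts with c verified correct.
\[
\text{Pass@}k \;=\; \mathbb{E}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right].
\]
Under a reward of this form, a group of rollouts is rewarded as long as at least one of the $k$ selected samples passes the verifier, which gives an intuition for why optimizing it encourages exploration.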