Learn The Ropes, Then Trust The Wins: Self-imitation With Progressive Exploration For Agentic Reinforcement Learning - Takara TLDR

Reinforcement learning (RL) is the dominant paradigm for sharpening the strategic
tool-use capabilities of LLMs on long-horizon, sparsely rewarded agent tasks,
yet it faces the fundamental challenge of the exploration-exploitation trade-off.
Existing studies stimulate exploration through the lens of policy entropy, but
such mechanical entropy maximization is prone to RL training instability due to
multi-turn distribution shift. In this paper, we target a progressive
exploration-exploitation balance under the guidance of the agent's own
experiences without succumbing to either entropy collapse or runaway
divergence. We propose SPEAR, a curriculum-based self-imitation learning (SIL)
recipe for training agentic LLMs. It extends the vanilla SIL framework, where a
replay buffer stores self-generated promising trajectories for off-policy
update, by gradually steering the policy evolution within a well-balanced range
of entropy across stages. Specifically, our approach incorporates a curriculum
to manage the exploration process, utilizing intrinsic rewards to foster
skill-level exploration and facilitating action-level exploration through SIL.
Early in training, the auxiliary tool-call reward plays a critical role in the
accumulation of tool-use skills, enabling broad exposure to the unfamiliar
distributions of environment feedback with an upward entropy trend. As
training progresses, self-imitation is strengthened to exploit existing
successful patterns from replayed experiences for comparative action-level
exploration, accelerating solution iteration without unbounded entropy growth.
To further stabilize training, we recalibrate the advantages of experiences in
the replay buffer to address potential policy drift. Regularizations such
as the clipping of tokens with high covariance between probability and
advantage are introduced into trajectory-level entropy control to curb
over-confidence.
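
The abstract gives no formulas, so the following is only a minimal sketch of two of the ideas it describes: a replay buffer whose stored advantages are recalibrated against a current baseline, and a curriculum that shifts weight from the tool-call reward to self-imitation. Every name here (`SelfImitationBuffer`, `curriculum_weights`, `warmup_frac`) is hypothetical, and the median baseline, cosine decay, and linear ramp are assumptions chosen for illustration, not the schedules used by SPEAR.

```python
import math
from collections import deque
from statistics import median


class SelfImitationBuffer:
    """Hypothetical replay buffer that keeps rollouts beating a baseline."""

    def __init__(self, capacity=4, baseline_window=8):
        # Stored (trajectory, reward) pairs; the oldest is evicted first.
        self.items = deque(maxlen=capacity)
        # Recent on-policy rewards, used to estimate the current baseline.
        self.recent_rewards = deque(maxlen=baseline_window)

    def baseline(self):
        # Assumption: the median of recent rewards stands in for the baseline.
        return median(self.recent_rewards) if self.recent_rewards else 0.0

    def observe(self, trajectory, reward):
        """Record a rollout; store it only if it beats the baseline so far."""
        previous_baseline = self.baseline()
        self.recent_rewards.append(reward)
        if reward > previous_baseline:
            self.items.append((trajectory, reward))

    def sample_recalibrated(self):
        """Recompute advantages against the current baseline.

        Trajectories whose advantage is no longer positive are skipped,
        so stale successes stop driving the off-policy update.
        """
        current = self.baseline()
        return [(t, r - current) for t, r in self.items if r - current > 0]


def curriculum_weights(step, total_steps, warmup_frac=0.5):
    """Return (tool_reward_weight, sil_weight) for a training step.

    Assumed schedule: the tool-call bonus decays along a cosine curve,
    while the self-imitation weight ramps up linearly during warm-up.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    tool_weight = 0.5 * (1.0 + math.cos(math.pi * progress))
    sil_weight = min(progress / warmup_frac, 1.0)
    return tool_weight, sil_weight
```

In this sketch a trajectory that looked promising early can drop out of the sampled batch once the baseline rises, which is one simple way to read "recalibrating advantages" under policy drift. The covariance-based token clipping mentioned in the abstract is not covered here.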
