Critic-free reinforcement learning methods, particularly group policy optimization methods, have attracted considerable attention for their efficiency on complex tasks. However, these methods rely heavily on repeated sampling and within-group comparisons to estimate the advantage, which can trap the policy in a local optimum and increases computational cost. To address these issues, we propose
PVPO, an efficient reinforcement learning method enhanced by an advantage
reference anchor and data pre-sampling. Specifically, we have the reference
model perform rollouts in advance and use the resulting reward score as a
reference anchor. Our approach effectively corrects the cumulative bias
introduced by intra-group comparisons and significantly reduces the dependence
on a large number of rollouts. Meanwhile, the reference model can assess sample difficulty
during data pre-sampling, enabling effective selection of high-gain data to
improve training efficiency. Experiments conducted on nine datasets across two
domains demonstrate that PVPO achieves state-of-the-art (SOTA) performance. Our
approach not only generalizes robustly across multiple tasks, but also scales
effectively across models of varying sizes.
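
As a rough illustration of the reference-anchor idea (not the paper's exact formulation; the symbols $r_i$, $G$, and $r_{\mathrm{ref}}$ are our own), group-based methods such as GRPO typically normalize each rollout's reward against within-group statistics,
\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)},
\]
whereas an anchor-based variant can instead baseline each reward against the pre-computed reward of the reference model's rollout,
\[
\hat{A}_i = r_i - r_{\mathrm{ref}},
\]
so that the baseline no longer depends on the current group of samples.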
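
A minimal sketch of how the data pre-sampling step could work, assuming a reference model with a `generate` method and a task-specific `reward_fn`; all names, signatures, and thresholds below are illustrative assumptions, not PVPO's actual implementation:

```python
# Illustrative sketch of reference-model data pre-sampling.
# All names, signatures, and thresholds are assumptions for exposition,
# not PVPO's actual implementation.

def presample(dataset, reference_model, reward_fn,
              n_rollouts=4, low=0.1, high=0.9):
    """Keep examples whose reference-model success rate suggests high
    training gain (neither trivially easy nor near-impossible), and
    cache the mean reward for reuse as the advantage anchor."""
    selected, anchors = [], {}
    for example in dataset:
        rollouts = [reference_model.generate(example["prompt"])
                    for _ in range(n_rollouts)]
        scores = [reward_fn(example, rollout) for rollout in rollouts]
        anchor = sum(scores) / len(scores)
        if low <= anchor <= high:   # medium-difficulty, high-gain sample
            selected.append(example)
            anchors[example["id"]] = anchor
    return selected, anchors
```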