Tree Search for LLM Agent Reinforcement Learning - Takara TLDR

Recent advances in reinforcement learning (RL) have significantly enhanced
the agentic capabilities of large language models (LLMs). In long-horizon,
multi-turn agent tasks, existing approaches driven solely by outcome rewards
often suffer from sparse supervision. To address this challenge,
we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped
agent RL method based on tree search, where each tree node represents a
complete agent interaction step. By sharing common prefixes, tree-search
sampling increases the number of rollouts achievable within a fixed budget of
tokens or tool calls. Moreover, we find that the tree-structured trajectory
naturally allows the construction of step-wise process supervision signals
from the outcome reward alone. Based on this, Tree-GRPO estimates the grouped
relative advantages at both the intra-tree and inter-tree levels. Through
theoretical analysis, we demonstrate that the objective of intra-tree level
group relative policy optimization is equivalent to that of step-level direct
preference learning. Experiments across 11 datasets and 3 types of QA tasks
demonstrate the superiority of the proposed tree-based RL over the chain-based
RL method.
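
The two-level advantage estimate described above can be illustrated with a small sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: each tree is reduced to the list of outcome rewards at its leaves, rewards are normalized by mean and standard deviation within a group (as in standard GRPO), and the intra-tree and inter-tree terms are simply added. The names `normalize` and `tree_grpo_advantages` are hypothetical and do not come from the paper.

```python
from statistics import mean, pstdev


def normalize(rewards, eps=1e-6):
    """Group-relative normalization: subtract the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def tree_grpo_advantages(trees):
    """Combine intra-tree and inter-tree advantages for each leaf rollout.

    trees: list of lists, one inner list of leaf outcome rewards per tree.
    Returns a list with the same shape holding one advantage per leaf.
    Assumption: the two levels are combined by simple addition.
    """
    # Inter-tree level: every leaf is compared against all leaves of all trees.
    all_rewards = [r for tree in trees for r in tree]
    inter = normalize(all_rewards)

    advantages = []
    offset = 0
    for tree in trees:
        # Intra-tree level: leaves are compared only with leaves that share
        # prefixes in the same tree, which gives a step-level preference
        # signal even though only outcome rewards are observed.
        intra = normalize(tree)
        advantages.append([a + inter[offset + k] for k, a in enumerate(intra)])
        offset += len(tree)
    return advantages


if __name__ == "__main__":
    # Two trees with two leaves each; only one leaf in the first tree succeeds.
    print(tree_grpo_advantages([[1.0, 0.0], [0.0, 0.0]]))
```

In this toy input the successful leaf receives a positive advantage from both levels, its sibling is penalized at both levels, and the second tree, whose leaves all share the same reward, carries no intra-tree signal and is ranked only through the inter-tree term.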
