Scaling Up Multi-Turn Off-Policy RL And Multi-Agent Tree Search For LLM Step-Provers - Takara TLDR

The integration of Large Language Models (LLMs) into automated theorem
proving has shown immense promise, yet is fundamentally constrained by
challenges in scaling up both training-time reinforcement learning (RL) and
inference-time compute. This paper introduces BFS-Prover-V2, a system
designed to address this dual scaling problem. We present two primary
innovations. The first is a novel multi-turn off-policy RL framework for
continually improving the performance of LLM step-provers at training time. This
framework, inspired by the principles of AlphaZero, utilizes a multi-stage
expert iteration pipeline featuring adaptive tactic-level data filtering and
periodic retraining to surmount the performance plateaus that typically curtail
long-term RL in LLM-based agents. The second innovation is a planner-enhanced
multi-agent search architecture that scales reasoning capabilities at inference
time. This architecture employs a general reasoning model as a high-level
planner to iteratively decompose complex theorems into a sequence of simpler
subgoals. This hierarchical approach substantially reduces the search space,
enabling a team of parallel prover agents to collaborate efficiently by
leveraging a shared proof cache. We demonstrate that this dual approach to
scaling yields state-of-the-art results on established formal mathematics
benchmarks. BFS-Prover-V2 achieves 95.08% and 41.4% on the MiniF2F
and ProofNet test sets respectively. While demonstrated in the domain of formal
mathematics, the RL and inference techniques presented in this work are of
broader interest and may be applied to other domains requiring long-horizon
multi-turn reasoning and complex search.
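
The planner-and-provers arrangement described in the abstract can be pictured with a small sketch. This is not the paper's implementation: the names `SharedProofCache`, `prove_with_plan`, `planner`, and `prover` are hypothetical, the planner and prover are plain callables standing in for the reasoning model and the step-prover, and threads stand in for parallel prover agents. It only illustrates the general idea of decomposing a theorem into subgoals, letting several agents work on them concurrently, reusing proofs through a shared cache, and asking the planner for a revised plan when some subgoals fail.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock
from typing import Callable, Dict, List, Optional


class SharedProofCache:
    """Thread-safe map from a subgoal statement to a proof found by any agent."""

    def __init__(self) -> None:
        self._proofs: Dict[str, str] = {}
        self._lock = Lock()

    def get(self, subgoal: str) -> Optional[str]:
        with self._lock:
            return self._proofs.get(subgoal)

    def put(self, subgoal: str, proof: str) -> None:
        with self._lock:
            # Keep the first proof recorded for a subgoal.
            self._proofs.setdefault(subgoal, proof)


def prove_with_plan(
    theorem: str,
    planner: Callable[[str, List[str]], List[str]],  # hypothetical planner interface
    prover: Callable[[str], Optional[str]],          # hypothetical prover interface
    cache: SharedProofCache,
    num_agents: int = 4,
    max_rounds: int = 3,
) -> Optional[List[str]]:
    """Return one proof per subgoal, or None if no plan succeeds in max_rounds."""

    def work(subgoal: str) -> Optional[str]:
        cached = cache.get(subgoal)
        if cached is not None:
            return cached  # another agent (or an earlier round) already proved it
        proof = prover(subgoal)
        if proof is not None:
            cache.put(subgoal, proof)
        return proof

    failed: List[str] = []
    for _ in range(max_rounds):
        # The planner sees which subgoals failed last round and may revise the plan.
        subgoals = planner(theorem, failed)
        with ThreadPoolExecutor(max_workers=num_agents) as pool:
            proofs = list(pool.map(work, subgoals))  # results keep subgoal order
        failed = [g for g, p in zip(subgoals, proofs) if p is None]
        if not failed:
            return [p for p in proofs if p is not None]
    return None
```

In this sketch the cache is keyed by the subgoal text, so a subgoal that reappears in a revised plan is not searched again. A real system would need a notion of subgoal identity that is robust to formatting differences, which is left out here.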
