VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

[Submitted on 7 Apr 2025 (v1), last revised 11 Apr 2025 (this version, v3)]

Authors: Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, Xiangpeng Wei, Xiangyu Yu, Gaohong Liu, Juncai Liu, Lingjun Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Ru Zhang, Xin Liu, Mingxuan Wang, Yonghui Wu, Lin Yan

Abstract: We present VAPO (Value-based Augmented Proximal Policy Optimization), a novel framework tailored for reasoning models within the value-based paradigm. Benchmarked on the AIME 2024 dataset, VAPO, built on the Qwen 32B pre-trained model, attains a state-of-the-art score of $\mathbf{60.4}$. In direct comparison under identical experimental settings, VAPO outperforms the previously reported results of DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points. The training process of VAPO stands out for its stability and efficiency: it reaches state-of-the-art performance within a mere 5,000 steps, and no training crashes occur across multiple independent runs, underscoring its reliability. This research delves into long chain-of-thought (long-CoT) reasoning using a value-based reinforcement learning framework. We pinpoint three key challenges that plague value-based methods: value model bias, heterogeneous sequence lengths, and sparse reward signals. Through systematic design, VAPO offers an integrated solution that effectively alleviates these challenges, enabling enhanced performance in long-CoT reasoning tasks.
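As background for the value-based paradigm mentioned in the abstract, the following is a minimal sketch of generalized advantage estimation (GAE), the standard advantage estimator used with PPO. It is not VAPO's method, which the abstract does not specify. The function name `gae_advantages`, the four-token toy response, and the values of `gamma` and `lam` are assumptions chosen for illustration. The sketch shows how a sparse, end-of-sequence reward reaches earlier tokens only through the value estimates and the decay factor, which is why value model bias and sequence length matter.

```python
from typing import List


def gae_advantages(
    rewards: List[float],
    values: List[float],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> List[float]:
    """Per-token GAE advantages for one finished response.

    rewards[t] is the reward after token t; values[t] is the value
    model's estimate at token t. The state after the last token is
    treated as terminal, with value 0.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")

    advantages = [0.0] * len(rewards)
    next_value = 0.0  # terminal state has zero value
    running = 0.0     # discounted sum of future TD errors
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at token t.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Toy response (hypothetical): only the final token is rewarded.
sparse_rewards = [0.0, 0.0, 0.0, 1.0]

# A value model that already predicts the outcome yields zero advantage.
accurate = gae_advantages(sparse_rewards, [1.0, 1.0, 1.0, 1.0])

# A value model biased to zero pushes the full signal into the advantages;
# with lam < 1 that signal decays for tokens far from the reward.
biased_decayed = gae_advantages(sparse_rewards, [0.0, 0.0, 0.0, 0.0], lam=0.95)
biased_undecayed = gae_advantages(sparse_rewards, [0.0, 0.0, 0.0, 0.0], lam=1.0)
```

In this toy setting, a decay factor below 1 shrinks the advantage of the first token by `lam ** 3`. For responses of thousands of tokens the same decay would leave almost no signal at early positions, which is one way sequence length interacts with sparse rewards.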

Submission history

From: Yu Yue
[v1] Mon, 7 Apr 2025 14:21:11 UTC (847 KB)
[v2] Tue, 8 Apr 2025 03:06:22 UTC (847 KB)
[v3] Fri, 11 Apr 2025 02:54:58 UTC (847 KB)
