Paper Page - Truncated Proximal Policy Optimization

T-PPO, an extension of PPO, improves training efficiency for Large Language Models by optimizing policy updates and utilizing hardware resources more effectively.

Recently, test-time scaling Large Language Models (LLMs) have demonstrated
exceptional reasoning capabilities across scientific and professional tasks by
generating long chains-of-thought (CoT). As a crucial component for developing
these reasoning models, reinforcement learning (RL), exemplified by Proximal
Policy Optimization (PPO) and its variants, allows models to learn through
trial and error. However, PPO can be time-consuming due to its inherent
on-policy nature, which is further exacerbated by increasing response lengths.
In this work, we propose Truncated Proximal Policy Optimization (T-PPO), a
novel extension to PPO that improves training efficiency by streamlining policy
update and length-restricted response generation. T-PPO mitigates the issue of
low hardware utilization, an inherent drawback of fully synchronized
long-generation procedures, where resources often sit idle during the waiting
periods for complete rollouts. Our contributions are two-folds. First, we
propose Extended Generalized Advantage Estimation (EGAE) for advantage
estimation derived from incomplete responses while maintaining the integrity of
policy learning. Second, we devise a computationally optimized mechanism that
allows for the independent optimization of the policy and value models. By
selectively filtering prompt and truncated tokens, this mechanism reduces
redundant computations and accelerates the training process without sacrificing
convergence performance. We demonstrate the effectiveness and efficacy of T-PPO
on AIME 2024 with a 32B base model. The experimental results show that T-PPO
improves the training efficiency of reasoning LLMs by up to 2.5x and
outperforms its existing competitors.

Source link

What's Hot

Competition heats up to challenge Nvidia’s AI chip dominance

Anthropic’s Claude AI can now automatically ‘remember’ past chats

Tencent’s AI model Hunyuan Image 3.0 tops leaderboard, beating Google’s Nano Banana

Paper page – Truncated Proximal Policy Optimization

FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents – Takara TLDR

Improving GUI Grounding with Explicit Position-to-Coordinate Mapping – Takara TLDR

Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation – Takara TLDR

Former ARTnews Publisher Dies at 97

National Gallery of Art Closes as a Result of Government Shutdown

Almine Rech Closes London Gallery After More Than a Decade

Record Exec and Art Collector Gets Over 4 Years

Competition heats up to challenge Nvidia’s AI chip dominance

Anthropic’s Claude AI can now automatically ‘remember’ past chats

Tencent’s AI model Hunyuan Image 3.0 tops leaderboard, beating Google’s Nano Banana

What's Hot

Paper page – Truncated Proximal Policy Optimization

Related Posts

Subscribe to Updates