Paper page - CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models

Recent advances in reinforcement learning (RL) with rule-based rewards have
significantly improved the reasoning capability of language models (LMs).
However, existing RL methods, such as GRPO, REINFORCE++, and RLOO, often
suffer from training instability, where large policy updates and improper
clipping can lead to training collapse. To address this issue, we propose
Clipped Policy Gradient Optimization with Policy Drift (CPGD), a novel
algorithm designed to stabilize policy learning in LMs. CPGD introduces a
policy drift constraint based on KL divergence to dynamically regularize policy
updates, and leverages a clip mechanism on the logarithm of the policy ratio
to prevent excessive policy updates. We provide theoretical justification for CPGD
and demonstrate through empirical analysis that it mitigates the instability
observed in prior approaches. Furthermore, we show that CPGD significantly
improves performance while maintaining training stability. Our implementation
balances theoretical rigor with practical usability, offering a robust
alternative for RL in the post-training of LMs. We release our code at
https://github.com/ModalMinds/MM-EUREKA.
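
To make the two ingredients named above concrete, here is a minimal sketch of a loss that clips the log of the policy ratio and adds a KL-style drift penalty. It is not the objective from the paper or the released code: the function name `cpgd_style_loss`, the parameters `eps` and `drift_coef`, the one-sided clipping rule, and the choice of KL estimator are all assumptions made for illustration.

```python
import math

def cpgd_style_loss(logp_new, logp_old, advantages, eps=0.2, drift_coef=0.1):
    """Illustrative per-token loss: clipped log-ratio term plus a drift penalty.

    Every name and default here is hypothetical, not taken from the paper.
    """
    upper = math.log(1.0 + eps)   # cap on the log-ratio when the advantage is positive
    lower = math.log(1.0 - eps)   # floor on the log-ratio when the advantage is negative
    pg_sum = 0.0
    drift_sum = 0.0
    n = 0
    for ln_new, ln_old, adv in zip(logp_new, logp_old, advantages):
        log_ratio = ln_new - ln_old
        # Assumed clipping rule: only limit moves in the direction the
        # advantage rewards, so the policy cannot over-shoot in one step.
        if adv >= 0:
            clipped = min(log_ratio, upper)
        else:
            clipped = max(log_ratio, lower)
        pg_sum += adv * clipped
        # Assumed drift term: a non-negative sample estimate of KL(old || new),
        # (r - 1) - log r with r = exp(log_ratio), for tokens sampled from old.
        drift_sum += math.exp(log_ratio) - 1.0 - log_ratio
        n += 1
    if n == 0:
        raise ValueError("need at least one token")
    # Minimizing this maximizes the clipped objective and penalizes drift.
    return -(pg_sum / n) + drift_coef * (drift_sum / n)
```

When the new and old log-probabilities agree, both terms vanish and the loss is zero. As the new policy moves away from the old one, the drift term grows regardless of the sign of the advantage, while the clipped term stops rewarding further movement once the log-ratio passes the bound.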
