The Entropy Mechanism Of Reinforcement Learning For Reasoning Language Models

arXiv:2505.22617v1 Announce Type: cross
Abstract: This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. Such phenomenon is consistently observed across vast RL runs without entropy intervention, where the policy entropy dropped sharply at the early training stage, this diminished exploratory ability is always accompanied with the saturation of policy performance. In practice, we establish a transformation equation R=-a*e^H+b between entropy H and downstream performance R. This empirical law strongly indicates that, the policy performance is traded from policy entropy, thus bottlenecked by its exhaustion, and the ceiling is fully predictable H=0, R=-a+b. Our finding necessitates entropy management for continuous exploration toward scaling compute for RL. To this end, we investigate entropy dynamics both theoretically and empirically. Our derivation highlights that, the change in policy entropy is driven by the covariance between action probability and the change in logits, which is proportional to its advantage when using Policy Gradient-like algorithms. Empirical study shows that, the values of covariance term and entropy differences matched exactly, supporting the theoretical conclusion. Moreover, the covariance term stays mostly positive throughout training, further explaining why policy entropy would decrease monotonically. Through understanding the mechanism behind entropy dynamics, we motivate to control entropy by restricting the update of high-covariance tokens. Specifically, we propose two simple yet effective techniques, namely Clip-Cov and KL-Cov, which clip and apply KL penalty to tokens with high covariances respectively. Experiments show that these methods encourage exploration, thus helping policy escape entropy collapse and achieve better downstream performance.

Source link

What's Hot

Cognition AI defies turbulence with a $400M raise at $10.2B valuation

Build and scale adoption of AI agents for education with Strands Agents, Amazon Bedrock AgentCore, and LibreChat

ASML Puts $1.5B Into Mistral AI, Becomes Largest Shareholder

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

LTLCrit: A Temporal Logic-based LLM Critic for Safe and Efficient Embodied Agents

From Imitation to Innovation: The Emergence of AI Unique Artistic Styles and the Challenge of Copyright Protection

VerifyLLM: LLM-Based Pre-Execution Task Plan Verification for Robots

New Banksy Work at London’s Royal Courts Immediately Covered Up

John Pritzker Donates 188 Dada and Surrealist Works to the Met Museum

British Museum Says Bayeux Tapestry Is Safe—and More Art News

Tony Shafrazi and the Art of the Comeback

Cognition AI defies turbulence with a $400M raise at $10.2B valuation

Build and scale adoption of AI agents for education with Strands Agents, Amazon Bedrock AgentCore, and LibreChat

ASML Puts $1.5B Into Mistral AI, Becomes Largest Shareholder

What's Hot

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Related Posts

Subscribe to Updates