Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM
reasoning, but training often oscillates between entropy collapse and
entropy explosion. We trace both hazards to the mean baseline used in
value-free RL (e.g., GRPO and DAPO), which improperly penalizes
negative-advantage samples under reward outliers. We propose Quantile
Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile
baseline. QAE induces a response-level, two-regime gate: on hard queries
(success rate p <= 1 - K) it reinforces rare successes, while on easy queries
(p > 1 - K) it
targets remaining failures. Under first-order softmax updates, we prove
two-sided entropy safety, giving lower and upper bounds on the one-step entropy
change that curb explosion and prevent collapse. Empirically, this minimal
modification stabilizes entropy, sparsifies credit assignment (with tuned K,
roughly 80% of responses receive zero advantage), and yields sustained pass@1
gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results
identify baseline design, rather than token-level heuristics, as the
primary mechanism for scaling RLVR.
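
To make the mechanism concrete, the following is a minimal sketch (not the paper's reference implementation) of a group-wise K-quantile baseline for binary verifiable rewards. The function name qae_advantages, the use of NumPy's "lower" quantile convention, and the omission of GRPO-style standard-deviation normalization are illustrative assumptions.

```python
import numpy as np

def qae_advantages(rewards: np.ndarray, k: float = 0.8) -> np.ndarray:
    """Sketch of Quantile Advantage Estimation for one query's response group.

    Instead of subtracting the group mean (as in GRPO/DAPO), subtract the
    group-wise K-quantile of the rewards. With binary verifiable rewards this
    yields the two-regime gate described in the abstract:
      - hard query (success rate p <= 1 - K): baseline = 0, so only the rare
        successes receive a positive advantage;
      - easy query (p > 1 - K): baseline = 1, so only the remaining failures
        receive a negative advantage.
    The quantile convention ("lower" here) and whether to keep GRPO's
    std normalization are assumptions of this sketch, not the paper's spec.
    """
    baseline = np.quantile(rewards, k, method="lower")  # group-wise K-quantile
    return rewards - baseline


# Toy example: groups of 8 rollouts with binary rewards, K = 0.8.
hard_group = np.array([0, 0, 0, 0, 0, 0, 1, 0], dtype=float)  # p = 1/8 <= 0.2
easy_group = np.array([1, 1, 1, 1, 0, 1, 1, 1], dtype=float)  # p = 7/8 >  0.2
print(qae_advantages(hard_group))  # only the lone success gets +1; rest are 0
print(qae_advantages(easy_group))  # only the lone failure gets -1; rest are 0
```

In this toy run, 7 of the 8 responses in each group receive exactly zero advantage, illustrating the sparsified credit assignment the abstract reports (roughly 80% zero-advantage responses with tuned K).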