Paper Page - Rethinking The Sampling Criteria In Reinforcement Learning For LLM Reasoning: A Competence-Difficulty Alignment Perspective

Reinforcement learning shows promise for enhancing the reasoning abilities of large language models, yet it is hard to scale because of low sample efficiency during the rollout phase. Existing methods try to improve efficiency by scheduling problems according to their difficulty. However, these approaches suffer from unstable and biased estimates of problem difficulty and fail to capture the alignment between model competence and problem difficulty during RL training, leading to suboptimal results. To address these limitations, this paper introduces Competence-Difficulty Alignment Sampling (CDAS), which estimates problem difficulty accurately and stably by aggregating each problem’s historical performance discrepancies. Model competence is then quantified, and a fixed-point system is used to adaptively select problems whose difficulty aligns with the model’s current competence. Experimental results across a range of challenging mathematical benchmarks show that CDAS improves both accuracy and efficiency: it attains the highest average accuracy among the compared baselines and is substantially faster than Dynamic Sampling, a competitive strategy in DAPO, which is 2.33 times slower than CDAS.
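The abstract gives no formulas, so the following is only a minimal sketch of the general idea and should not be read as the paper's actual method. It makes several assumptions that are not in the text above: the class name `AlignmentSampler` and its methods are hypothetical, the "performance discrepancy" is taken to be the gap between an expected pass rate (a sigmoid of competence minus difficulty) and the observed pass rate, difficulty is the running mean of those gaps, and competence is the negative mean difficulty over all problems.

```python
import math


def sigmoid(z):
    """Logistic function, used here as an assumed link between
    (competence - difficulty) and an expected pass rate."""
    return 1.0 / (1.0 + math.exp(-z))


class AlignmentSampler:
    """Hypothetical sketch of competence-difficulty aligned sampling."""

    def __init__(self, problem_ids):
        # Every problem starts at difficulty 0 with no recorded rollouts.
        self.difficulty = {pid: 0.0 for pid in problem_ids}
        self.visits = {pid: 0 for pid in problem_ids}
        self.competence = 0.0

    def select(self, batch_size):
        """Pick the problems whose difficulty is closest to the current
        competence. Ties are broken by problem id for determinism."""
        ranked = sorted(
            self.difficulty,
            key=lambda pid: (abs(self.competence - self.difficulty[pid]), pid),
        )
        return ranked[:batch_size]

    def update(self, pass_rates):
        """Fold observed pass rates (problem id -> value in [0, 1]) into
        the difficulty estimates, then recompute competence."""
        for pid, observed in pass_rates.items():
            expected = sigmoid(self.competence - self.difficulty[pid])
            gap = expected - observed  # positive: harder than expected
            n = self.visits[pid] + 1
            # Running mean over all historical gaps (assumed aggregation).
            self.difficulty[pid] += (gap - self.difficulty[pid]) / n
            self.visits[pid] = n
        # Assumed competence definition: negative mean difficulty.
        self.competence = -sum(self.difficulty.values()) / len(self.difficulty)


sampler = AlignmentSampler(["a", "b", "c"])
sampler.update({"a": 1.0, "b": 0.0, "c": 0.5})
batch = sampler.select(2)
```

Averaging over the whole history, rather than using only the latest pass rate, is what would make the difficulty estimate less noisy from step to step in this sketch. Selecting by distance to competence means problems that are far too easy or far too hard for the current model are skipped.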
