Paper Page - A Minimalist Approach To LLM Reasoning: From Rejection Sampling To Reinforce

Reinforcement learning (RL) has become a prevailing approach for fine-tuning
large language models (LLMs) on complex reasoning tasks. Among recent methods,
GRPO stands out for its empirical success in training models such as
DeepSeek-R1, yet the sources of its effectiveness remain poorly understood. In
this work, we revisit GRPO from a reinforce-like algorithm perspective and
analyze its core components. Surprisingly, we find that a simple rejection
sampling baseline, RAFT, which trains only on positively rewarded samples,
yields performance competitive with GRPO and PPO. Our ablation studies reveal
that GRPO’s main advantage arises from discarding prompts with entirely
incorrect responses, rather than from its reward normalization. Motivated by
this insight, we propose Reinforce-Rej, a minimal extension of policy gradient
that filters both entirely incorrect and entirely correct samples.
Reinforce-Rej improves KL efficiency and stability, serving as a lightweight
yet effective alternative to more complex RL algorithms. We advocate RAFT as a
robust and interpretable baseline, and suggest that future advances should
focus on more principled designs for incorporating negative samples, rather
than relying on them indiscriminately. Our findings provide guidance for future
work in reward-based LLM post-training.
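
The two sample-selection rules described above can be illustrated with a small sketch. The abstract gives no implementation details, so everything here is an assumption made for illustration: the data layout (a dict mapping each prompt to `(response, reward)` pairs), the binary 0/1 rewards, and the function names `raft_filter` and `reinforce_rej_filter`. It covers only which samples are kept for training, not the gradient update itself.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: prompt -> list of (response, reward) pairs,
# with binary rewards assumed (1 = correct, 0 = incorrect).
Samples = Dict[str, List[Tuple[str, int]]]


def raft_filter(samples: Samples) -> Samples:
    """RAFT-style selection: keep only positively rewarded responses."""
    kept: Samples = {}
    for prompt, responses in samples.items():
        positives = [(resp, rew) for resp, rew in responses if rew > 0]
        if positives:  # prompts with no correct response contribute nothing
            kept[prompt] = positives
    return kept


def reinforce_rej_filter(samples: Samples) -> Samples:
    """Reinforce-Rej-style selection: drop prompts whose responses are
    entirely incorrect or entirely correct, keeping only mixed prompts."""
    kept: Samples = {}
    for prompt, responses in samples.items():
        distinct_rewards = {rew for _, rew in responses}
        if len(distinct_rewards) > 1:  # both correct and incorrect present
            kept[prompt] = responses
    return kept


batch: Samples = {
    "p_all_wrong": [("a", 0), ("b", 0)],
    "p_mixed": [("c", 1), ("d", 0)],
    "p_all_right": [("e", 1), ("f", 1)],
}
raft_kept = raft_filter(batch)
rej_kept = reinforce_rej_filter(batch)
```

In this toy batch, the RAFT-style rule keeps the correct responses from `p_mixed` and `p_all_right`, while the Reinforce-Rej-style rule keeps only `p_mixed`, with both its correct and incorrect responses, for the policy-gradient step.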
