Paper page - Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Critique-GRPO, an RL framework combining numerical and natural language feedback, enhances LLM reasoning across tasks and outperforms existing methods.

Recent advances in reinforcement learning (RL) with numerical feedback, such
as scalar rewards, have significantly enhanced the complex reasoning
capabilities of large language models (LLMs). Despite this success, we identify
three key challenges encountered by RL with solely numerical feedback:
performance plateaus, limited effectiveness of self-reflection, and persistent
failures. We then demonstrate that RL-finetuned models, even after exhibiting
performance plateaus, can generate correct refinements on persistently failed
problems by leveraging natural language feedback in the form of critiques.
Building on this insight, we propose Critique-GRPO, an online RL framework that
integrates both natural language and numerical feedback for effective policy
optimization. Critique-GRPO enables LLMs to learn from initial responses and
critique-guided refinements simultaneously while maintaining exploration.
Extensive experiments using Qwen2.5-7B-Base and Qwen3-8B-Base show that
Critique-GRPO consistently outperforms supervised learning-based and RL-based
fine-tuning approaches across eight challenging mathematical, STEM, and general
reasoning tasks, improving average pass@1 scores by approximately 4.5% and 5%
for the two models, respectively. Notably, Critique-GRPO surpasses a strong baseline that
incorporates expert demonstrations within online RL. Further analysis reveals
two critical insights about policy exploration: (1) higher entropy does not
always guarantee efficient learning from exploration, and (2) longer responses
do not necessarily lead to more effective exploration.
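
To make the idea of learning from initial responses and critique-guided refinements simultaneously more concrete, the following is a minimal sketch of the group-relative advantage computation that GRPO-style methods use, applied to a pooled group of both response types. It is an illustration under stated assumptions, not the paper's objective: the function name `group_relative_advantages`, the reward values, and the choice to pool refinements with initial responses in a single group are all hypothetical, and the abstract does not specify how the two are weighted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each scalar reward within its sampling group (GRPO-style).

    Uses the population standard deviation; implementations vary on this choice.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for one prompt (1.0 = correct, 0.0 = incorrect).
initial_rewards = [0.0, 0.0, 1.0, 0.0]   # sampled initial responses
refinement_rewards = [1.0, 1.0]          # critique-guided refinements

# Assumption: both kinds of response are pooled into one group before normalizing.
group = initial_rewards + refinement_rewards
advantages = group_relative_advantages(group)

for reward, advantage in zip(group, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

In this sketch, correct responses (including the refinements) receive positive advantages and incorrect ones receive negative advantages, so a policy-gradient update would push probability toward the refined solutions on a problem the initial samples mostly failed. If every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal, which is one way a purely numerical reward can stall on persistently failed problems.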
