Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have
empowered large language models (LLMs) to tackle challenging reasoning tasks
such as mathematics and programming. RLVR leverages verifiable outcome rewards
to guide policy optimization, enabling LLMs to progressively improve output
quality in a grounded and reliable manner. Despite its promise, the RLVR
paradigm faces significant challenges: existing methods often suffer from
sparse reward signals and unstable policy gradient updates, particularly in
RL-based approaches. To address these challenges, we propose $\textbf{PACS}$, a
novel RLVR framework that achieves im$\textbf{P}$licit $\textbf{A}$ctor
$\textbf{C}$ritic coupling via a $\textbf{S}$upervised learning framework. By
treating the outcome reward as a predictable label, we reformulate the RLVR
problem into a supervised learning task over a score function parameterized by
the policy model and optimized using cross-entropy loss. A detailed gradient
analysis shows that this supervised formulation inherently recovers the
classical policy gradient update while implicitly coupling actor and critic
roles, yielding more stable and efficient training. On challenging
mathematical reasoning benchmarks, PACS outperforms strong RLVR baselines such
as PPO and GRPO. For instance, PACS achieves 59.78\% at pass@256 on AIME 2025,
improvements of 13.32 and 14.36 percentage points over PPO and GRPO,
respectively. This simple yet powerful framework offers a promising avenue for
LLM post-training with verifiable rewards. Our code and
data are available as open source at https://github.com/ritzz-ai/PACS.
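
As a rough illustration of the supervised reformulation described above, the sketch below treats the binary verifiable reward as a classification label and scores each prompt-response pair with a policy-derived scalar trained under a cross-entropy objective. The choice of the summed response log-probability as the score and the name `pacs_style_loss` are illustrative assumptions, not the exact parameterization defined in the paper.

```python
import torch
import torch.nn.functional as F

def pacs_style_loss(score, reward):
    """Supervised surrogate for RLVR (illustrative sketch, not the official PACS loss).

    score:  (batch,) scalar score per (prompt, response) pair, assumed here to be
            policy-derived, e.g. the summed log pi_theta(response | prompt).
    reward: (batch,) verifiable outcome reward in {0, 1}, treated as a label.
    """
    # Binary cross-entropy with logits: the sigmoid of the score acts as an
    # implicit value estimate (critic role), while gradients flow through the
    # policy-derived score (actor role), coupling both in a single objective.
    return F.binary_cross_entropy_with_logits(score, reward.float())

# Toy usage with policy-derived scores that require gradients.
scores = torch.tensor([-1.2, 0.4, -0.3], requires_grad=True)
labels = torch.tensor([0.0, 1.0, 1.0])
loss = pacs_style_loss(scores, labels)
loss.backward()  # gradient on each score is proportional to sigmoid(score) - reward
```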
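
The claim that the supervised formulation recovers a policy-gradient-style update can be made concrete with a short derivation, given here only as a sketch under the illustrative assumption that the score is $s_\theta(x,y)=\log \pi_\theta(y\mid x)$ and the label is the binary reward $r$; the actual score function and weighting used by PACS may differ.

\begin{align}
\mathcal{L}(\theta) &= -\,r \log \sigma\big(s_\theta(x,y)\big) - (1-r)\log\big(1-\sigma(s_\theta(x,y))\big), \\
\nabla_\theta \mathcal{L}(\theta) &= \big(\sigma(s_\theta(x,y)) - r\big)\,\nabla_\theta s_\theta(x,y) = -\big(r - \sigma(s_\theta(x,y))\big)\,\nabla_\theta \log \pi_\theta(y\mid x).
\end{align}

Under this assumption, the update takes the form of a policy gradient weighted by $r - \sigma(s_\theta)$, with $\sigma(s_\theta)$ acting as an implicit critic estimate, which is one way to read the actor-critic coupling described in the abstract.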