Paper Page - Absolute Zero: Reinforced Self-play Reasoning With Zero Data

Reinforcement learning with verifiable rewards (RLVR) has shown promise in
enhancing the reasoning capabilities of large language models by learning
directly from outcome-based rewards. Recent RLVR works that operate under the
zero setting avoid supervision in labeling the reasoning process, but still
depend on manually curated collections of questions and answers for training.
The scarcity of high-quality, human-produced examples raises concerns about the
long-term scalability of relying on human supervision, a challenge already
evident in the domain of language model pretraining. Furthermore, in a
hypothetical future where AI surpasses human intelligence, tasks provided by
humans may offer limited learning potential for a superintelligent system. To
address these concerns, we propose a new RLVR paradigm called Absolute Zero, in
which a single model learns to propose tasks that maximize its own learning
progress and improves reasoning by solving them, without relying on any
external data. Under this paradigm, we introduce the Absolute Zero Reasoner
(AZR), a system that self-evolves its training curriculum and reasoning ability
by using a code executor to both validate proposed code reasoning tasks and
verify answers, serving as a unified source of verifiable reward to guide
open-ended yet grounded learning. Despite being trained entirely without
external data, AZR achieves overall SOTA performance on coding and mathematical
reasoning tasks, outperforming existing zero-setting models that rely on tens
of thousands of in-domain human-curated examples. Furthermore, we demonstrate
that AZR can be effectively applied across different model scales and is
compatible with various model classes.
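
To make the role of the code executor concrete, the following is a minimal sketch of how one executor could both validate a proposed task and verify a solver's answer. It is an illustration under assumptions, not the AZR implementation: the abstract does not specify the task format, the validity checks, or the reward values, so the names `run_program`, `validate_task`, and `solver_reward`, the convention that a task is a program defining `f` plus one input, the determinism check, and the binary reward are all hypothetical choices made here.

```python
from typing import Any, Callable, Optional, Tuple


def run_program(source: str, arg: Any, func_name: str = "f") -> Tuple[bool, Any]:
    """Execute `source`, call `func_name(arg)`, and return (ok, result).

    Assumption: a task program defines a single function named `f`.
    NOTE: plain exec() is not sandboxed; use this on trusted toy code only.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        return True, namespace[func_name](arg)
    except Exception:
        # Syntax errors, missing `f`, and runtime errors all mark the run as failed.
        return False, None


def validate_task(source: str, arg: Any) -> Tuple[bool, Any]:
    """Accept a proposed task only if it runs cleanly and repeats its output.

    Running twice is a hypothetical, crude determinism check.
    """
    ok1, out1 = run_program(source, arg)
    ok2, out2 = run_program(source, arg)
    if ok1 and ok2 and out1 == out2:
        return True, out1
    return False, None


def solver_reward(
    source: str, arg: Any, solver: Callable[[str, Any], Any]
) -> Optional[float]:
    """Return 1.0 or 0.0 for the solver's answer, or None if the task is invalid."""
    valid, expected = validate_task(source, arg)
    if not valid:
        return None  # invalid proposals yield no solver reward in this sketch
    return 1.0 if solver(source, arg) == expected else 0.0


if __name__ == "__main__":
    program = "def f(x):\n    return sorted(x)[::-1]"
    # A stand-in "solver" that happens to predict the right output.
    print(solver_reward(program, [3, 1, 2], lambda src, a: [3, 2, 1]))
```

In this sketch the executor is the only source of ground truth: the expected answer comes from running the proposed program, so no human-written labels are involved. How the proposer itself is rewarded for choosing useful tasks is left out, since the abstract only states that tasks should maximize the model's learning progress.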
