Reinforcement Learning on Pre-Training Data - Takara TLDR

The growing disparity between the exponential scaling of computational
resources and the finite growth of high-quality text data now constrains
conventional scaling approaches for large language models (LLMs). To address
this challenge, we introduce Reinforcement Learning on Pre-Training data
(RLPT), a new training-time scaling paradigm for optimizing LLMs. In contrast
to prior approaches that scale training primarily through supervised learning,
RLPT enables the policy to autonomously explore meaningful trajectories to
learn from pre-training data and improve its capability through reinforcement
learning (RL). While existing RL strategies such as reinforcement learning from
human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR)
rely on human annotation for reward construction, RLPT eliminates this
dependency by deriving reward signals directly from pre-training data.
Specifically, it adopts a next-segment reasoning objective, rewarding the
policy for accurately predicting subsequent text segments conditioned on the
preceding context. This formulation allows RL to be scaled on pre-training
data, encouraging the exploration of richer trajectories across broader
contexts and thereby fostering more generalizable reasoning skills. Extensive
experiments on both general-domain and mathematical reasoning benchmarks across
multiple models validate the effectiveness of RLPT. For example, when applied
to Qwen3-4B-Base, RLPT yields absolute improvements of 3.0, 5.1, 8.1,
6.0, 6.6, and 5.3 on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and
AIME25, respectively. The results further demonstrate favorable scaling
behavior, suggesting strong potential for continued gains with more compute. In
addition, RLPT provides a solid foundation, extending the reasoning boundaries
of LLMs and enhancing RLVR performance.
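
The next-segment reasoning objective described above can be illustrated with a small sketch. It only shows the shape of the idea: unlabeled text is cut into segments, each segment becomes a prediction target given the text before it, and the reward is computed from the text itself with no human annotation. The abstract does not specify the segmentation rule or the reward function, so both are assumptions here: sentence-level splitting and a token-overlap F1 score stand in for whatever RLPT actually uses, and the names `split_segments`, `build_examples`, and `overlap_reward` are hypothetical.

```python
import re
from collections import Counter


def split_segments(text):
    """Split text into sentence-level segments (assumed granularity)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_examples(text, min_context=1):
    """Pair each segment with all preceding segments as its context."""
    segs = split_segments(text)
    return [
        (" ".join(segs[:i]), segs[i])
        for i in range(min_context, len(segs))
    ]


def overlap_reward(prediction, reference):
    """Token-level F1 in [0, 1]; a stand-in reward, not the paper's."""
    pred = re.findall(r"\w+", prediction.lower())
    ref = re.findall(r"\w+", reference.lower())
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts tokens shared by both segments.
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    corpus = "The sky is blue. Grass is green. Snow is white."
    for context, target in build_examples(corpus):
        # In an RL loop the policy would generate this guess from `context`.
        guess = target
        print(repr(context), "->", repr(target), overlap_reward(guess, target))
```

In a real training loop the `guess` would be sampled from the policy after it reasons over the context, and the scalar returned by the reward function would drive the policy update. A lexical-overlap score is used here only because it is easy to compute from raw text; it penalizes correct paraphrases, so a semantic comparison would likely be a better fit.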
