This paper introduces a simple and scalable approach to improve the data
efficiency of large language model (LLM) training by augmenting existing text
data with thinking trajectories. The compute for pre-training LLMs has been
growing at an unprecedented rate, while the availability of high-quality data
remains limited. Consequently, maximizing the utility of available data
constitutes a significant research challenge. A primary impediment is that
certain high-quality tokens are difficult to learn given a fixed model
capacity, as the underlying rationale for a single token can be exceptionally
complex and deep. To address this issue, we propose Thinking Augmented
Pre-Training (TPT), a universal methodology that augments text with
automatically generated thinking trajectories. Such augmentation effectively
increases the volume of the training data and makes high-quality tokens more
learnable through step-by-step reasoning and decomposition. We apply TPT across
diverse training configurations of up to $100$B tokens, encompassing pre-training
with both constrained and abundant data, as well as mid-training from strong
open-source checkpoints. Experimental results indicate that our method
substantially improves the performance of LLMs across various model sizes and
families. Notably, TPT enhances the data efficiency of LLM pre-training by a
factor of $3$. For a $3$B-parameter model, it improves post-training
performance by over $10\%$ on several challenging reasoning benchmarks.
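
To make the augmentation step concrete, the following is a minimal sketch of how a document might be paired with an automatically generated thinking trajectory before standard next-token training. The generator model, the prompt wording, and the `augment_document` helper are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of thinking augmentation (illustrative assumptions only).
# Any open-weight instruction-tuned model can serve as the thinking
# generator; the prompt below is hypothetical, not the paper's prompt.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

THINKING_PROMPT = (
    "Analyze the following text and write out, step by step, the reasoning "
    "an expert would use to understand and derive it:\n\n{document}"
)

def augment_document(document: str, max_new_tokens: int = 1024) -> str:
    """Return the document concatenated with a generated thinking trajectory."""
    prompt = THINKING_PROMPT.format(document=document)
    outputs = generator(prompt, max_new_tokens=max_new_tokens,
                        return_full_text=False)
    thinking = outputs[0]["generated_text"]
    # The augmented sample is trained on with the ordinary next-token
    # objective, increasing the volume of learnable high-quality tokens.
    return document + "\n\n" + thinking
```

In this sketch, the augmented sample is simply the original text followed by its thinking trajectory, and the model is trained on the concatenation with the usual language-modeling loss; no change to the training objective is assumed.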