Recently, there has been growing interest in collecting reasoning-intensive
pretraining data to improve LLMs’ complex reasoning ability. Prior approaches
typically rely on supervised classifiers to identify such data, which require
labels from humans or LLMs and often introduce domain-specific biases. Because
attention heads are crucial to in-context reasoning, we propose
AttentionInfluence, a simple yet effective, training-free method that requires
no supervision signal. Our approach enables a small pretrained language model to
act as a strong data selector through a simple attention head masking
operation. Specifically, we identify retrieval heads and compute the loss
difference between the base model and the model with these heads masked;
samples whose loss rises most under masking are treated as more
reasoning-intensive and selected. We apply AttentionInfluence to a
1.3B-parameter dense model to conduct data selection on the SmolLM corpus of
241B tokens, and mix the SmolLM corpus with the selected subset comprising 73B
tokens to pretrain a 7B-parameter dense model on 1T training tokens with a WSD
(warmup-stable-decay) learning rate schedule. Our experiments demonstrate substantial
improvements, ranging from 1.4pp to 3.5pp, across several knowledge-intensive
and reasoning-heavy benchmarks (i.e., MMLU, MMLU-Pro, AGIEval-en, GSM8K, and
HumanEval). These results exhibit an effective weak-to-strong scaling property,
in which small models improve the final performance of larger models, offering a
promising and scalable path for reasoning-centric data selection.
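
To make the scoring step concrete, below is a minimal sketch of the loss-difference computation. It assumes GPT-2 small as a stand-in for the small reference model and a hand-picked list of head indices; the paper's actual retrieval-head detection procedure and exact scoring details are omitted.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Assumptions (not from the paper): GPT-2 small stands in for the small
# reference model, and RETRIEVAL_HEADS is a hand-picked list of
# (layer, head) indices. The paper identifies retrieval heads automatically.
MODEL_NAME = "gpt2"
RETRIEVAL_HEADS = [(3, 0), (5, 7), (9, 2)]

model = GPT2LMHeadModel.from_pretrained(MODEL_NAME).eval()
tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)

def lm_loss(text, head_mask=None):
    """Mean next-token cross-entropy of `text`, optionally with heads masked."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids, head_mask=head_mask)
    return out.loss.item()

def attention_influence(text):
    """Relative loss increase when the chosen heads are zeroed out.

    A larger score means the sample leans more heavily on these heads,
    which serves as a proxy for reasoning intensity.
    """
    mask = torch.ones(model.config.n_layer, model.config.n_head)
    for layer, head in RETRIEVAL_HEADS:
        mask[layer, head] = 0.0  # nullify this head's contribution
    base = lm_loss(text)
    masked = lm_loss(text, head_mask=mask)
    return (masked - base) / base

# Rank candidate documents and keep the highest-scoring fraction.
corpus = ["Proof: assume n is even, so n = 2k for some integer k ...",
          "The weather was pleasant all afternoon."]
ranked = sorted(corpus, key=attention_influence, reverse=True)
```

In the paper's setting, this kind of score is computed with a 1.3B-parameter model over the full 241B-token corpus, and the top-scoring documents form the 73B-token subset mixed back into the pretraining data.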