When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models
by Sunny Sanyal and 3 other authors
Abstract: Large Language Models (LLMs) rely on the transformer architecture and its self-attention mechanism to deliver strong performance across tasks. However, we uncover a structural inefficiency in standard pre-trained decoder-style LLMs: in many of the deeper layers, attention matrices frequently collapse to near rank-one, single-column patterns. We refer to these underutilized components as lazy layers; they are redundant and computationally inefficient. To address this, we propose Inheritune, a simple and effective training recipe for building smaller, more efficient, and high-performing language models. Inheritune initializes a compact model by inheriting the useful early layers from a larger pre-trained model, then progressively retrains and expands it. Our experiments across multiple models and datasets show that Inheritune-trained models, despite having significantly fewer layers, can match or even outperform their larger counterparts. This approach yields compact, performant models and offers a practical path for efficient language model compression. Code is available at this https URL
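The initialization step of the recipe (inheriting the early layers of a larger pre-trained model into a shallower child model) can be illustrated with a minimal sketch. This sketch assumes a GPT-2-style decoder from Hugging Face transformers and an illustrative parameter num_inherited_layers; it is not the authors' released code, and the subsequent retraining and progressive expansion stages are omitted. See the linked repository for the actual implementation.

```python
# Minimal sketch of layer-inheritance initialization, assuming a GPT-2-style
# decoder from Hugging Face `transformers`. Illustrative only; the authors'
# full recipe (retraining and progressive expansion) is not shown here.
from transformers import GPT2Config, GPT2LMHeadModel


def inherit_early_layers(parent_name: str = "gpt2", num_inherited_layers: int = 6):
    """Initialize a smaller model with the first `num_inherited_layers`
    transformer blocks (plus embeddings) of a larger pre-trained parent."""
    parent = GPT2LMHeadModel.from_pretrained(parent_name)

    # Build a child config identical to the parent except for depth.
    child_config = GPT2Config.from_pretrained(parent_name)
    child_config.n_layer = num_inherited_layers
    child = GPT2LMHeadModel(child_config)

    # Copy token/position embeddings and the final layer norm.
    child.transformer.wte.load_state_dict(parent.transformer.wte.state_dict())
    child.transformer.wpe.load_state_dict(parent.transformer.wpe.state_dict())
    child.transformer.ln_f.load_state_dict(parent.transformer.ln_f.state_dict())

    # Inherit the first k transformer blocks; the deeper ("lazy") layers are dropped.
    for i in range(num_inherited_layers):
        child.transformer.h[i].load_state_dict(parent.transformer.h[i].state_dict())

    return child  # further retraining/expansion would follow per the recipe


if __name__ == "__main__":
    small_model = inherit_early_layers("gpt2", num_inherited_layers=6)
    print(small_model.config.n_layer)  # -> 6
```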
Submission history
From: Sunny Sanyal
[v1] Fri, 12 Apr 2024 17:53:34 UTC (107 KB)
[v2] Fri, 4 Oct 2024 05:14:48 UTC (1,652 KB)
[v3] Sun, 8 Jun 2025 09:19:32 UTC (411 KB)