Paper Page - A Systematic Analysis Of Hybrid Linear Attention

This study evaluates several linear attention models, both standalone and combined with full attention layers in Transformers, and identifies mechanisms such as selective gating and hierarchical recurrence as important for recall performance.

Transformers face quadratic complexity and memory issues with long sequences,
prompting the adoption of linear attention mechanisms using fixed-size hidden
states. However, linear models often suffer from limited recall performance,
leading to hybrid architectures that combine linear and full attention layers.
Despite extensive hybrid architecture research, the choice of linear attention
component has not been deeply explored. We systematically evaluate various
linear attention models across generations – vector recurrences to advanced
gating mechanisms – both standalone and hybridized. To enable this
comprehensive analysis, we trained and open-sourced 72 models: 36 at 340M
parameters (20B tokens) and 36 at 1.3B parameters (100B tokens), covering six
linear attention variants across five hybridization ratios. Benchmarking on
standard language modeling and recall tasks reveals that superior standalone
linear models do not necessarily excel in hybrids. While language modeling
remains stable across linear-to-full attention ratios, recall significantly
improves with increased full attention layers, particularly below a 3:1 ratio.
Our study highlights selective gating, hierarchical recurrence, and controlled
forgetting as critical for effective hybrid models. We recommend architectures
such as HGRN-2 or GatedDeltaNet with a linear-to-full ratio between 3:1 and 6:1
to achieve Transformer-level recall efficiently. Our models are open-sourced at
https://huggingface.co/collections/m-a-p/hybrid-linear-attention-research-686c488a63d609d2f20e2b1e.
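To make the linear-to-full ratio concrete, here is a minimal sketch of how a layer stack could be laid out for a given ratio. The abstract does not say where the full attention layers sit in the stack, so the evenly interleaved placement, the function name `hybrid_layer_pattern`, and the 24-layer depth are all assumptions made for illustration, not the authors' configuration.

```python
def hybrid_layer_pattern(num_layers: int, linear_per_full: int) -> list[str]:
    """Lay out layer types for a linear-to-full ratio of `linear_per_full`:1.

    Assumption: full attention layers are interleaved evenly, one after
    every `linear_per_full` linear attention layers. The abstract does not
    specify the placement actually used in the paper.
    """
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    if linear_per_full < 0:
        raise ValueError("linear_per_full must be non-negative")
    period = linear_per_full + 1  # one full layer closes each group
    return [
        "full" if (i + 1) % period == 0 else "linear"
        for i in range(num_layers)
    ]


# Hypothetical 24-layer model at a 3:1 linear-to-full ratio.
pattern = hybrid_layer_pattern(24, 3)
num_full = pattern.count("full")
num_linear = pattern.count("linear")
print(pattern[:4], num_linear, num_full)
```

With this placement the ratio is exact only when the layer count is divisible by `linear_per_full + 1`; otherwise the trailing layers are all linear and the effective ratio is slightly higher than requested.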
