Generalized Neighborhood Attention: Multi-dimensional Sparse Attention At The Speed Of Light

arXiv:2504.16922v1
Abstract: Many sparse attention mechanisms, such as Neighborhood Attention, have typically failed to consistently deliver speedup over the self-attention baseline. This is largely due to the level of complexity in attention infrastructure and the rapid evolution of AI hardware architecture. At the same time, many state-of-the-art foundational models, particularly in computer vision, are heavily bound by attention and need reliable sparsity to escape its O(n^2) complexity. In this paper, we study a class of promising sparse attention mechanisms that focus on locality, and aim to develop a better analytical model of their performance improvements. We first introduce Generalized Neighborhood Attention (GNA), which can describe sliding window, strided sliding window, and blocked attention. We then consider possible design choices in implementing these approaches, and create a simulator that can provide much more realistic speedup upper bounds for any given setting. Finally, we implement GNA on top of a state-of-the-art fused multi-headed attention (FMHA) kernel designed for the NVIDIA Blackwell architecture in CUTLASS. Our implementation can fully realize the maximum speedup theoretically possible in many perfectly block-sparse cases, and achieves an effective utilization of 1.3 petaFLOPs/second in FP16. In addition, we plug various GNA configurations into off-the-shelf generative models, such as Cosmos-7B, HunyuanVideo, and FLUX, and show that GNA can deliver 28% to 46% end-to-end speedup on B200 without any fine-tuning. We will open source our simulator and Blackwell kernels directly through the NATTEN project.
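To make the three locality patterns named in the abstract concrete, here is a minimal 1-D sketch in plain Python. It is an illustration under stated assumptions, not the paper's definition or the NATTEN API: the function name `gna_mask_1d`, the choice of a "leader" query at the middle of each stride group, and the clamping of windows at the sequence edges are all assumptions made for this example. Under those assumptions, a stride of 1 gives an ordinary sliding window, a stride between 1 and the window size gives a strided sliding window, and a stride equal to the window size gives blocked attention.

```python
def gna_mask_1d(n, window, stride=1):
    """Return an n x n boolean mask; mask[q][k] is True when query q may attend to key k.

    Hypothetical helper for illustration only (not part of NATTEN).
    Assumption: queries are grouped into runs of `stride` tokens, and every
    query in a group shares the window centered on the group's middle token.
    """
    if not (1 <= window <= n) or not (1 <= stride <= window):
        raise ValueError("need 1 <= stride <= window <= n")
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        group = q // stride
        # Assumed leader: the middle token of the stride group.
        leader = min(group * stride + stride // 2, n - 1)
        # Clamp the window so every query sees exactly `window` keys.
        start = max(0, min(leader - window // 2, n - window))
        for k in range(start, start + window):
            mask[q][k] = True
    return mask


def density(mask):
    """Fraction of query-key pairs that are kept by the mask."""
    n = len(mask)
    return sum(sum(row) for row in mask) / (n * n)


sliding = gna_mask_1d(16, window=4, stride=1)  # sliding window
strided = gna_mask_1d(16, window=4, stride=2)  # strided sliding window
blocked = gna_mask_1d(16, window=4, stride=4)  # blocked attention
```

In this sketch every query attends to exactly `window` keys, so the mask density is `window / n` for all three settings. That ratio only bounds the reduction in attention FLOPs; as the abstract notes, the speedup a real kernel achieves depends on implementation choices and on how well the pattern maps to block-sparse tiles, which this toy mask does not model.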
