Efficient Image Generation with Variadic Attention Heads
Steven Walton and 4 other authors
Abstract: While the integration of transformers into vision models has yielded significant improvements on vision tasks, these models still require substantial computation for both training and inference. Restricted attention mechanisms greatly reduce this computational burden, but at the cost of losing either global or local coherence. We propose a simple yet powerful method to reduce this trade-off: allow the attention heads of a single transformer to attend to multiple receptive fields.
We demonstrate our method using Neighborhood Attention (NA) and integrate it into a StyleGAN-based architecture for image generation. With this work, dubbed StyleNAT, we achieve an FID of 2.05 on FFHQ, a 6% improvement over StyleGAN-XL, while using 28% fewer parameters and delivering 4$\times$ the throughput. StyleNAT sits on the Pareto frontier for FFHQ-256 and demonstrates powerful, efficient image generation on other datasets. Our code and model checkpoints are publicly available at: this https URL
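The core idea, heads of one attention layer attending over different receptive fields, can be illustrated in a few lines. Below is a minimal PyTorch sketch assuming the natten library's NeighborhoodAttention2D module (taking dim, num_heads, kernel_size, and dilation, with channels-last inputs); the VariadicNeighborhoodAttention class, its even head grouping, and the chosen dilations are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of "variadic" attention heads: head groups that
# attend with different Neighborhood Attention receptive fields.
# Assumes natten's NeighborhoodAttention2D(dim, num_heads, kernel_size,
# dilation); the class name and grouping scheme are illustrative only.
import torch
import torch.nn as nn
from natten import NeighborhoodAttention2D


class VariadicNeighborhoodAttention(nn.Module):
    """Splits channels into head groups, each with its own dilation."""

    def __init__(self, dim, num_heads, kernel_size=7, dilations=(1, 2)):
        super().__init__()
        assert dim % len(dilations) == 0
        assert num_heads % len(dilations) == 0
        group_dim = dim // len(dilations)
        group_heads = num_heads // len(dilations)
        # One NA branch per receptive field: same kernel, growing
        # dilation, so some heads stay local while others attend to a
        # wider (sparser) neighborhood.
        self.branches = nn.ModuleList(
            NeighborhoodAttention2D(
                dim=group_dim,
                num_heads=group_heads,
                kernel_size=kernel_size,
                dilation=d,
            )
            for d in dilations
        )

    def forward(self, x):
        # x: (B, H, W, C), channels-last as natten expects.
        chunks = x.chunk(len(self.branches), dim=-1)
        return torch.cat(
            [branch(c) for branch, c in zip(self.branches, chunks)],
            dim=-1,
        )


if __name__ == "__main__":
    attn = VariadicNeighborhoodAttention(dim=128, num_heads=8)
    out = attn(torch.randn(2, 32, 32, 128))
    print(out.shape)  # torch.Size([2, 32, 32, 128])
```

Because every branch still computes attention over a restricted window, the cost stays far below global self-attention, while mixing dilations lets some heads capture local detail and others longer-range structure.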
Submission history
From: Steven Walton
[v1] Thu, 10 Nov 2022 18:55:48 UTC (20,378 KB)
[v2] Sun, 13 Aug 2023 00:03:25 UTC (56,184 KB)
[v3] Thu, 26 Jun 2025 05:07:48 UTC (24,956 KB)