Linear Transformers Are Secretly Fast Weight Memory Systems (Machine Learning Paper Explained)

#fastweights #deeplearning #transformers

Transformers dominate deep learning, but their memory and compute requirements grow quadratically with sequence length, which makes them expensive to train and hard to use on long inputs. Many papers have tried to linearize the core module, the attention mechanism, using kernels; the Performer is one example. However, these methods either perform unsatisfactorily or carry other downsides, such as a reliance on random features. This paper establishes an intrinsic connection between linearized (kernel) attention and the much older Fast Weight Memory Systems, in part popularized by Jürgen Schmidhuber in the 90s. It shows the fundamental limitations of these algorithms and suggests new update rules and new kernels to fix them. The resulting model compares favorably to Performers on key synthetic experiments and real-world tasks.
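To make the connection concrete, here is a minimal sketch (not the authors' code) of why causal linearized attention can be read as a fast weight memory. It assumes the feature map elu(x) + 1, leaves out the attention normalization term for brevity, and uses plain Python lists; the function names are made up for illustration. Both functions compute the same outputs: one sums over all past key/value pairs, the other keeps a single weight matrix that is updated by an outer product at every step.

```python
import math


def phi(x):
    # Assumed feature map: elu(x) + 1, applied elementwise (always positive).
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def attention_form(keys, values, queries):
    # y_t = sum over i <= t of v_i * <phi(k_i), phi(q_t)>  (unnormalized)
    outputs = []
    for t, q in enumerate(queries):
        fq = phi(q)
        out = [0.0] * len(values[0])
        for i in range(t + 1):
            score = dot(phi(keys[i]), fq)
            out = [o + score * v for o, v in zip(out, values[i])]
        outputs.append(out)
    return outputs


def fast_weight_form(keys, values, queries):
    # W_t = W_{t-1} + v_t (outer product) phi(k_t);  y_t = W_t phi(q_t)
    d_v, d_k = len(values[0]), len(keys[0])
    W = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for k, v, q in zip(keys, values, queries):
        fk = phi(k)
        for r in range(d_v):
            for c in range(d_k):
                W[r][c] += v[r] * fk[c]
        fq = phi(q)
        outputs.append([dot(row, fq) for row in W])
    return outputs


# Toy data, chosen arbitrarily for the demonstration.
keys = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
values = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0], [2.5, 0.5, -2.0]]
queries = [[0.1, 0.9], [-0.7, 0.3], [1.2, -0.4]]

slow = attention_form(keys, values, queries)
fast = fast_weight_form(keys, values, queries)
same = all(
    abs(a - b) < 1e-9 for ra, rb in zip(slow, fast) for a, b in zip(ra, rb)
)
print("outputs match:", same)
```

The second form never looks back at earlier tokens. Everything it knows lives in the fixed-size matrix W, which is why the cost per step stays constant and why the memory has a limited capacity.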

OUTLINE:
0:00 – Intro & Overview
1:40 – Fast Weight Systems
7:00 – Distributed Storage of Symbolic Values
12:30 – Autoregressive Attention Mechanisms
18:50 – Connecting Fast Weights to Attention Mechanism
22:00 – Softmax as a Kernel Method (Performer)
25:45 – Linear Attention as Fast Weights
27:50 – Capacity Limitations of Linear Attention
29:45 – Synthetic Data Experimental Setup
31:50 – Improving the Update Rule
37:30 – Deterministic Parameter-Free Projection (DPFP) Kernel
46:15 – Experimental Results
50:50 – Conclusion & Comments

Paper:
Code:
Machine Learning Street Talk on Kernels:

Abstract:
We show the formal equivalence of linearised self-attention mechanisms and fast weight memories from the early ’90s. From this observation we infer a memory capacity limitation of recent linearised softmax attention variants. With finite memory, a desirable behaviour of fast weight memory models is to manipulate the contents of memory and dynamically interact with it. Inspired by previous work on fast weights, we propose to replace the update rule with an alternative rule yielding such behaviour. We also propose a new kernel function to linearise attention, balancing simplicity and effectiveness. We conduct experiments on synthetic retrieval problems as well as standard machine translation and language modelling tasks which demonstrate the benefits of our methods.
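The alternative update rule mentioned in the abstract can be illustrated with a small sketch. This is a simplified reading under stated assumptions, not the paper's implementation: the key is used directly (identity feature map) and has unit norm, and the write strength `beta` is a fixed number here, whereas in the paper it is produced by the network. The helper names are hypothetical. The purely additive rule piles a new value on top of whatever is already stored under a key, while the delta-style rule first retrieves the current value and then writes only the correction.

```python
def retrieve(W, key):
    # Read from memory: W @ key.
    return [sum(w * k for w, k in zip(row, key)) for row in W]


def write_sum(W, key, value):
    # Purely additive update: W += value (outer product) key.
    for r, v in enumerate(value):
        for c, k in enumerate(key):
            W[r][c] += v * k


def write_delta(W, key, value, beta=1.0):
    # Delta-style update: W += beta * (value - W @ key) (outer product) key.
    old = retrieve(W, key)
    for r, v in enumerate(value):
        for c, k in enumerate(key):
            W[r][c] += beta * (v - old[r]) * k


def empty_memory(d_v, d_k):
    return [[0.0] * d_k for _ in range(d_v)]


key = [1.0, 0.0]  # unit-norm key, reused for both writes

W_sum = empty_memory(2, 2)
write_sum(W_sum, key, [1.0, 2.0])
write_sum(W_sum, key, [5.0, 7.0])
print(retrieve(W_sum, key))    # [6.0, 9.0]: old and new values are mixed

W_delta = empty_memory(2, 2)
write_delta(W_delta, key, [1.0, 2.0])
write_delta(W_delta, key, [5.0, 7.0])
print(retrieve(W_delta, key))  # [5.0, 7.0]: the old value was replaced
```

With `beta` between 0 and 1 the stored value moves only part of the way toward the new one, which is what lets a model with finite memory edit its contents instead of only accumulating them.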

Authors: Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber

If you want to support me, the best thing to do is to share out the content 🙂

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar:
Patreon:
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
