Existing Sparse Autoencoder (SAE) training algorithms often lack rigorous mathematical guarantees for feature recovery. Empirically, methods such as L1 regularization and TopK activation are sensitive to hyperparameter tuning and can learn inconsistent features across training runs. Our work addresses these theoretical and practical issues with the following contributions:
📊 A novel statistical framework that rigorously formalizes feature recovery by modeling polysemantic features as sparse combinations of underlying monosemantic concepts (see the sketch after this list), and establishes a precise notion of feature identifiability.
🛠️ An innovative SAE training algorithm, Group Bias Adaptation (GBA), which adaptively adjusts the network's bias parameters to control activation sparsity, allowing distinct groups of neurons to target different activation frequencies (a minimal code sketch follows the list).
🧮 The first theoretical guarantee that an SAE training algorithm provably recovers all monosemantic features when the input data is sampled from our proposed statistical model.
🚀 Superior empirical performance on LLMs with up to 1.5B parameters, where GBA achieves the best sparsity-loss trade-off while learning more consistent features than benchmark methods.
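For intuition, here is a minimal sketch of the kind of generative model such a framework formalizes, assuming a linear sparse-combination form; the symbols below are illustrative notation rather than the paper's exact definitions: $x \in \mathbb{R}^d$ is an observed (polysemantic) activation, the $m_j$ are unknown monosemantic feature directions, the $z_j$ are sparse coefficients, and $\epsilon$ is noise:

$$
x \;=\; \sum_{j=1}^{K} z_j\, m_j \;+\; \epsilon,
\qquad \|z\|_0 \le s \ll K.
$$

Under a model of this shape, feature recovery means learning decoder directions that match the true $m_j$ up to permutation and rescaling, which is roughly the sense in which identifiability can be made precise.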
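The bias-adaptation idea behind GBA can likewise be pictured with a short sketch. The code below is a hypothetical, minimal version assuming a ReLU encoder and a simple proportional update rule; the function name, the update rule, and all parameter names (`group_ids`, `target_freqs`, etc.) are assumptions for illustration, not the paper's implementation:

```python
import torch

def group_bias_adaptation(pre_acts, bias, group_ids, target_freqs, lr=0.01):
    """Nudge each neuron's bias so its activation frequency moves toward
    its group's target frequency (illustrative sketch, not the paper's
    exact update rule).

    pre_acts:     (batch, n_neurons) encoder pre-activations, before bias
    bias:         (n_neurons,) bias parameters, updated in place
    group_ids:    (n_neurons,) long tensor assigning each neuron to a group
    target_freqs: (n_groups,) desired activation frequency per group
    """
    # Fraction of the batch on which each neuron fires under ReLU(pre + bias).
    freqs = ((pre_acts + bias) > 0).float().mean(dim=0)
    # Each neuron inherits the target frequency of its group.
    targets = target_freqs[group_ids]
    # Fires too often -> lower the bias; fires too rarely -> raise it.
    bias -= lr * (freqs - targets)
    return bias
```

In a full training loop, an update like this would run periodically over a buffer of pre-activations, with lower target frequencies assigned to some groups so that those neurons can specialize to rarer concepts.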