Start Over

Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning

Authors :: Olmo, Jeffrey
Wilson, Jared
Forsey, Max
Hepner, Bryce
Howe, Thomas Vin
Wingate, David
Publication Year :: 2024
Abstract: Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations by learning a sparse and overcomplete decomposition of the network's internal activations. However, SAEs are traditionally trained considering only activation values and not the effect those activations have on downstream computations. This limits the information available to learn features, and biases the autoencoder towards neglecting features which are represented with small activation values but strongly influence model outputs. To address this, we introduce Gradient SAEs (g-SAEs), which modify the $k$-sparse autoencoder architecture by augmenting the TopK activation function to rely on the gradients of the input activation when selecting the $k$ elements. For a given sparsity level, g-SAEs produce reconstructions that are more faithful to original network performance when propagated through the network. Additionally, we find evidence that g-SAEs learn latents that are on average more effective at steering models in arbitrary contexts. By considering the downstream effects of activations, our approach leverages the dual nature of neural network features as both $\textit{representations}$, retrospectively, and $\textit{actions}$, prospectively. While previous methods have approached the problem of feature discovery primarily focused on the former aspect, g-SAEs represent a step towards accounting for the latter as well.<br />Comment: 9 pages, 8 figures. Submitted to NAACL 2025

Subjects :: Computer Science - Machine Learning
Computer Science - Artificial Intelligence
Computer Science - Computation and Language

Details

Database :: arXiv
Publication Type :: Report
Accession number :: edsarx.2411.10397
Document Type :: Working Paper

Tools

Email
Cite

Printer

Authors Abstract Subjects Details

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning

Abstract

Subjects

Details

Tools

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning

Abstract

Subjects

Details

Tools

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources