This repository tracks the latest research on sparse autoencoders (SAEs), with a focus on their use in mechanistic interpretability. The goal is to offer a comprehensive list of relevant papers and resources.
Note
If you believe your paper, blog post, or other resource on sparse autoencoders is missing, or if you find a mistake, typo, or outdated information, please open an issue or submit a pull request. I will be happy to update the list.
- Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
- Author(s): Maheep Chaudhary, Atticus Geiger
- Date: 2024-09
- Venue: -
- Code: -
- Residual Stream Analysis with Multi-Layer SAEs
- Author(s): Tim Lawson, Lucy Farnik, Conor Houghton, Laurence Aitchison
- Date: 2024-09
- Venue: -
- Code: -
- Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
- Author(s): Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, Neel Nanda
- Date: 2024-08
- Venue: -
- Code: -
- Disentangling Dense Embeddings with Sparse Autoencoders
- Author(s): Charles O'Neill, Christine Ye, Kartheik Iyer, John F. Wu
- Date: 2024-08
- Venue: -
- Code: -
- Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
- Author(s): Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Smith, Claudio Mayrink Verdun, David Bau, Samuel Marks
- Date: 2024-08
- Venue: -
- Code: -
- Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery
- Author(s): Sukrut Rao, Sweta Mahajan, Moritz Böhle, Bernt Schiele
- Date: 2024-07
- Venue: -
- Code: -
- Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
- Author(s): Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, Neel Nanda
- Date: 2024-07
- Venue: -
- Code: -
- Interpreting Attention Layer Outputs with Sparse Autoencoders
- Author(s): Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda
- Date: 2024-06
- Venue: -
- Code: -
- Transcoders Find Interpretable LLM Feature Circuits
- Author(s): Jacob Dunefsky, Philippe Chlenski, Neel Nanda
- Date: 2024-06
- Venue: -
- Code: -
- Scaling and evaluating sparse autoencoders
- Author(s): Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu
- Date: 2024-06
- Venue: -
- Code: openai/sparse_autoencoder, EleutherAI/sae
- Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents
- Author(s): Yoann Poupart
- Date: 2024-06
- Venue: -
- Code: -
- The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision
- Author(s): Liv Gorton
- Date: 2024-06
- Venue: -
- Code: -
- Not All Language Model Features Are Linear
- Author(s): Joshua Engels, Isaac Liao, Eric J. Michaud, Wes Gurnee, Max Tegmark
- Date: 2024-05
- Venue: -
- Code: -
- Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models
- Author(s): Charles O'Neill, Thang Bui
- Date: 2024-05
- Venue: -
- Code: -
- Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
- Author(s): Aleksandar Makelov, George Lange, Neel Nanda
- Date: 2024-05
- Venue: -
- Code: -
- Improving Dictionary Learning with Gated Sparse Autoencoders
- Author(s): Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda
- Date: 2024-04
- Venue: -
- Code: -
- Sparse Autoencoders Find Highly Interpretable Features in Language Models
- Author(s): Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey
- Date: 2023-09
- Venue: -
- Code: -
- Extracting SAE task features for in-context learning
- Author(s): Dmitrii Kharlapenko, neverix, Neel Nanda, Arthur Conmy
- Date: 2024-08-13
- Self-explaining SAE features
- Author(s): Dmitrii Kharlapenko, neverix, Neel Nanda, Arthur Conmy
- Date: 2024-08-06
- A primer on sparse autoencoders
- Author(s): Nick Jiang
- Date: 2024-07-03
- An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability
- Author(s): Adam Karvonen
- Date: 2024-06-11
- Finding Sparse Linear Connections between Features in LLMs
- Author(s): Logan Riggs Smith, Sam Mitchell, Adam Kaufman
- Date: 2023-12-09
- Sparse Autoencoders: Future Work
- Author(s): Logan Riggs Smith, Aidan Ewart
- Date: 2023-09-21
- Sparse Autoencoders Find Highly Interpretable Directions in Language Models
- Author(s): Logan Riggs, Hoagy, Aidan Ewart, Robert_AIZI
- Date: 2023-09-21