Is this the subspace you are looking for? An Interpretability Illusion for Subspace Activation Patching
This repository contains code to reproduce results from the paper "An Interpretability Illusion for Subspace Activation Patching".
Description of relevant files:
data_utils.py
: tools for working with the IOI datasetmodel_utils.py
: tools to intervene on transformerlens models and train DAS subspacesioi_interventions.ipynb
: notebook to train DAS and related interventionsioi_analysis.ipynb
: notebook to analyze IOI interventions (using already trained/computed subspaces saved as files in this repository)
Description of relevant files:
fact_utils.py
: tools to download necessary datasetsfact_patching.ipynb
: code for fact patching experiments in sections 6.1. and 6.4. of the paperfact_patching_plots.ipynb
: notebook to recreate factual recall plots for sections 6.1 and 6.4. of the paperfact_editing.ipynb
: notebook to run fact editing experiments from section 6.3. of the paperfact_editing_plots.ipynb
: notebook to recreate ROME-to-subspace-intervention plots from section 6.3
theory_experiments.ipynb
: experiments for singular values of MLP weights and evaluating distortion introduced by GELU nonlinearity