Is this the subspace you are looking for? An Interpretability Illusion for Subspace Activation Patching

This repository contains code to reproduce results from the paper "An Interpretability Illusion for Subspace Activation Patching".

Indirect Object Identification (IOI) task (sections 4 and 5)

Description of relevant files:

data_utils.py: tools for working with the IOI dataset
model_utils.py: tools to intervene on TransformerLens models and train DAS subspaces
ioi_interventions.ipynb: notebook to train DAS and related interventions
ioi_analysis.ipynb: notebook to analyze IOI interventions (using already trained/computed subspaces saved as files in this repository)
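
For readers new to the technique, the core operation behind a one-dimensional subspace patch can be sketched as follows. This is a minimal illustration in plain Python, not the code in model_utils.py: the function names `dot` and `patch_along_direction` and the toy vectors are assumptions made up for this example. It replaces the component of a base activation along a chosen direction with the corresponding component of a source activation, and leaves the orthogonal complement untouched.

```python
from math import sqrt


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def patch_along_direction(base, source, direction):
    """Return `base` with its component along `direction` swapped for that of `source`.

    Hypothetical helper for illustration only; the repository's own
    interventions operate on TransformerLens activations instead.
    """
    norm = sqrt(dot(direction, direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Difference between the source and base projections onto the unit direction.
    delta = dot(source, unit) - dot(base, unit)
    # Shift the base activation along the direction by that difference.
    return [b + delta * u for b, u in zip(base, unit)]


# Toy activations (made-up numbers, not taken from any model).
base = [1.0, 2.0, 3.0]
source = [4.0, -1.0, 0.5]
direction = [0.0, 3.0, 0.0]

patched = patch_along_direction(base, source, direction)
print(patched)
```

After the patch, the projection of `patched` onto the direction matches that of `source`, while the coordinates orthogonal to the direction are those of `base`. DAS, as used in the notebooks, learns the direction (or a higher-dimensional subspace) rather than fixing it by hand.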

Factual recall (section 6)

Description of relevant files:

fact_utils.py: tools to download necessary datasets
fact_patching.ipynb: notebook to run the fact patching experiments from sections 6.1 and 6.4 of the paper
fact_patching_plots.ipynb: notebook to recreate the factual recall plots from sections 6.1 and 6.4 of the paper
fact_editing.ipynb: notebook to run the fact editing experiments from section 6.3 of the paper
fact_editing_plots.ipynb: notebook to recreate ROME-to-subspace-intervention plots from section 6.3

theory_experiments.ipynb: experiments on the singular values of MLP weights and on the distortion introduced by the GELU nonlinearity
