This repository provides detailed explainability analyses of Transformer models trained on algorithmic tasks. Each explanation maps model activations to a simplified causal graph that recovers over 90% of the original model's loss, which makes these analyses valuable benchmarks for evaluating interpretability techniques.
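For reference, "loss recovered" is typically measured as the fraction of the gap between a fully-ablated baseline and the original model that the explanation-constrained model closes. A minimal sketch of that metric is below; the exact definition used in this repo may differ, and the variable names are illustrative:

```python
def loss_recovered(loss_model: float, loss_scrubbed: float, loss_ablated: float) -> float:
    """Fraction of the baseline-to-model loss gap closed by the explanation.

    loss_model:    loss of the unmodified model
    loss_scrubbed: loss after resampling activations as the explanation permits
    loss_ablated:  loss with all relevant activations randomly resampled
    """
    return (loss_ablated - loss_scrubbed) / (loss_ablated - loss_model)

# e.g. loss_recovered(0.10, 0.15, 1.00) -> ~0.944, i.e. over 90% recovered
```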
The repository assumes basic familiarity with interpretability methods such as causal scrubbing and circuit-style analysis of Transformer models. Key features:
- Explanations formatted as simplified causal models
- Resampling tests quantify explanation accuracy (see the sketch after this list)
- Matches or exceeds the accuracy of previous causal scrubbing analyses on algorithmic tasks
- Ideal for benchmarking new interpretability techniques
- Notebooks and scripts for training, evaluation, and analysis
- Modular codebase for extending to new models and tasks
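For concreteness, here is a minimal, self-contained sketch of the kind of resampling test causal scrubbing performs, using a toy numpy "model" rather than the repo's actual API. Everything here (the toy model, the boolean hypothesis mask, the function names) is illustrative, not the repository's interface:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_model(x, patched=None):
    """Toy two-stage 'model': h = relu(x), y = sum(h).
    `patched` optionally overrides the intermediate activation h,
    mimicking an activation-patching hook on a real network."""
    h = np.maximum(x, 0.0) if patched is None else patched
    return h.sum(axis=-1), h

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

# Toy dataset: the task is to predict sum(relu(x)).
X = rng.normal(size=(256, 8))
y_true = np.maximum(X, 0.0).sum(axis=-1)

# Hypothesis: only the first 4 coordinates of h matter (deliberately
# incomplete, so the test should report well under 100% loss recovered).
important = np.zeros(8, dtype=bool)
important[:4] = True

def scrubbed_loss(X, y_true, important, rng):
    """Resample activations the hypothesis calls unimportant, taking them
    from the model's activations on shuffled (resampled) inputs."""
    _, h = toy_model(X)
    _, h_other = toy_model(X[rng.permutation(len(X))])
    h_scrub = np.where(important, h, h_other)
    pred, _ = toy_model(X, patched=h_scrub)
    return mse(pred, y_true)

loss_model = mse(toy_model(X)[0], y_true)
loss_scrub = scrubbed_loss(X, y_true, important, rng)
# Scrubbing every coordinate gives the fully-ablated baseline.
loss_ablate = scrubbed_loss(X, y_true, np.zeros(8, dtype=bool), rng)
print(f"recovered: {(loss_ablate - loss_scrub) / (loss_ablate - loss_model):.2%}")
```

In the repository itself, the same idea is applied to Transformer activations: the simplified causal graph determines which activations may be resampled together, and the resulting loss is compared against the original model's.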
By open-sourcing detailed analyses tied directly to model performance, this repository aims to advance interpretability research. Contributions are welcome!