This repository contains the code for all experiments in the paper "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small" (Wang et al., 2022).
This is intended as a one-time code drop. The authors recommend that those interested in mechanistic interpretability use the TransformerLens library.
Specifically, this TransformerLens demo goes through a number of experiments from the Interpretability in the Wild paper, and also introduces other features of that library, which are helpful for building off of our research.
To propose changes, contact arthurconmy@gmail.com or comment on this PR (sadly, issues don't work for forks).
See and run the experiments on Google Colab.
pip install git+https://github.com/redwoodresearch/Easy-Transformer.git
git clone https://github.com/redwoodresearch/Easy-Transformer/
pip install -r requirements.txt
In this repo, you can find the following notebooks (some are in easy_transformer/):
experiments.py: a notebook of several of the most interesting experiments of the IOI project.
completeness.py: a notebook that generates the completeness plots in the paper and implements the completeness functions.
minimality.py: as above, for minimality.
advex.py: a notebook that generates adversarial examples as in the paper.
The Easy Transformer library (later renamed "TransformerLens") supports importing open-source models, convenient handling of hooks for accessing intermediate activations, and features for running simple experiments such as ablations and patching.
A demo notebook can be found here, with links to other tutorials and demos too.
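To make the ideas of hooks, ablation, and patching mentioned above concrete, here is a minimal pure-Python sketch. It is a toy illustration only and does not use or reproduce this library's API: `ToyModel`, `run_with_cache`, `zero_ablate`, and `patch_from` are hypothetical names invented for this example, and the "layers" are simple functions on lists of floats rather than transformer components.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]
# A hook receives an activation and the layer name; it may return a replacement.
Hook = Callable[[Vector, str], Optional[Vector]]


class ToyModel:
    """Hypothetical stand-in for a model: a stack of named layers with hook points."""

    def __init__(self) -> None:
        self.layers: List[Tuple[str, Callable[[Vector], Vector]]] = [
            ("double", lambda x: [2.0 * v for v in x]),
            ("shift", lambda x: [v + 1.0 for v in x]),
            ("sum", lambda x: [sum(x)]),
        ]

    def run(self, x: Vector, hooks: Optional[Dict[str, Hook]] = None) -> Vector:
        hooks = hooks or {}
        for name, fn in self.layers:
            x = fn(x)
            hook = hooks.get(name)
            if hook is not None:
                replacement = hook(x, name)
                if replacement is not None:  # the hook chose to edit the activation
                    x = replacement
        return x


def run_with_cache(model: ToyModel, x: Vector) -> Tuple[Vector, Dict[str, Vector]]:
    """Run the model and record every intermediate activation by layer name."""
    cache: Dict[str, Vector] = {}

    def save(act: Vector, name: str) -> None:
        cache[name] = list(act)

    out = model.run(x, {name: save for name, _ in model.layers})
    return out, cache


def zero_ablate(index: int) -> Hook:
    """Hook that sets one coordinate of the activation to zero."""
    def hook(act: Vector, name: str) -> Vector:
        edited = list(act)
        edited[index] = 0.0
        return edited
    return hook


def patch_from(cache: Dict[str, Vector]) -> Hook:
    """Hook that overwrites the activation with the one cached from another run."""
    def hook(act: Vector, name: str) -> Vector:
        return list(cache[name])
    return hook


model = ToyModel()
clean_out, clean_cache = run_with_cache(model, [1.0, 2.0])

# Ablation: zero the first coordinate after the "double" layer.
ablated_out = model.run([1.0, 2.0], {"double": zero_ablate(0)})

# Patching: run a different input, but restore the clean "shift" activation.
patched_out = model.run([5.0, 0.0], {"shift": patch_from(clean_cache)})

print(clean_out, ablated_out, patched_out)
```

In this toy, ablating a coordinate changes the final output, while patching the clean activation into the second run restores the clean output. That is the basic logic behind the ablation and patching experiments, although the real experiments operate on attention heads and residual-stream activations through the library's own hook points.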