This repository reproduces results from Anthropic's sparse dictionary learning paper, Towards Monosemanticity. The codebase is quite rough, but the results are excellent. See the feature interface to browse the features learned by the sparse autoencoder. There are improvements to be made (see the TODOs section below), which I will work on intermittently as I juggle things in life :)
I trained a 1-layer transformer model from scratch using nanoGPT and then trained a sparse autoencoder on its MLP activations. The autoencoder learned a feature each for German, Japanese, and many other languages, as well as many other interesting features:
- A feature for German
- A feature for Scandinavian languages
- A feature for Japanese
- A feature for Hebrew
- A feature for Cyrillic vowels
- A feature for the token "at" in words like "Croatian", "Scat", "Hayat", etc.
- A single token feature for "much"
- A feature for sports leagues: NHL, NBA, etc.
- A feature for Gregorian calendar dates
- A feature for "when": - this feature particularly stands out because of the size of the mode around large activation values.
- A feature for "&"
- A feature for ")"
- A feature for "v" in URLs like "com/watch?v=SiN8
- A feature for programming code
- A feature for Donald Trump
- A feature for LaTeX
I used the "OpenWebText" dataset to train the transformer model, to generate the MLP activations dataset for the autoencoder, and to generate the feature interface visualizations. The transformer model had
I collected a dataset of 4B MLP activations by performing a forward pass on 20M prompts (each of length 1024) and keeping 200 activation vectors from each prompt. Next, I trained the autoencoder for approximately
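To make the collection step concrete, below is a minimal sketch of how MLP activations can be captured with a PyTorch forward hook. It assumes a nanoGPT-style model whose single block exposes its MLP nonlinearity as `model.transformer.h[0].mlp.gelu`; the module path, function name, and sampling details are illustrative assumptions, not the repo's exact code.

```python
import torch


@torch.no_grad()
def sample_mlp_activations(model, tokens: torch.Tensor, n_per_prompt: int = 200) -> torch.Tensor:
    """Run a forward pass on a batch of prompts and keep a random subset of
    MLP activation vectors from each prompt.

    tokens: (batch, seq_len) tensor of token ids.
    Returns a tensor of shape (batch * n_per_prompt, d_mlp).
    """
    model.eval()  # disable dropout so the captured activations are deterministic
    captured = []

    def hook(_module, _inputs, output):
        # output of the MLP nonlinearity: (batch, seq_len, d_mlp)
        captured.append(output)

    handle = model.transformer.h[0].mlp.gelu.register_forward_hook(hook)
    try:
        model(tokens)
    finally:
        handle.remove()

    acts = captured[0]
    batch, seq_len, d_mlp = acts.shape
    # Keep n_per_prompt token positions per prompt, sampled without replacement.
    idx = torch.stack(
        [torch.randperm(seq_len, device=acts.device)[:n_per_prompt] for _ in range(batch)]
    )
    sampled = torch.gather(acts, 1, idx.unsqueeze(-1).expand(-1, -1, d_mlp))
    return sampled.reshape(-1, d_mlp)
```

Repeating this over batches of prompts and concatenating the results yields the activation dataset that the autoencoder is trained on.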
For the most part, I followed the training procedure described in the appendix of Anthropic's original paper. I did not follow the improvements they suggested in their January and February updates.
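For reference, a compact sketch of that setup (a one-hidden-layer autoencoder with a tied pre-encoder bias, a ReLU encoder, an L1 penalty on the feature activations, and decoder columns kept at unit norm) might look as follows. Class names, hyperparameters, and the optimizer choice are placeholders, not this repo's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    def __init__(self, d_mlp: int, n_features: int):
        super().__init__()
        self.enc = nn.Linear(d_mlp, n_features)
        self.dec = nn.Linear(n_features, d_mlp)

    def forward(self, x: torch.Tensor):
        # Subtract the decoder bias before encoding (the "pre-encoder bias").
        f = F.relu(self.enc(x - self.dec.bias))  # feature activations
        x_hat = self.dec(f)                      # reconstruction of the MLP activation
        return x_hat, f

    @torch.no_grad()
    def normalize_decoder(self):
        # Keep each dictionary element (decoder column) at unit L2 norm.
        self.dec.weight.data /= self.dec.weight.data.norm(dim=0, keepdim=True)


def train_step(sae: SparseAutoencoder, opt: torch.optim.Optimizer,
               batch: torch.Tensor, l1_coeff: float = 1e-3) -> float:
    x_hat, f = sae(batch)
    # Reconstruction error plus an L1 sparsity penalty on the feature activations.
    loss = F.mse_loss(x_hat, batch) + l1_coeff * f.abs().sum(dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    sae.normalize_decoder()
    return loss.item()
```

Training then amounts to looping `train_step` over shuffled batches of the stored activation vectors; the L1 coefficient trades off reconstruction quality against sparsity (and hence interpretability) of the learned features.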
- Incorporate the effects of feature ablations in the feature interface.
- Implement an interface to see "Feature Activations on Example Texts" as done by Anthropic here.
- Modify the code so that one can train a sparse autoencoder on activations of any MLP / attention layer.
There are several other very interesting works on the web exploring sparse dictionary learning. Here is a small subset of them.
- Sparse Autoencoders Find Highly Interpretable Features in Language Models by Cunningham et al.
- Sparse Autoencoders Work on Attention Layer Outputs by Kissane et al.
- Joseph Bloom's SAE codebase along with a blogpost on trained SAEs for all residual stream layers of GPT-2 small
- Neel Nanda's SAE codebase along with a blogpost
- Callum McDougall's exercises on SAEs
- SAE library by the AI Safety Foundation