
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

This repository reproduces the results of Anthropic's sparse dictionary learning paper. The codebase is quite rough, but the results are excellent. See the feature interface to browse the features learned by the sparse autoencoder. There are improvements to be made (see the TODOs section below), and I will work on them intermittently as I juggle things in life :)

I trained a 1-layer transformer model from scratch using nanoGPT with $d_{\text{model}} = 128$. Then, I trained a sparse autoencoder with $4096$ features on its MLP activations, as in Anthropic's paper. 93% of the autoencoder neurons were alive, and only 5% of those were of ultra-low density. The autoencoder learned many interesting features: for example, there is a feature for the French language, and a feature each for German, Japanese, and several other languages, among many others.
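To make the setup concrete, here is a minimal sketch of a sparse autoencoder of this shape (128-dimensional MLP activations mapped to 4096 features and back, with the decoder bias subtracted from the input before encoding, as in Anthropic's setup). The class and argument names are illustrative and not necessarily those used in this codebase:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    # Maps d_model-dimensional MLP activations to n_features sparse features and back.
    def __init__(self, d_model: int = 128, n_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        # Subtract the decoder bias before encoding ("pre-encoder bias"),
        # following Anthropic's setup.
        x_bar = x - self.decoder.bias
        f = F.relu(self.encoder(x_bar))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the MLP activation
        return x_hat, f
```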

Training Details

I used the OpenWebText dataset to train the transformer model, to generate the dataset of MLP activations for the autoencoder, and to generate the feature interface visualizations. The transformer model had $d_{\text{model}} = 128$, $d_{\text{MLP}} = 512$, and $n_{\text{head}} = 4$. I trained this model for $2 \times 10^5$ iterations to roughly match the number of training epochs in Anthropic's procedure.
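As a rough illustration, these settings correspond to a nanoGPT-style config file along the following lines. Only the values stated in this README are filled in; the file name is made up, and all other hyperparameters are assumed to follow nanoGPT's defaults:

```python
# train_owt_1layer.py -- hypothetical nanoGPT config file, passed to train.py
dataset = 'openwebtext'
n_layer = 1          # 1-layer transformer
n_head = 4           # n_head = 4
n_embd = 128         # d_model = 128; nanoGPT's MLP width is 4 * n_embd = 512
block_size = 1024    # assumed context length, matching the 1024-token prompts used below
max_iters = 200000   # 2e5 training iterations
```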

I collected a dataset of 4B MLP activation vectors by performing forward passes on 20M prompts (each 1024 tokens long) and keeping 200 activation vectors from each prompt. Next, I trained the autoencoder for approximately $5 \times 10^5$ training steps at batch size 8192 and learning rate $3 \times 10^{-4}$. I performed neuron resampling 4 times during training, at steps $2.5 \times i \times 10^4$ for $i = 1, 2, 3, 4$. The L1 coefficient for this training run was $10^{-3}$; I selected the L1 coefficient and the learning rate by performing a grid search. See the W&B page for a complete log of the training run.
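A single autoencoder training step with these hyperparameters might look like the sketch below, reusing the SparseAutoencoder sketch above. The loss is a reconstruction term plus an L1 penalty on the feature activations; the exact normalization of the two terms is my assumption, and neuron resampling is only indicated in a comment:

```python
import torch

l1_coeff = 1e-3                                         # L1 coefficient chosen by grid search
sae = SparseAutoencoder(d_model=128, n_features=4096)
optimizer = torch.optim.Adam(sae.parameters(), lr=3e-4)

def training_step(acts: torch.Tensor) -> float:
    # acts: a (8192, 128) batch of MLP activation vectors
    x_hat, f = sae(acts)
    recon_loss = (x_hat - acts).pow(2).sum(dim=-1).mean()  # squared reconstruction error
    l1_penalty = f.abs().sum(dim=-1).mean()                # encourages sparse feature activations
    loss = recon_loss + l1_coeff * l1_penalty
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()

# Neuron resampling (reinitializing dead features) is applied at steps
# 25k, 50k, 75k, and 100k, as described above.
```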

For the most part, I followed the training procedure described in the appendix of Anthropic's original paper. I did not incorporate the improvements suggested in their January and February updates.

TODOs

  • Incorporate the effects of feature ablations into the feature interface.
  • Implement an interface to see "Feature Activations on Example Texts", as done by Anthropic here.
  • Modify the code so that one can train a sparse autoencoder on the activations of any MLP or attention layer.

Related Work

There are several other very interesting works exploring sparse dictionary learning. Here is a small subset of them.
