This is a GitHub repo hosting some exercises on sparse autoencoders, which I've recently finished working on as part of the upcoming ARENA 3.0 iteration. Having spoken to Neel Nanda and others in interpretability-related MATS streams, it seemed useful to make these exercises accessible outside the context of the rest of the ARENA curriculum.
If you don't like working in Colabs, then you can clone this repo, download the exercises & solutions Colabs as notebooks, and run them in the same directory.
The exercises build on the Toy Models of Superposition exercises from the previous iteration, with new sparse autoencoder content added. They fall into 2 groups:
First, we take the toy models from Anthropic's Toy Models of Superposition paper (for which there are also exercises), and train sparse autoencoders on the representations learned by these toy models. These exercises culminate in using neuron resampling to successfully recover all the learned features from the toy model of bottleneck superposition - see this animation.
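To give a rough idea of what these exercises involve, here is a minimal sketch of a sparse autoencoder in PyTorch. This is not the implementation used in the exercises, and the names (`SparseAutoencoder`, `d_hidden`, `d_sae`, `sae_loss`) are purely illustrative: the SAE reconstructs a toy model's hidden activations, with an L1 penalty on the feature activations to encourage sparsity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    # Hypothetical minimal SAE: an encoder/decoder pair over a toy model's hidden space.
    def __init__(self, d_hidden: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_hidden, d_sae) * 0.1)
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_hidden) * 0.1)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_hidden))

    def forward(self, h: torch.Tensor):
        # ReLU encoder: non-negative (and hopefully sparse) feature activations
        acts = F.relu((h - self.b_dec) @ self.W_enc + self.b_enc)
        # Linear decoder: reconstruct the original hidden activations
        h_hat = acts @ self.W_dec + self.b_dec
        return acts, h_hat

def sae_loss(h, acts, h_hat, l1_coeff: float = 1e-3):
    # Reconstruction error, plus an L1 penalty pushing feature activations towards sparsity
    reconstruction = (h_hat - h).pow(2).mean()
    sparsity = acts.abs().sum(dim=-1).mean()
    return reconstruction + l1_coeff * sparsity

# Toy usage: pretend h is a batch of 2-dimensional hidden activations from a toy model
sae = SparseAutoencoder(d_hidden=2, d_sae=5)
h = torch.randn(256, 2)
acts, h_hat = sae(h)
loss = sae_loss(h, acts, h_hat)
loss.backward()
```

Neuron resampling - reinitialising SAE features that stop firing, so they can pick up features the autoencoder has missed - is what these exercises build up to; it isn't shown in the sketch above.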
Second, there are exercises on interpreting an SAE trained on a transformer, where you can discover some cool learned features (e.g. a neuron exhibiting skip trigram-like behaviour, which activates on left-brackets following Django-related syntax and predicts the completion `('` -> `django`).
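To make this second group concrete, here is a rough sketch of how you might inspect SAE feature activations over a transformer's MLP activations using TransformerLens (which the ARENA curriculum uses elsewhere). This is not the exact setup used in the exercises: the model name, hook name, and prompt are illustrative, and `sae` is assumed to be an already-trained autoencoder (like the one sketched above) whose `d_hidden` matches the model's `d_mlp`.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gelu-1l")  # placeholder 1-layer model
tokens = model.to_tokens("from django.urls import path\n\nurlpatterns = [path('")
_, cache = model.run_with_cache(tokens)

mlp_acts = cache["blocks.0.mlp.hook_post"]   # shape [batch, seq_len, d_mlp]
feature_acts, _ = sae(mlp_acts)              # encode MLP activations into SAE features
top_features = feature_acts[0, -1].topk(5)   # most active features on the final token
print(top_features.indices, top_features.values)
```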
You can either read through the Solutions colab (which has all output displayed & explained), or go through the Exercises colab and fill in the functions according to the specifications you are given, looking at the Solutions when you're stuck. Both colabs come with test functions you can run to verify your solution works.
I've listed all the exercises below (although I expect most readers will only be interested in the sparse autoencoder exercises). Each set of exercises is labelled with its prerequisites: for instance, the label (1*, 3) means the first set of exercises is essential, and the third is recommended but not essential.
Abbreviations: TMS = Toy Models of Superposition, SAE = Sparse Autoencoders.
1. TMS: Superposition in a Nonprivileged Basis
2. TMS: Correlated / Anticorrelated Features (1*)
3. TMS: Superposition in a Privileged Basis (1*)
4. TMS: Feature Geometry (1*)
5. SAEs in Toy Models (1*, 3)
6. SAEs in Real Models (1*, 5*, 3)
Please reach out to me if you have any questions or suggestions about these exercises (either by email at cal.s.mcdougall@gmail.com, or a LessWrong private message / comment on this post). Happy coding!