Toy Models of Superposition is a groundbreaking machine learning research paper published in 2022 by authors affiliated with Anthropic and Harvard. It investigates how small “toy models” are able to represent more features than they have neurons (a phenomenon the authors call “superposition”).
This repository includes a replication of the experiments from the introduction and sections 2 and 3 of the original paper. I wrote up all my findings in a PDF document, which can be found under the filename FINDINGS.pdf in this repository. That document details the results of my experiments along with additional commentary on section 1 of the original paper.
- A complete writeup of all my findings: This can be found under the filename FINDINGS.pdf in this repository. Because this repository is dedicated to a paper replication, it felt natural to write down my findings in one complete document. The LaTeX source used to create the PDF is also provided as FINDINGS.tex.
- The code: Jupyter notebooks are provided that can be used to reproduce the results of all the experiments I conducted in my replication. The notebooks are organized into folders and given descriptive filenames so everything is easy to find.
- Summary: For people who (reasonably) don't want to read through all of FINDINGS.pdf, I have provided a more concise summary of my findings in this markdown file (which admittedly also isn't too short).
This summary is a brief overview of my findings from FINDINGS.pdf.
In this section, I provide a motivating example of superposition by replicating a figure from the introduction of the original paper. My replication (shown below) illustrates the internal structure of a toy model by graphing each column of the model’s weight matrix as a vector.
The model studied had two neurons, meaning each column of the weight matrix could be understood as a 2D vector representing a distinct input feature. These vectors can then be graphed to show the direction in which, and the extent to which, each feature is represented. This experiment ultimately shows that if the training data for a model is sparse enough, the model can represent more features than it has neurons (which is precisely the definition of superposition).
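As a rough illustration of how such a figure can be produced, here is a minimal matplotlib sketch. The weight matrix below is a random placeholder; in the actual notebooks it would come from a trained two-neuron model.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder weight matrix: 2 neurons (rows) by 5 features (columns).
# In the notebooks, W would be taken from a trained toy model.
W = np.random.randn(2, 5)

fig, ax = plt.subplots(figsize=(4, 4))
for i in range(W.shape[1]):
    # Each column W[:, i] is the 2D direction in which feature i is embedded.
    ax.annotate("", xy=(W[0, i], W[1, i]), xytext=(0, 0),
                arrowprops=dict(arrowstyle="->"))
    ax.text(W[0, i], W[1, i], f"feature {i}")

lim = np.abs(W).max() * 1.2
ax.set_xlim(-lim, lim)
ax.set_ylim(-lim, lim)
ax.set_aspect("equal")
ax.set_title("Columns of W as 2D feature directions")
plt.show()
```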
This section of my replication explores both a linear model, defined by x′ = WᵀWx + b, and a ReLU output model, defined by x′ = ReLU(WᵀWx + b).
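A minimal PyTorch sketch of these two model variants (the notebooks are the authoritative implementation; the parameter names and initialization below are my own choices):

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Toy model: x' = W^T W x + b, optionally passed through a ReLU."""

    def __init__(self, n_features: int, n_hidden: int, use_relu: bool = True):
        super().__init__()
        # W maps features into a smaller hidden space (n_hidden < n_features).
        self.W = nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))
        self.use_relu = use_relu

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x @ self.W.T           # project features down: (batch, n_hidden)
        out = h @ self.W + self.b  # reconstruct features: (batch, n_features)
        return torch.relu(out) if self.use_relu else out
```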
In the graphic below, the grids on the top represent WᵀW for each trained model, and the bar charts on the bottom indicate how strongly each feature is represented and whether it is represented in superposition.
Increasing the sparsity of the training data for the ReLU model (x′ = ReLU(WᵀWx + b)) causes it to begin representing additional features in superposition, rather than only the most important features orthogonally.
As one continues to increase the sparsity of the ReLU model, it ceases to represent any features orthogonally. Note that all the bar charts in the bottom of the figure below are colored blue to illustrate that the model is representing the corresponding features in superposition. As a result, the grids representing WᵀW show significant off-diagonal entries, meaning the features interfere with one another.
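For reference, the sparsity knob refers to the synthetic training data: each feature is zero with some probability S and otherwise takes a random value, and the model is trained with an importance-weighted reconstruction loss. Below is a hedged sketch of that setup, reusing the ToyModel class above; the hyperparameters are placeholders, not the exact values used in my notebooks.

```python
import torch

def sample_batch(batch_size: int, n_features: int, sparsity: float) -> torch.Tensor:
    # Each feature is active (uniform in [0, 1]) with probability 1 - sparsity,
    # and exactly zero otherwise.
    values = torch.rand(batch_size, n_features)
    mask = torch.rand(batch_size, n_features) > sparsity
    return values * mask

def importance_weighted_mse(x: torch.Tensor, x_hat: torch.Tensor,
                            importance: torch.Tensor) -> torch.Tensor:
    # Importance-weighted reconstruction loss: sum_i I_i * (x_i - x'_i)^2,
    # averaged over the batch.
    return (importance * (x - x_hat) ** 2).sum(dim=-1).mean()

# Example usage with the ToyModel sketch above (assumed hyperparameters).
model = ToyModel(n_features=5, n_hidden=2, use_relu=True)
importance = 0.9 ** torch.arange(5)  # geometrically decaying importance (assumed schedule)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(10_000):
    x = sample_batch(1024, n_features=5, sparsity=0.9)
    loss = importance_weighted_mse(x, model(x), importance)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```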
This section of the replication is still being developed. The information below is what I have found so far, but I have already updated some of my beliefs and expect to do so again. As a result, this repository will be updated with new findings in the near future.
The authors of Toy Models of Superposition claim that transitions between different internal structures within a model can be thought of as phase changes.
The graphic below shows three phase diagrams for one-neuron models, produced by training groups of ReLU models (x′ = ReLU(WᵀWx + b)) and linear models across a range of sparsities and relative feature importances.
This part of the replication was perhaps the most difficult: it involved training 1,000,000 one-neuron ReLU models and 100,000 one-neuron linear models.
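To give a sense of how such a sweep is structured, here is a simplified sketch that reuses the helpers above. The grid resolution, step counts, and the thresholds used to classify each outcome are placeholders rather than the settings I actually used.

```python
import numpy as np
import torch

def train_one_neuron_model(sparsity: float, rel_importance: float,
                           use_relu: bool = True, steps: int = 2_000) -> torch.Tensor:
    # Two features, one hidden neuron: the model must choose which feature(s)
    # to represent. Returns the learned 1x2 weight matrix.
    model = ToyModel(n_features=2, n_hidden=1, use_relu=use_relu)
    importance = torch.tensor([1.0, rel_importance])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        x = sample_batch(256, n_features=2, sparsity=sparsity)
        loss = importance_weighted_mse(x, model(x), importance)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model.W.detach()

# Sweep over sparsity and relative importance to build one phase diagram.
sparsities = np.linspace(0.0, 0.99, 20)
importances = np.geomspace(0.1, 10.0, 20)
phase = np.zeros((len(sparsities), len(importances)))
for i, s in enumerate(sparsities):
    for j, r in enumerate(importances):
        W = train_one_neuron_model(s, r)
        # Crude classification: 0 or 1 = only that feature is represented,
        # 2 = both features share the single neuron (superposition).
        norms = W.abs().squeeze(0)
        phase[i, j] = 2 if norms.min() > 0.1 else norms.argmax().item()
```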
This replication demonstrates that it is possible for models to represent features in superposition. It also shows that the phenomenon of superposition is somewhat predictable. For example, models trained on sparser data are more likely to represent their features in superposition.
With that being said, there are some serious limitations to thinking about neural networks in this way. In all the examples in this replication, the results were highly dependent on exact training conditions such as learning rate and batch size. Thus, I believe it is wise to be cautious when making broad claims about models, such as "model x will represent less information in superposition than model y because model x is larger." The truth is that many factors influence whether a model will represent information orthogonally or in superposition.