This repository contains the code used for the experiments in the paper "Discovering Variable Binding Circuitry with Desiderata".
Visualising (self)-attention as a vector field: exploring and building intuition. Based on anvaka.github.io/fieldplay.
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
Identifying Circuit behind Pronoun Prediction in GPT-2 Small
A replication of "Toy Models of Superposition," a groundbreaking machine learning research paper published by authors affiliated with Anthropic and Harvard in 2022.
graphpatch is a library for activation patching on PyTorch neural network models.
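A minimal sketch of what activation patching involves, using plain PyTorch forward hooks. This is illustrative only and is not graphpatch's API; `model`, `layer_name`, `clean_input`, and `corrupted_input` are hypothetical placeholders.

```python
# Illustrative activation patching with PyTorch hooks (not graphpatch's API).
import torch

def get_activation(model, layer_name, inputs):
    """Run the model and capture the output of one named submodule."""
    cache = {}
    def hook(_module, _inp, out):
        cache["act"] = out.detach()
    handle = dict(model.named_modules())[layer_name].register_forward_hook(hook)
    with torch.no_grad():
        model(inputs)
    handle.remove()
    return cache["act"]

def run_with_patch(model, layer_name, inputs, patched_act):
    """Run the model, replacing one submodule's output with a stored activation."""
    def hook(_module, _inp, _out):
        return patched_act  # returning a value from the hook overrides the output
    handle = dict(model.named_modules())[layer_name].register_forward_hook(hook)
    with torch.no_grad():
        out = model(inputs)
    handle.remove()
    return out

# Hypothetical usage:
# clean_act = get_activation(model, "transformer.h.3.mlp", clean_input)
# patched_logits = run_with_patch(model, "transformer.h.3.mlp", corrupted_input, clean_act)
```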
Physiological modeling into the metaverse of Mycobacterium tuberculosis beta CA inhibition mechanism
Replication of the Anthropic interpretability paper "Toy Models of Superposition" by Elhage et al. (2022)
Reverse-engineered Transformer models as a benchmark for interpretability methods
Exploring length generalization in the context of indirect object identification (IOI) task for mechanistic interpretability.
A curated reading list of research in Sparse Autoencoders and related topics in Mechanistic Interpretability
A project that simulates a game of shuffling cups with a hidden ball underneath one of them. It also trains a Transformer-based deep learning model to predict the final position of the ball after a series of swaps.
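A small sketch of how such cup-shuffle training data might be generated (an assumption about the task setup, not the repository's actual code). Cup count, swap count, and the tokenization hint are hypothetical.

```python
# Hypothetical cup-shuffle data generator: track a ball through random swaps.
import random

def simulate_shuffle(n_cups=3, n_swaps=5, seed=None):
    """Place a ball under one cup, apply random swaps, and return
    the swap sequence plus the ball's final position."""
    rng = random.Random(seed)
    ball = rng.randrange(n_cups)
    swaps = []
    for _ in range(n_swaps):
        a, b = rng.sample(range(n_cups), 2)
        swaps.append((a, b))
        if ball == a:
            ball = b
        elif ball == b:
            ball = a
    return swaps, ball

# Example: the flattened swap pairs could serve as the input sequence,
# with the final ball position as the prediction target.
# swaps, target = simulate_shuffle(seed=0)
```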
Interpretability on 1-layer Transformer models that converge on the Bayesian-optimal solution for statistical tasks
[ACL'2024 Findings] "Understanding and Patching Compositional Reasoning in LLMs"
Code for "Preference Tuning For Toxicity Mitigation Generalizes Across Languages." Paper accepted at Findings of EMNLP 2024
Multi-Layer Sparse Autoencoders
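For orientation, a minimal single-layer sparse autoencoder sketch in PyTorch; the repository above concerns multi-layer variants, and the dimensions and L1 coefficient here are assumptions.

```python
# Minimal single-layer sparse autoencoder sketch (sizes are illustrative).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        # ReLU encoding gives non-negative feature activations;
        # an L1 penalty on them encourages sparsity.
        feats = torch.relu(self.encoder(x))
        recon = self.decoder(feats)
        return recon, feats

def sae_loss(x, recon, feats, l1_coeff=1e-3):
    """Reconstruction error plus sparsity penalty on feature activations."""
    return ((recon - x) ** 2).mean() + l1_coeff * feats.abs().mean()
```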
A library to visualize features learned by CNNs
Organizer's repository for the Transformer Interpretability CodaBench competition
Data and code for the paper: Finding Safety Neurons in Large Language Models