This repository contains the code used for the experiments in the paper "Discovering Variable Binding Circuitry with Desiderata".
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
graphpatch is a library for activation patching on PyTorch neural network models.
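Activation patching, as in the repository above, swaps an activation cached from one run into another run of the model. The sketch below is a minimal illustration in plain PyTorch forward hooks, not graphpatch's own API; the toy model and layer choice are arbitrary.

```python
import torch
import torch.nn as nn

# Toy model; in practice this would be a transformer and the patched
# site would be, e.g., a residual-stream activation at one layer.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
target = model[0]  # layer whose output we cache and patch

clean_x = torch.randn(1, 4)
corrupt_x = torch.randn(1, 4)

# 1) Clean run: cache the target layer's output.
cache = {}
handle = target.register_forward_hook(
    lambda mod, inp, out: cache.update(act=out.detach())
)
clean_out = model(clean_x)
handle.remove()

# 2) Corrupted run: a hook that returns a tensor replaces the layer's
# output, so everything downstream computes from the clean activation.
handle = target.register_forward_hook(lambda mod, inp, out: cache["act"])
patched_out = model(corrupt_x)
handle.remove()
```

Because every module downstream of the patched layer is deterministic here, `patched_out` exactly matches `clean_out`; in a real experiment one instead measures how much of a behavioral metric the patch restores.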
Exploring length generalization in the context of indirect object identification (IOI) task for mechanistic interpretability.
[ACL'2024 Findings] "Understanding and Patching Compositional Reasoning in LLMs"
Multi-Layer Sparse Autoencoders
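A sparse autoencoder in this literature is typically an overcomplete autoencoder trained to reconstruct activations under a sparsity penalty on the hidden code. The following is an illustrative single-layer sketch, not the architecture of the multi-layer repository above; all dimensions and coefficients are arbitrary.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete ReLU autoencoder with an L1-sparse hidden code."""

    def __init__(self, d_model: int = 16, d_hidden: int = 64):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.enc(x))  # nonnegative, hopefully sparse code
        return self.dec(z), z

torch.manual_seed(0)
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
data = torch.randn(256, 16)  # stand-in for cached model activations

for _ in range(50):
    recon, z = sae(data)
    # Reconstruction loss plus L1 penalty encouraging sparse codes.
    loss = ((recon - data) ** 2).mean() + 1e-3 * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

recon, z = sae(data)
```

In practice the input would be activations cached from a language model rather than random data, and the hidden units are then inspected as candidate interpretable features.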
Exploratory WYSIWYG editor
A framework for conducting interpretability research and for developing an LLM from a synthetic dataset.
Starting Kit for the CodaBench competition on Transformer Interpretability
PyTorch and NNsight implementation of AtP* (Kramár et al., 2024, DeepMind)
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Sparse and discrete interpretability tool for neural networks
Steering vectors for transformer language models in PyTorch / Hugging Face
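The core idea behind steering vectors is to add a fixed direction to a model's activations at inference time to shift its behavior. The sketch below shows the mechanism with a plain PyTorch forward hook on a toy model; it is not the API of the library above, and the steering direction here is a made-up example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Hypothetical steering direction in the 8-dim hidden space; in practice
# this is derived from contrastive activation pairs, not hand-written.
steer = torch.zeros(8)
steer[0] = 5.0

# A forward hook that returns a tensor replaces the layer's output,
# so the hidden activation is shifted by `steer` before the final layer.
handle = model[1].register_forward_hook(lambda mod, inp, out: out + steer)

x = torch.randn(1, 4)
steered = model(x)
handle.remove()
baseline = model(x)
```

Removing the hook restores the unmodified model, which makes it easy to compare steered and baseline outputs for the same input.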
Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs into computer code and discovering new algorithms that generalize out-of-distribution and outperform human-designed algorithms
Code for the OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.
Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions