Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
This repository collects all relevant resources about interpretability in LLMs
Decomposing and Editing Predictions by Modeling Model Computation
Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing neural networks into computer code and discovering new algorithms that generalize out-of-distribution and outperform human-designed algorithms
Interpreting how transformers simulate agents performing RL tasks
Steering vectors for transformer language models in PyTorch / Hugging Face (a minimal contrastive-steering sketch appears after this list)
🧠 Starter templates for doing interpretability research
Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".
Sparse and discrete interpretability tool for neural networks
Full code for the sparse probing paper.
Code for the OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.
Generating and validating natural-language explanations.
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Universal Neurons in GPT2 Language Models
Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.
This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".
Code for "Preference Tuning For Toxicity Mitigation Generalizes Across Languages." Paper accepted at Findings of EMNLP 2024
CoSy: Evaluating Textual Explanations
PyTorch and NNsight implementation of AtP* (Kramár et al., 2024, DeepMind)
Multi-Layer Sparse Autoencoders
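Several entries above (the OpenMOSS SAE research code and the multi-layer sparse autoencoders repo) revolve around sparse autoencoders trained on model activations. As a rough, self-contained sketch of the general idea, and not the API of any repository listed here, a minimal PyTorch SAE with a ReLU encoder and an L1 sparsity penalty might look like this; the dimensions and the l1_coeff value are illustrative placeholders:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over cached residual-stream activations.

    Hyperparameters (d_model, dict_size, l1_coeff) are illustrative
    placeholders, not values taken from any repository above.
    """

    def __init__(self, d_model: int = 768, dict_size: int = 16384, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, dict_size)
        self.decoder = nn.Linear(dict_size, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, acts: torch.Tensor):
        # Encode activations into a sparse, overcomplete feature basis.
        features = torch.relu(self.encoder(acts))
        recon = self.decoder(features)
        # Reconstruction error plus an L1 sparsity penalty on the features.
        loss = ((recon - acts) ** 2).mean() + self.l1_coeff * features.abs().sum(dim=-1).mean()
        return recon, features, loss

# Usage: train on activations cached from one (or, for multi-layer SAEs, several) model layers.
sae = SparseAutoencoder()
batch = torch.randn(32, 768)  # stand-in for real residual-stream activations
_, features, loss = sae(batch)
loss.backward()
```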
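The steering-vectors entry above refers to adding direction vectors to a model's residual stream at inference time to shift its behavior. Below is a hedged sketch of one common variant, a contrastive-activation steering vector built from a pair of prompts and injected with a forward hook; the model (gpt2), layer index, prompts, and the 4.0 scale are arbitrary illustrative choices and do not reflect the linked library's actual interface:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any GPT-2-style causal LM exposing model.transformer.h works the same way.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
layer = model.transformer.h[6]  # illustrative layer choice

def mean_resid(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation at the chosen layer for a prompt."""
    captured = {}
    def hook(module, inputs, output):
        captured["acts"] = output[0]  # hidden states, shape (1, seq, d_model)
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return captured["acts"].mean(dim=1).squeeze(0)

# Contrastive steering vector: difference of mean activations for two prompts.
steer = mean_resid("I love this, it is wonderful.") - mean_resid("I hate this, it is terrible.")

def add_steering(module, inputs, output):
    # Add the scaled steering vector to every position's residual stream.
    return (output[0] + 4.0 * steer,) + output[1:]

handle = layer.register_forward_hook(add_steering)
out = model.generate(**tok("The movie was", return_tensors="pt"), max_new_tokens=20)
handle.remove()
print(tok.decode(out[0]))
```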