Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
This repository collects all relevant resources about interpretability in LLMs
Decomposing and Editing Predictions by Modeling Model Computation
Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing neural networks into computer code and discovering new algorithms that generalize out-of-distribution and outperform human-designed algorithms
Interpreting how transformers simulate agents performing RL tasks
Steering vectors for transformer language models in PyTorch / Hugging Face (a minimal contrastive-steering sketch appears after this list)
🧠 Starter templates for doing interpretability research
Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".
Sparse and discrete interpretability tool for neural networks
Full code for the sparse probing paper.
Code for the OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.
Generating and validating natural-language explanations.
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Universal Neurons in GPT2 Language Models
Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.
This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".
Code for "Preference Tuning For Toxicity Mitigation Generalizes Across Languages." Paper accepted at Findings of EMNLP 2024
CoSy: Evaluating Textual Explanations
PyTorch and NNsight implementation of AtP* (Kramár et al., 2024, DeepMind)
Multi-Layer Sparse Autoencoders
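Several entries above (the OpenMOSS SAE research code and the multi-layer sparse autoencoders repo) revolve around sparse autoencoders trained on model activations. As a rough, self-contained sketch of the general idea, and not the API of any repository listed here, a minimal PyTorch SAE with a ReLU encoder and an L1 sparsity penalty might look like this; the dimensions and the l1_coeff value are illustrative placeholders:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over cached residual-stream activations.

    Hyperparameters (d_model, dict_size, l1_coeff) are illustrative
    placeholders, not values taken from any repository above.
    """

    def __init__(self, d_model: int = 768, dict_size: int = 16384, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, dict_size)
        self.decoder = nn.Linear(dict_size, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, acts: torch.Tensor):
        # Encode activations into a sparse, overcomplete feature basis.
        features = torch.relu(self.encoder(acts))
        recon = self.decoder(features)
        # Reconstruction error plus an L1 sparsity penalty on the features.
        loss = ((recon - acts) ** 2).mean() + self.l1_coeff * features.abs().sum(dim=-1).mean()
        return recon, features, loss

# Usage: train on activations cached from one (or, for multi-layer SAEs, several) model layers.
sae = SparseAutoencoder()
batch = torch.randn(32, 768)  # stand-in for real residual-stream activations
_, features, loss = sae(batch)
loss.backward()
```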
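The steering-vectors entry above refers to adding direction vectors to a model's residual stream at inference time to shift its behavior. Below is a hedged sketch of one common variant, a contrastive-activation steering vector built from a pair of prompts and injected with a forward hook; the model (gpt2), layer index, prompts, and the 4.0 scale are arbitrary illustrative choices and do not reflect the linked library's actual interface:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any GPT-2-style causal LM exposing model.transformer.h works the same way.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
layer = model.transformer.h[6]  # illustrative layer choice

def mean_resid(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation at the chosen layer for a prompt."""
    captured = {}
    def hook(module, inputs, output):
        captured["acts"] = output[0]  # hidden states, shape (1, seq, d_model)
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return captured["acts"].mean(dim=1).squeeze(0)

# Contrastive steering vector: difference of mean activations for two prompts.
steer = mean_resid("I love this, it is wonderful.") - mean_resid("I hate this, it is terrible.")

def add_steering(module, inputs, output):
    # Add the scaled steering vector to every position's residual stream.
    return (output[0] + 4.0 * steer,) + output[1:]

handle = layer.register_forward_hook(add_steering)
out = model.generate(**tok("The movie was", return_tensors="pt"), max_new_tokens=20)
handle.remove()
print(tok.decode(out[0]))
```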