This repository contains the code used for the experiments in the paper "Discovering Variable Binding Circuitry with Desiderata".
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
graphpatch is a library for activation patching on PyTorch neural network models.
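Activation patching, as in the repository above, swaps an activation cached from one run into another run of the model. The sketch below is a minimal illustration in plain PyTorch forward hooks, not graphpatch's own API; the toy model and layer choice are arbitrary.

```python
import torch
import torch.nn as nn

# Toy model; in practice this would be a transformer and the patched
# site would be, e.g., a residual-stream activation at one layer.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
target = model[0]  # layer whose output we cache and patch

clean_x = torch.randn(1, 4)
corrupt_x = torch.randn(1, 4)

# 1) Clean run: cache the target layer's output.
cache = {}
handle = target.register_forward_hook(
    lambda mod, inp, out: cache.update(act=out.detach())
)
clean_out = model(clean_x)
handle.remove()

# 2) Corrupted run: a hook that returns a tensor replaces the layer's
# output, so everything downstream computes from the clean activation.
handle = target.register_forward_hook(lambda mod, inp, out: cache["act"])
patched_out = model(corrupt_x)
handle.remove()
```

Because every module downstream of the patched layer is deterministic here, `patched_out` exactly matches `clean_out`; in a real experiment one instead measures how much of a behavioral metric the patch restores.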
Exploring length generalization in the context of indirect object identification (IOI) task for mechanistic interpretability.
[ACL'2024 Findings] "Understanding and Patching Compositional Reasoning in LLMs"
Multi-Layer Sparse Autoencoders
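A sparse autoencoder in this literature is typically an overcomplete autoencoder trained to reconstruct activations under a sparsity penalty on the hidden code. The following is an illustrative single-layer sketch, not the architecture of the multi-layer repository above; all dimensions and coefficients are arbitrary.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete ReLU autoencoder with an L1-sparse hidden code."""

    def __init__(self, d_model: int = 16, d_hidden: int = 64):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.enc(x))  # nonnegative, hopefully sparse code
        return self.dec(z), z

torch.manual_seed(0)
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
data = torch.randn(256, 16)  # stand-in for cached model activations

for _ in range(50):
    recon, z = sae(data)
    # Reconstruction loss plus L1 penalty encouraging sparse codes.
    loss = ((recon - data) ** 2).mean() + 1e-3 * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

recon, z = sae(data)
```

In practice the input would be activations cached from a language model rather than random data, and the hidden units are then inspected as candidate interpretable features.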
Exploratory WYSIWYG editor
A framework for conducting interpretability research and for developing an LLM from a synthetic dataset.
Starting Kit for the CodaBench competition on Transformer Interpretability
PyTorch and NNsight implementation of AtP* (Kramár et al., 2024, DeepMind)
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Sparse and discrete interpretability tool for neural networks
Steering vectors for transformer language models in PyTorch / Hugging Face
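The core idea behind steering vectors is to add a fixed direction to a model's activations at inference time to shift its behavior. The sketch below shows the mechanism with a plain PyTorch forward hook on a toy model; it is not the API of the library above, and the steering direction here is a made-up example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Hypothetical steering direction in the 8-dim hidden space; in practice
# this is derived from contrastive activation pairs, not hand-written.
steer = torch.zeros(8)
steer[0] = 5.0

# A forward hook that returns a tensor replaces the layer's output,
# so the hidden activation is shifted by `steer` before the final layer.
handle = model[1].register_forward_hook(lambda mod, inp, out: out + steer)

x = torch.randn(1, 4)
steered = model(x)
handle.remove()
baseline = model(x)
```

Removing the hook restores the unmodified model, which makes it easy to compare steered and baseline outputs for the same input.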
Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs into computer code and discovering new algorithms that generalize out-of-distribution and outperform human-designed algorithms
Code for the OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.
Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions