This repository contains the code used for the experiments in the paper "Discovering Variable Binding Circuitry with Desiderata".
Visualising (self)-attention as a vector field: exploring and building intuition. Based on anvaka.github.io/fieldplay.
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
Identifying Circuit behind Pronoun Prediction in GPT-2 Small
A replication of "Toy Models of Superposition," a groundbreaking machine learning research paper published by authors affiliated with Anthropic and Harvard in 2022.
graphpatch is a library for activation patching on PyTorch neural network models.
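A minimal sketch of what activation patching involves, using plain PyTorch forward hooks. This is illustrative only and is not graphpatch's API; `model`, `layer_name`, `clean_input`, and `corrupted_input` are hypothetical placeholders.

```python
# Illustrative activation patching with PyTorch hooks (not graphpatch's API).
import torch

def get_activation(model, layer_name, inputs):
    """Run the model and capture the output of one named submodule."""
    cache = {}
    def hook(_module, _inp, out):
        cache["act"] = out.detach()
    handle = dict(model.named_modules())[layer_name].register_forward_hook(hook)
    with torch.no_grad():
        model(inputs)
    handle.remove()
    return cache["act"]

def run_with_patch(model, layer_name, inputs, patched_act):
    """Run the model, replacing one submodule's output with a stored activation."""
    def hook(_module, _inp, _out):
        return patched_act  # returning a value from the hook overrides the output
    handle = dict(model.named_modules())[layer_name].register_forward_hook(hook)
    with torch.no_grad():
        out = model(inputs)
    handle.remove()
    return out

# Hypothetical usage:
# clean_act = get_activation(model, "transformer.h.3.mlp", clean_input)
# patched_logits = run_with_patch(model, "transformer.h.3.mlp", corrupted_input, clean_act)
```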
Physiological modeling into the metaverse of Mycobacterium tuberculosis beta CA inhibition mechanism
Replication of the Anthropic interpretability paper "Toy Models of Superposition" by Elhage et al. (2022)
Reverse-engineered Transformer models as a benchmark for interpretability methods
Exploring length generalization in the context of indirect object identification (IOI) task for mechanistic interpretability.
A curated reading list of research in Sparse Autoencoders and related topics in Mechanistic Interpretability
A project that simulates a game of shuffling cups with a hidden ball underneath one of them. It also trains a Transformer-based deep learning model to predict the final position of the ball after a series of swaps.
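A small sketch of how such cup-shuffle training data might be generated (an assumption about the task setup, not the repository's actual code). Cup count, swap count, and the tokenization hint are hypothetical.

```python
# Hypothetical cup-shuffle data generator: track a ball through random swaps.
import random

def simulate_shuffle(n_cups=3, n_swaps=5, seed=None):
    """Place a ball under one cup, apply random swaps, and return
    the swap sequence plus the ball's final position."""
    rng = random.Random(seed)
    ball = rng.randrange(n_cups)
    swaps = []
    for _ in range(n_swaps):
        a, b = rng.sample(range(n_cups), 2)
        swaps.append((a, b))
        if ball == a:
            ball = b
        elif ball == b:
            ball = a
    return swaps, ball

# Example: the flattened swap pairs could serve as the input sequence,
# with the final ball position as the prediction target.
# swaps, target = simulate_shuffle(seed=0)
```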
Interpretability on 1-layer Transformer models that converge on the Bayesian-optimal solution for statistical tasks
[ACL'2024 Findings] "Understanding and Patching Compositional Reasoning in LLMs"
Code for "Preference Tuning For Toxicity Mitigation Generalizes Across Languages." Paper accepted at Findings of EMNLP 2024
Multi-Layer Sparse Autoencoders
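For orientation, a minimal single-layer sparse autoencoder sketch in PyTorch; the repository above concerns multi-layer variants, and the dimensions and L1 coefficient here are assumptions.

```python
# Minimal single-layer sparse autoencoder sketch (sizes are illustrative).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        # ReLU encoding gives non-negative feature activations;
        # an L1 penalty on them encourages sparsity.
        feats = torch.relu(self.encoder(x))
        recon = self.decoder(feats)
        return recon, feats

def sae_loss(x, recon, feats, l1_coeff=1e-3):
    """Reconstruction error plus sparsity penalty on feature activations."""
    return ((recon - x) ** 2).mean() + l1_coeff * feats.abs().mean()
```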
A library to visualize features learned by CNNs
Organizer's repository for the Transformer Interpretability CodaBench competition
Data and code for the paper: Finding Safety Neurons in Large Language Models