# Rigorously Assessing Natural Language Explanations of Neurons

We develop two modes of evaluation for natural language explanations that claim individual neurons represent a concept in a text input. We apply our framework to the GPT-4-generated explanations of GPT-2 XL neurons from Bills et al. (2023) and show that even the most confident explanations have high error rates and little to no causal efficacy.