Individual neurons participate in the representation of multiple high-level concepts. To what extent can different interpretability methods successfully disentangle these roles? To help address this question, we present a benchmark: RAVEL (Resolving Attribute–Value Entanglements in Language Models).
A demo on how to evaluate Sparse Autoencoder (SAE), Distributed Alignment Search (DAS), and Multi-task Distributed Alignment Search (MDAS) on RAVEL with TinyLlama.
- Install dependencies in `requirements.txt`:

  ```bash
  pip install -r requirements.txt
  ```
- Install pyvene from GitHub:

  ```bash
  git clone git@github.com:stanfordnlp/pyvene.git
  ```

  The code has been tested against the pyvene version at commit `d29f9591ca61753d66ba25f6cc3a4c05bab48480`.
RAVEL provides an entity-attribute dataset covering factual, linguistic, and commonsense knowledge. The dataset contains five types of entities, each with at least 500 instances, at least 4 attributes, and at least 50 prompt templates, as shown in the table below.
Entity Type | Attributes | #Entities | #Prompt Templates |
---|---|---|---|
City | Country, Language, Latitude, Longitude, Timezone, Continent | 3552 | 150 |
Nobel Laureate | Award Year, Birth Year, Country of Birth, Field, Gender | 928 | 100 |
Verb | Definition, Past Tense, Pronunciation, Singular | 986 | 60 |
Physical Object | Biological Category, Color, Size, Texture | 563 | 60 |
Occupation | Duty, Gender Bias, Industry, Work Location | 799 | 50 |
Compared with existing entity-attribute/relation datasets, RAVEL offers two unique features:
- multiple attributes per entity to evaluate how well interpretability methods isolate individual concepts
- 10x more entities per entity type to evaluate how well interpretability methods generalize
Each `entity_type` is associated with five files:
- entity: `ravel_{entity_type}_entity_attributes.json`
- prompt: `ravel_{entity_type}_attribute_to_prompts.json`
- wiki prompt: `wikipedia_{entity_type}_entity_prompts.json`
- entity split: `ravel_{entity_type}_entity_to_split.json`
- prompt split: `ravel_{entity_type}_prompt_to_split.json`
The first three contain all the entities and prompt templates. The last two contain the dataset splits.
The entity file is structured as follows:
```json
{
  "Paris": {
    "Continent": "Europe",
    "Country": "France",
    "Language": "French",
    "Latitude": "49",
    "Longitude": "2",
    "Timezone": "Europe/Paris"
  },
  ...
}
```
The prompt file is structured as follows:
```json
{
  "Country": [
    "%s is a city in the country of",
    ...
  ],
  "Continent": [
    "Los Angeles is a city in the continent of North America. %s is a city in the continent of",
    ...
  ],
  "Latitude": [
    "[{\"city\": \"Bangkok\", \"lat\": \"13.8\"}, {\"city\": \"%s\", \"lat\": \"",
    ...
  ],
  ...
}
```
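For reference, here is a minimal sketch of how the entity and prompt files can be combined into concrete model inputs. The `data/` directory and the use of Python's `%` formatting for the `%s` placeholder are assumptions for illustration; adjust paths to wherever the JSON files live.

```python
import json

# Paths are assumptions; point them at the downloaded RAVEL JSON files.
with open("data/ravel_city_entity_attributes.json") as f:
    entities = json.load(f)
with open("data/ravel_city_attribute_to_prompts.json") as f:
    attribute_to_prompts = json.load(f)

entity = "Paris"
attribute = "Country"

# Each prompt template contains a single %s placeholder for the entity.
prompt = attribute_to_prompts[attribute][0] % entity
target = entities[entity][attribute]

print(prompt)  # e.g., "Paris is a city in the country of"
print(target)  # "France"
```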
We evaluate whether interpretability methods can disentangle related concepts, e.g., can a method find a feature of hidden activations that isolates the continent a city is in from the country that city is in? If so, an intervention on the feature should change the former without changing the latter, as shown in the figure below.
Each interpretability method defines a bijective featurizer over model representations, described in more detail below. The main evaluation logic is implemented in the function `utils.intervention_utils.eval_with_interventions`, and each method implements its own interchange intervention logic in `src/methods`.
A core operation in our evaluation framework is the interchange intervention, which puts models into counterfactual states that let us isolate the causal effects of interest. An interchange intervention involves a pair of examples, referred to as the `base` and the `source`. For each pair, we specify the desired model output under intervention, namely, whether the output should match the attribute value of the base entity or that of the source entity.
Each evaluation example is structured as follows:
```python
{
  'input': 'city to country: Rome is in Italy. Tokyo is in',
  'label': ' Japan',
  'source_input': ' in what is now southern Vancouver',
  'source_label': ' Island',
  'inv_label': ' Canada',
  'split': 'city to country: Rome is in Italy. %s is in',
  'source_split': ' in what is now southern %s',
  'entity': 'Tokyo',
  'source_entity': 'Vancouver'
}
```
The input and label fields are:

- `input`: the base example input
- `source_input`: the source example input
- `inv_label`: the desired output when the intervention should cause the attribute value to change to that of the source entity
- `label`: the desired output when the intervention should isolate the attribute, i.e., have no causal effect on the output
The rest of the fields are used for tracking the intervention locations and aggregating metrics.
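To make the roles of `label` and `inv_label` concrete, here is a rough sketch of the two accuracies behind the disentanglement evaluation. The names `generate_with_intervention`, `cause_examples`, and `isolate_examples` are illustrative assumptions, not the repo's API: the helper is assumed to run the model on the base input while patching in features computed from the source input.

```python
def accuracy(examples, target_field, generate_with_intervention):
    """Fraction of examples whose intervened generation starts with the desired label."""
    hits = 0
    for ex in examples:
        # Hypothetical helper: generate from ex['input'] with features from ex['source_input']
        # interchanged at the intervention location.
        generation = generate_with_intervention(ex['input'], ex['source_input'])
        hits += generation.strip().startswith(ex[target_field].strip())
    return hits / len(examples)

# Cause setting: intervening on the queried attribute should flip the output to the
# source entity's attribute value, stored in 'inv_label'.
cause_score = accuracy(cause_examples, 'inv_label', generate_with_intervention)

# Isolate setting: intervening on features for a different attribute should leave the
# output at the base entity's value, stored in 'label'.
isolate_score = accuracy(isolate_examples, 'label', generate_with_intervention)
```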
A demo on how to create evaluation data using TinyLlama as the target language model. The resulting dataset is used for evaluating the interpretability methods in the Quickstart demo.
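The evaluation data is model-specific: an entity-prompt pair is only informative if the target model actually predicts the correct attribute value for it. Below is a minimal sketch of that kind of filtering with TinyLlama; the checkpoint name, prompt construction, and generation settings are illustrative assumptions rather than the demo's exact procedure.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint name is an assumption; use whichever TinyLlama variant you evaluate.
model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def model_knows(prompt_template, entity, attribute_value, max_new_tokens=8):
    """Keep an (entity, prompt) pair only if the model outputs the correct attribute value."""
    prompt = prompt_template % entity
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    continuation = tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:])
    return attribute_value.strip() in continuation

model_knows("%s is a city in the country of", "Paris", "France")  # ideally True
```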
We have implemented five families of interpretability methods in this repo:
- PCA
- Sparse Autoencoder (SAE)
- Relaxed Linear Adversarial Probing (RLAP)
- Differential Binary Masking (DBM)
- Distributed Alignment Search (DAS)
- Multi-task extensions of DBM and DAS (MDBM and MDAS)
You can find implementations of these methods in the `src/methods` directory.
Check out the demo in the Quickstart!
To evaluate a new interpretability method, one simply needs to convert the method into a bijective featurizer:
- $\mathcal{F}$, a function that takes in model representations and outputs (1) a set of features, e.g., a vector, and (2) a specification of which subset of features localizes the target concept, e.g., a set of indices
- $\mathcal{F}^{-1}$, a function that takes in the set of features produced by $\mathcal{F}$ and outputs the original model representations
The featurizer will then be evaluated with an interchange intervention as follows:
```python
class InterventionWithNewFeaturizer(pv.TrainableIntervention):
    """Intervene in the featurizer output space."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Define a bijective featurizer. A featurizer could be any callable object,
        # such as a subclass of torch.nn.Module.
        self.featurizer = ...
        self.featurizer_inverse = ...
        # Specify which subset of features localize the target concept.
        # For some methods, the intervention dimensions are implicitly defined by
        # the featurizer function.
        self.dimensions_to_intervene = ...

    def forward(self, base, source):
        base_features = self.featurizer(base)
        source_features = self.featurizer(source)
        # Apply the interchange intervention: copy the selected features from the
        # source representation into the base representation.
        base_features[..., self.dimensions_to_intervene] = source_features[..., self.dimensions_to_intervene]
        output = self.featurizer_inverse(base_features)
        return output
```
You can find examples of interventions with featurizers in `src/methods`: `AutoencoderIntervention` is an example that explicitly specifies which subset of features to intervene on, while `LowRankRotatedSpaceIntervention` is an example that implicitly defines the features through its low-rank rotation matrix.
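As an illustration of a bijective featurizer, here is a minimal sketch where $\mathcal{F}$ is an orthogonal rotation and $\mathcal{F}^{-1}$ is simply its transpose. The class, method, and dimension names are illustrative and not the repo's implementation; the hidden size of 2048 is chosen to match TinyLlama.

```python
import torch

class RotationFeaturizer(torch.nn.Module):
    """Bijective featurizer: an orthogonal rotation whose inverse is its transpose."""

    def __init__(self, embed_dim, num_concept_dims):
        super().__init__()
        # Parametrize an orthogonal weight so the featurizer stays bijective during training.
        linear = torch.nn.Linear(embed_dim, embed_dim, bias=False)
        self.rotation = torch.nn.utils.parametrizations.orthogonal(linear)
        # Take the first num_concept_dims feature dimensions to localize the concept.
        self.dimensions_to_intervene = list(range(num_concept_dims))

    def featurize(self, h):
        # F: rotate hidden states into the feature space.
        return h @ self.rotation.weight.T

    def inverse(self, features):
        # F^{-1}: rotate back; for an orthogonal matrix the inverse is the transpose.
        return features @ self.rotation.weight

featurizer = RotationFeaturizer(embed_dim=2048, num_concept_dims=64)
h = torch.randn(1, 2048)
assert torch.allclose(featurizer.inverse(featurizer.featurize(h)), h, atol=1e-4)
```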
If you use our dataset or method implementations, please consider citing the following work. For each interpretability method, please also consider citing the original papers; you can find a list of related work in each method's section of our paper.
```bibtex
@inproceedings{huang-etal-2024-ravel,
    title = "{RAVEL}: Evaluating Interpretability Methods on Disentangling Language Model Representations",
    author = "Huang, Jing and
      Wu, Zhengxuan and
      Potts, Christopher and
      Geva, Mor and
      Geiger, Atticus",
    editor = "Ku, Lun-Wei and
      Martins, Andre and
      Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.470",
    pages = "8669--8687",
}
```
If you use the `pyvene` framework, please also consider citing the following:
```bibtex
@inproceedings{wu-etal-2024-pyvene,
    title = "pyvene: A Library for Understanding and Improving {P}y{T}orch Models via Interventions",
    author = "Wu, Zhengxuan and Geiger, Atticus and Arora, Aryaman and Huang, Jing and Wang, Zheng and Goodman, Noah and Manning, Christopher and Potts, Christopher",
    editor = "Chang, Kai-Wei and Lee, Annie and Rajani, Nazneen",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-demo.16",
    pages = "158--165",
}
```