
Demystifying Verbatim Memorization in Large Language Models

🚧 Work in Progress 🚧

Verbatim memorization refers to LLMs outputting long sequences of text that exactly match their training examples. In our work, we show that verbatim memorization is intertwined with the LM's general capabilities and is therefore difficult to isolate and suppress without degrading model quality.

This repo contains:

  • A framework to study verbatim memorization in a controlled setting by continuing pre-training from LLM checkpoints with injected sequences.
  • Scripts using causal interventions to analyze how verbatim memorized sequences are encoded in the model representations.
  • Stress-testing evaluations for unlearning methods that aim to remove verbatim memorized information.

Data

The data directory contains the following datasets:

  • Pile data: 1M sequences sampled from the Pile, along with continuations generated by the pythia-6.9b-deduped model.
  • Sequence injection data: 100 sequences sampled from Internet content published after Dec 2020.
  • Stress testing data: 140K perturbed prefixes to evaluate whether unlearning methods truly remove the verbatim memorized information.
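These datasets can be used to test for verbatim memorization directly. Below is a minimal sketch, assuming a hypothetical JSONL layout with "prefix" and "continuation" fields (the actual file names and format in the data directory may differ): greedily decode from a prefix and check whether the output reproduces the reference continuation.

```python
# Minimal verbatim-memorization check: does greedy decoding from a prefix
# reproduce the reference continuation exactly?
# The file path and JSON fields below are assumptions, not the repo's actual format.
import json

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-6.9b-deduped"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

def is_verbatim_memorized(prefix: str, continuation: str, num_tokens: int = 32) -> bool:
    """True if greedy decoding from the prefix reproduces the continuation verbatim."""
    inputs = tokenizer(prefix, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=num_tokens, do_sample=False)
    generated = tokenizer.decode(output[0, inputs["input_ids"].shape[1]:])
    reference = tokenizer.decode(tokenizer(continuation)["input_ids"][:num_tokens])
    return generated.startswith(reference)

# Hypothetical layout: one JSON object per line with "prefix" and "continuation".
with open("data/pile_sequences.jsonl") as f:
    example = json.loads(f.readline())
print(is_verbatim_memorized(example["prefix"], example["continuation"]))
```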

Experiment

Training with the Sequence Injection Framework

The pre-training data can be generated by the batch_viewer script, which allows you to extract Pythia training data between two given training steps.
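For orientation, the sketch below shows the index arithmetic behind this extraction: Pythia trains with a fixed global batch size, so the samples seen between two steps occupy a contiguous index range in the shuffled training corpus. The batch size and sequence length are the published Pythia training values; refer to the Pythia repository for the actual batch_viewer interface.

```python
# Index arithmetic behind extracting Pythia training data between two steps.
# Pythia uses a fixed global batch size, so steps map to contiguous sample ranges.
# Constants are the published Pythia training values; adjust if your setup differs.
BATCH_SIZE = 1024  # sequences consumed per training step
SEQ_LEN = 2048     # tokens per sequence

def sample_range(start_step: int, end_step: int) -> range:
    """Indices of the training sequences consumed in steps [start_step, end_step)."""
    return range(start_step * BATCH_SIZE, end_step * BATCH_SIZE)

def token_budget(start_step: int, end_step: int) -> int:
    """Total number of training tokens covered by that step interval."""
    return (end_step - start_step) * BATCH_SIZE * SEQ_LEN

# Example: the data seen between checkpoint steps 80000 and 81000.
print(len(sample_range(80_000, 81_000)), "sequences;", token_budget(80_000, 81_000), "tokens")
```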

The training script is at scripts/train_with_injection.py. For the single-shot verbatim memorization experiment, the training script is at scripts/train_with_injection_single_shot.py.
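The core idea, sketched below, is to resume training from an intermediate Pythia checkpoint and periodically substitute an injected sequence into the pre-training batch. The checkpoint, learning rate, and injection frequency in the sketch are placeholder assumptions, not the values used by the scripts.

```python
# Illustrative sequence-injection loop: resume from an intermediate Pythia
# checkpoint and periodically replace one batch example with the injected sequence.
# The checkpoint, learning rate, and injection frequency are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-160m-deduped"  # a small model, for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, revision="step80000")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

injected_ids = tokenizer("An injected sequence to be memorized.", return_tensors="pt")["input_ids"]

def training_step(batch_ids: torch.Tensor, step: int, inject_every: int = 100) -> float:
    """One causal-LM step; every `inject_every` steps, swap in the injected sequence."""
    if step % inject_every == 0:
        # Replace the first example (assumes the injected sequence is padded/truncated
        # to the batch's sequence length).
        batch_ids = torch.cat([injected_ids, batch_ids[1:]], dim=0)
    loss = model(input_ids=batch_ids, labels=batch_ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```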

Analyzing Causal Dependencies Between the Trigger and Verbatim Memorized Tokens

We use causal interventions to analyze the causal dependencies between the trigger and verbatim memorized tokens. You can find the script for causal dependency analysis on Colab:

Open In Colab

Below is an example of a sequence verbatim memorized by pythia-6.9b-deduped: the first sentence of the book Harry Potter and the Philosopher's Stone. The trigger sequence is "Mr and Mrs Dursley, of", i.e., the model can generate the full sentence given only the trigger. Yet, not all generated tokens are actually causally dependent on the trigger; e.g., the prediction of the token "you" depends only on representations of the token "thank".

Figure: Causal dependencies between the trigger and verbatim memorized tokens.
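The notebook above gives the exact procedure; as a rough illustration of the technique, the activation-patching sketch below swaps one hidden state from a run on a perturbed prefix into the run on the memorized prefix and reads off the predicted next token. The layer and position indexing for GPT-NeoX-style models is an assumption of this sketch, not the notebook's implementation.

```python
# A generic activation-patching sketch: overwrite the hidden state at one
# (layer, position) in the clean run with the corresponding state from a run on a
# perturbed prefix, then read off the predicted next token. This illustrates the
# intervention technique; it is not the exact procedure in the Colab notebook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-6.9b-deduped"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

@torch.no_grad()
def patched_next_token(clean: str, perturbed: str, layer: int, position: int) -> str:
    clean_ids = tokenizer(clean, return_tensors="pt").to(model.device)
    perturbed_ids = tokenizer(perturbed, return_tensors="pt").to(model.device)
    # Assumes both prefixes tokenize to at least `position + 1` tokens.

    # Hidden state after block `layer` in the perturbed run (index 0 is the embeddings).
    source = model(**perturbed_ids, output_hidden_states=True).hidden_states[layer + 1][:, position]

    def hook(module, inputs, output):
        hidden = output[0].clone()
        hidden[:, position] = source  # overwrite the clean-run activation
        return (hidden,) + output[1:]

    handle = model.gpt_neox.layers[layer].register_forward_hook(hook)
    try:
        logits = model(**clean_ids).logits
    finally:
        handle.remove()
    return tokenizer.decode(logits[0, -1].argmax())
```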

Stress Testing Unlearning Methods

The evaluation scripts, including the script for generating perturbed prefixes, are available on Colab:

Open In Colab
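As a rough illustration of what the stress test measures, the sketch below perturbs a memorized prefix and checks whether the model still regenerates the memorized continuation. The token-dropping perturbation and match criterion here are assumed stand-ins; the actual 140K perturbed prefixes and evaluation come from the notebook above.

```python
# Stress-test sketch: perturb the memorized prefix (here by dropping single tokens)
# and check whether the model still emits the memorized continuation.
# The perturbation scheme and match criterion are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-6.9b-deduped"  # replace with the unlearned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

def perturbed_prefixes(prefix: str) -> list[str]:
    """Drop each token of the prefix in turn to create perturbed prompts."""
    ids = tokenizer(prefix)["input_ids"]
    return [tokenizer.decode(ids[:i] + ids[i + 1:]) for i in range(len(ids))]

def still_leaks(prefix: str, memorized_continuation: str, num_tokens: int = 16) -> bool:
    """True if any perturbed prefix still elicits the memorized continuation."""
    for prompt in perturbed_prefixes(prefix):
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=num_tokens, do_sample=False)
        completion = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:])
        if memorized_continuation.strip() in completion:
            return True
    return False
```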

Citation

If you find this repo helpful, please consider citing our work:

@misc{huang2024demystifying,
      title={Demystifying Verbatim Memorization in Large Language Models}, 
      author={Jing Huang and Diyi Yang and Christopher Potts},
      year={2024},
      eprint={2407.17817},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.17817}, 
}
