We provide a codebase for "DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models", accepted at ICLR 2024. DataInf is an efficient influence-approximation method that is practical for large-scale generative AI models such as LLMs and diffusion models. It leverages an easy-to-compute closed-form expression, outperforming existing influence-computation algorithms in both computational and memory efficiency.
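The key idea in the paper is to swap the order of matrix inversion and averaging in the damped inverse-Hessian-vector product, so that each rank-one term can be inverted in closed form via the Sherman-Morrison formula. Below is a minimal NumPy sketch of that closed-form expression; the function and variable names are illustrative, not the repo's actual API.

```python
# Minimal sketch of DataInf's closed-form inverse-Hessian-vector product
# (illustrative names, not the repo's API). For per-sample layer gradients
# g_1, ..., g_n and damping lam, DataInf approximates
#   (lam*I + (1/n) * sum_i g_i g_i^T)^{-1} v
# by (1/n) * sum_i (lam*I + g_i g_i^T)^{-1} v, where Sherman-Morrison gives
#   (lam*I + g_i g_i^T)^{-1} v = (v - (g_i^T v) / (lam + ||g_i||^2) * g_i) / lam.
import numpy as np

def datainf_ihvp(grads: np.ndarray, v: np.ndarray, lam: float) -> np.ndarray:
    """grads: (n, d) per-sample gradients; v: (d,) query gradient."""
    coeffs = grads @ v / (lam + np.sum(grads**2, axis=1))  # (n,) Sherman-Morrison weights
    return (v - coeffs @ grads / grads.shape[0]) / lam     # O(n*d); no d x d inverse
```

Because no d-by-d matrix is ever formed or inverted, the cost scales linearly in the number of parameters per layer, which is the source of the efficiency gains.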
An easy-to-start Jupyter notebook, `notebooks/Mislabeled_Data_Detection-RoBERTa-MRPC.ipynb`, demonstrates how to compute influence function values and how to detect mislabeled data points using the computed values.
- We use the RoBERTa-large model with LoRA, a parameter-efficient fine-tuning technique, to significantly reduce the number of trainable parameters.
- We consider a noisy version of the GLUE-MRPC dataset: we synthetically generate mislabeled data points by flipping the labels of a randomly selected 20% of the data points (a minimal sketch of this setup is given below).
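The following sketch illustrates the synthetic mislabeling step; it is illustrative, not the notebook's exact code.

```python
# Illustrative sketch of the noisy-label setup (not the notebook's exact code):
# flip the binary labels of a randomly selected 20% of the data points.
import numpy as np

def flip_labels(labels: np.ndarray, noise_ratio: float = 0.2, seed: int = 0):
    rng = np.random.default_rng(seed)
    n_flip = int(noise_ratio * len(labels))
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)
    noisy = labels.copy()
    noisy[flip_idx] = 1 - noisy[flip_idx]  # MRPC labels are binary (0/1)
    return noisy, flip_idx                 # flip_idx is the ground truth for evaluation
```

Data points with the most extreme influence values are flagged as candidate mislabeled points, and `flip_idx` serves as the ground truth for measuring detection quality.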
A second Jupyter notebook, `notebooks/Influential_Data_Identification-Llama2-Math-Reason.ipynb`, demonstrates how to efficiently compute influence function values and how to use them to identify the most influential data points. We use the llama-2-13b-chat model. The workflow has the following steps.
- Step 1 Dataset generation: generate the `math_problem` (with reasoning) dataset with the following bash command. It will be stored in the `datasets` folder; the command also generates the `sentence_transformation` and `math_problem` (without reasoning) datasets.

  ```bash
  python3 src/generate_sentence-math_datasets.py
  ```
- Step 2 Fine-tune a model: fine-tune a llama-2-13b-chat model on the `math_problem` (with reasoning) dataset. We use `src/sft_trainer.py`, which is built on HuggingFace's SFTTrainer. A sample CLI is given as follows.

  ```bash
  python /YOUR-DATAINF-PATH/DataInf/src/sft_trainer.py \
      --model_name /YOUR-LLAMA-PATH/llama/models_hf/llama-2-13b-chat \
      --dataset_name /YOUR-DATAINF-PATH/DataInf/datasets/math_with_reason_train.hf \
      --output_dir /YOUR-DATAINF-PATH/DataInf/models/math_with_reason_13bf \
      --dataset_text_field text \
      --load_in_8bit \
      --use_peft
  ```
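For orientation, the sketch below shows roughly what such an SFT script does with HuggingFace TRL and PEFT. It is an assumption-laden outline, not the contents of `src/sft_trainer.py`, and exact argument names vary across TRL versions.

```python
# Rough outline of LoRA supervised fine-tuning with TRL/PEFT (illustrative;
# not the actual src/sft_trainer.py, and TRL's API changes across versions).
from datasets import load_from_disk
from peft import LoraConfig
from transformers import AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer

model = AutoModelForCausalLM.from_pretrained(
    "/YOUR-LLAMA-PATH/llama/models_hf/llama-2-13b-chat",
    load_in_8bit=True,   # corresponds to the --load_in_8bit flag above
    device_map="auto",
)
peft_config = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")  # assumed LoRA setup

trainer = SFTTrainer(
    model=model,
    train_dataset=load_from_disk("/YOUR-DATAINF-PATH/DataInf/datasets/math_with_reason_train.hf"),
    dataset_text_field="text",  # corresponds to --dataset_text_field
    peft_config=peft_config,    # corresponds to --use_peft
    args=TrainingArguments(output_dir="/YOUR-DATAINF-PATH/DataInf/models/math_with_reason_13bf"),
)
trainer.train()
```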
- Step 3 Compute the gradients and influence function values: this step relies on the source files described below (a minimal scoring sketch follows the list).
- `dataloader.py` constructs the tokenizers and generates the noisy datasets.
- `lora_model.py` includes the LoRA modules.
- `influence.py` includes the influence computation algorithms.
- `generate_sentence-math_datasets.py` generates the `sentence_transformation` and `math_problem` datasets.
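To make Step 3 concrete, here is a self-contained sketch of how flattened per-sample LoRA gradients can be turned into influence scores with the closed form from the top of this README. The names are hypothetical; the actual implementation lives in `influence.py`.

```python
# Hypothetical end-to-end scoring sketch (the real logic lives in influence.py).
# Given flattened per-sample LoRA gradients for training and validation data,
# score each training point i by -v^T H^{-1} g_i using DataInf's closed form.
import numpy as np

def datainf_scores(train_grads: np.ndarray, val_grads: np.ndarray, lam: float = 0.1):
    """train_grads: (n, d); val_grads: (m, d). Returns one score per train point."""
    n = train_grads.shape[0]
    v = val_grads.mean(axis=0)                                   # aggregate query gradient
    coeffs = train_grads @ v / (lam + np.sum(train_grads**2, axis=1))
    ihvp = (v - coeffs @ train_grads / n) / lam                  # approx. H^{-1} v
    return -(train_grads @ ihvp)  # sign follows the usual influence-function convention
```

Sorting training points by these scores surfaces the most (and least) influential examples for the validation task.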
We also provide a CLI tool. The following command computes the influence function values for the GLUE-QNLI dataset, using the RoBERTa-large model with the LoRA rank set to 8.

```bash
python3 launcher.py run --exp_id='config_qnli4' --run-id=0 --runpath='./'
```
If you find the library or the paper useful, please cite us!
```bibtex
@article{kwon2023datainf,
  title={DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models},
  author={Kwon, Yongchan and Wu, Eric and Wu, Kevin and Zou, James},
  journal={arXiv preprint arXiv:2310.00902},
  year={2023}
}
```