An Automated Framework to Construct Datasets for Assessing Knowledge Editing or Multi-Hop Reasoning Capability of Language Models.
The data stored in language models (LMs) quickly becomes obsolete, and retraining these models from the ground up is often not feasible. Recently, various methods (e.g. SERAC, IKE, MEND, KE, ROME, MEMIT, FT-L) have been developed to inject new knowledge.
Current methods mostly perform well when editing single atomic facts, but they fail catastrophically when tested on the ripple effects of the edited knowledge. For example, if we edit a model to state that the current President of the USA is Trump, then the answer to "Who is married to the current President of the USA?" should also change accordingly. Many datasets for evaluating knowledge editing of LMs exist, but they predominantly draw on facts from Wikidata, primarily relating to people and events.
In other words, the data in these datasets is homogeneous and lacks diversity. In addition, constructing such datasets usually involves manual annotation and crowdsourcing, which incurs significant time and financial costs. Therefore, I implemented a framework, AutoData, that can automatically construct datasets containing various types of data based on specific needs.
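The ripple effect described above can be illustrated with a toy knowledge graph. This is a minimal sketch, not part of AutoData: the `facts` dictionary, the `two_hop` helper, and the relation names are all assumptions made for illustration. It shows that a single-fact edit should also change the answer to a two-hop question that passes through the edited fact.

```python
# Toy knowledge graph (illustrative, not AutoData data): (subject, relation) -> object.
facts = {
    ("USA", "president"): "Joe Biden",
    ("Joe Biden", "spouse"): "Jill Biden",
    ("Donald Trump", "spouse"): "Melania Trump",
}


def two_hop(kg, subject, rel1, rel2):
    """Answer a two-hop question by chaining two single-hop lookups."""
    bridge = kg[(subject, rel1)]  # first hop, e.g. the president of the USA
    return kg[(bridge, rel2)]     # second hop, e.g. that person's spouse


# "Who is married to the current President of the USA?" before the edit.
before = two_hop(facts, "USA", "president", "spouse")

# Apply a single-fact edit on a copy so the original graph stays intact.
edited = dict(facts)
edited[("USA", "president")] = "Donald Trump"

# The same two-hop question after the edit.
after = two_hop(edited, "USA", "president", "spouse")

print(before, "->", after)
```

An editing method that updates only the single fact, without propagating it to the second hop, would still return the pre-edit answer to the two-hop question.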
You need at least one API key for a large language model provider, preferably OpenAI.
git clone https://github.com/Leo-Lsc/AutoData.git
conda create -n AutoData python=3.11.8
conda activate AutoData
cd AutoData
pip install -r requirements.txt
AutoData is a framework that uses the LangChain library and OpenAI's API to automatically construct customized datasets. AutoData consists of five modules: SubjectGenerator, QA_Generator, TripleExtractor, Interrupter and TwoHopQuestionGenerator.
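The README does not describe the modules' interfaces, so the following is only a guess at the data flow suggested by the five module names. Every function name, prompt string, and dictionary key here is hypothetical, the `fake_llm` stub replaces any real LangChain or OpenAI call, and the role given to the Interrupter (swapping the triple's object for a counterfactual one) is an assumption.

```python
from typing import Callable, Dict

LLM = Callable[[str], str]  # any callable mapping a prompt to a text reply


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: canned replies keyed on the prompt prefix."""
    canned = {
        "SUBJECTS": "Eiffel Tower",
        "QA": "Where is the Eiffel Tower located?|Paris",
        "TRIPLE": "Eiffel Tower|located in|Paris",
        "INTERRUPT": "Rome",
        "TWO_HOP": "Which country's capital is the city where the Eiffel Tower is located?",
    }
    return canned[prompt.split(":", 1)[0]]


def build_example(llm: LLM, topic: str) -> Dict[str, object]:
    """Chain five hypothetical stages, one per module name in the README."""
    # 1. SubjectGenerator (assumed): propose a subject for the requested topic.
    subject = llm(f"SUBJECTS: name one subject about {topic}")
    # 2. QA_Generator (assumed): write a question/answer pair about the subject.
    question, answer = llm(f"QA: write a question and answer about {subject}").split("|")
    # 3. TripleExtractor (assumed): turn the QA pair into (subject, relation, object).
    s, r, o = llm(f"TRIPLE: extract a triple from '{question}' / '{answer}'").split("|")
    # 4. Interrupter (assumed): replace the object with a counterfactual one.
    new_o = llm(f"INTERRUPT: give a counterfactual replacement for '{o}'")
    # 5. TwoHopQuestionGenerator (assumed): ask a question that depends on the edit.
    two_hop_q = llm(f"TWO_HOP: write a two-hop question through ({s}, {r}, {new_o})")
    return {
        "question": question,
        "answer": answer,
        "triple": (s, r, o),
        "edited_triple": (s, r, new_o),
        "two_hop_question": two_hop_q,
    }


example = build_example(fake_llm, "landmarks")
print(example)
```

Passing the model in as a plain callable keeps the sketch runnable without an API key; a real run would substitute a function that calls the configured LLM.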
If you have a suggestion that would make this better, please fork the repo and create a pull request. Any contributions you make are greatly appreciated. Don't forget to give the project a star! Thanks!
Leo-Lsc
Please use the following citation if you intend to use AutoData:
@misc{AutoDataFramework,
title={AutoData},
author={Sicheng Lai},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/Leo-Lsc/AutoData}},
}