AutoData

An Automated Framework to Construct Datasets for Assessing Knowledge Editing or Multi-Hop Reasoning Capability of Language Models.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Overview
  4. Contributing

About The Project

The data stored in language models (LMs) quickly becomes obsolete, and retraining these models from the ground up is often infeasible. Recently, various methods (e.g., SERAC, IKE, MEND, KE, ROME, MEMIT, FT-L) have been developed to inject new knowledge into LMs.

Current methods mostly perform well when editing single atomic facts, but they fail catastrophically when tested on the ripple effects of the edited knowledge. For example, if we edit a model to state that the current President of the USA is Trump, then the answer to "Who is the spouse of the current President of the USA?" should also change accordingly. While many datasets for evaluating knowledge editing of LMs exist, they predominantly focus on facts from Wikidata, primarily relating to people and events.
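As an illustration, a single entry in such a dataset might pair an edited fact with a two-hop question whose answer chains through that edit. The field names below are hypothetical, not AutoData's actual schema:

```python
# Hypothetical dataset entry illustrating the ripple effect of one edit.
# Field names are illustrative only, not AutoData's actual schema.
entry = {
    "edit": {
        "subject": "USA",
        "relation": "head of state",
        "new_object": "Donald Trump",
    },
    # The answer to this question changes because it chains through the edit.
    "two_hop_question": "Who is the spouse of the current President of the USA?",
    "answer_after_edit": "Melania Trump",
}

print(entry["two_hop_question"])
```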

In other words, the data in these datasets is homogeneous and lacks diversity. Moreover, this type of dataset construction pipeline almost inevitably involves manual annotation and crowdsourcing, incurring significant time and monetary costs. I therefore implemented AutoData, a framework that automatically constructs datasets containing various types of data according to specific needs.

Getting Started

Prerequisites

You should have at least one API key for a large language model, preferably from OpenAI.
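The key is typically supplied through an environment variable. The snippet below assumes the standard OPENAI_API_KEY variable read by OpenAI's clients; replace the placeholder with a real key before running the framework:

```python
import os

# Supply your OpenAI API key via the standard environment variable.
# "sk-your-key-here" is a placeholder, not a working key.
os.environ.setdefault("OPENAI_API_KEY", "sk-your-key-here")

print("OPENAI_API_KEY" in os.environ)  # → True
```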

Pip Installation

git clone https://github.com/Leo-Lsc/AutoData.git
cd AutoData
conda create -n AutoData python=3.11.8
conda activate AutoData
pip install -r requirements.txt

Overview

AutoData is a framework that uses the LangChain library and OpenAI's API to automatically construct customized datasets. AutoData consists of five modules: SubjectGenerator, QA_Generator, TripleExtractor, Interrupter, and TwoHopQuestionGenerator.
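The module names suggest a pipeline: generate subjects, write QA pairs about them, extract (subject, relation, object) triples, interrupt (edit) a fact, and finally compose two-hop questions that depend on the edit. The sketch below illustrates that plausible flow with stub classes; the class and method names mirror the module names but are assumptions, not AutoData's actual API:

```python
# Illustrative sketch of how AutoData's five modules might chain together.
# All classes below are stubs standing in for LLM-backed components;
# AutoData's real interfaces may differ.

class SubjectGenerator:
    def generate(self, domain):
        # Would prompt an LLM to propose subjects in the given domain.
        return [f"{domain} subject {i}" for i in range(2)]

class QAGenerator:
    def generate(self, subject):
        # Would prompt an LLM to write a QA pair about the subject.
        return {"question": f"What is notable about {subject}?",
                "answer": f"A fact about {subject}."}

class TripleExtractor:
    def extract(self, qa):
        # Would parse the QA pair into a (subject, relation, object) triple.
        return ("subject", "relation", qa["answer"])

class Interrupter:
    def interrupt(self, triple):
        # Would replace the object with a counterfactual to create an edit.
        s, r, _ = triple
        return (s, r, "edited object")

class TwoHopQuestionGenerator:
    def generate(self, edited_triple):
        # Would compose a question whose answer chains through the edit.
        s, r, o = edited_triple
        return f"Given that {s} {r} {o}, what follows?"

def build_dataset(domain):
    dataset = []
    for subject in SubjectGenerator().generate(domain):
        qa = QAGenerator().generate(subject)
        triple = TripleExtractor().extract(qa)
        edited = Interrupter().interrupt(triple)
        dataset.append({"qa": qa, "edit": edited,
                        "two_hop": TwoHopQuestionGenerator().generate(edited)})
    return dataset

print(len(build_dataset("chemistry")))  # → 2
```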

Contributing

If you have a suggestion that would make this better, please fork the repo and create a pull request. Any contributions you make are greatly appreciated. Don't forget to give the project a star! Thanks!

Contributors

Leo-Lsc

Citation

Please use the following citation if you intend to use AutoData:

@misc{AutoDataFramework,
  title        = {AutoData},
  author       = {Sicheng Lai},
  year         = {2024},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/Leo-Lsc/AutoData}},
}
