This repository contains a large dataset of news articles scraped from various Pakistani news websites. The dataset covers diverse categories including:
- Politics
- Sports
- Fashion & Style
- International News
- Domestic Affairs
- Science & Technology
We evaluated several large language models (LLMs) for generating question-answer pairs from the scraped news articles:
- Llama2: Generates high-quality question-answer pairs but is relatively slow.
- T5-small: Fast but less accurate, often producing duplicate question-answer pairs.
- GPT-3.5 Turbo and GPT-4: Effective for generating high-quality question-answer pairs efficiently.
Our case study revealed that while Llama2 offers the best quality, it is slower than the GPT models. T5-small, though fast, has limitations in accuracy and produces duplicates. Consequently, we used GPT-3.5 Turbo and GPT-4 to generate a more substantial dataset.
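To make the GPT-based generation step concrete, here is a minimal sketch of how QA pairs could be requested from a chat model and parsed. The prompt wording, the `Q:`/`A:` output format, and both helper functions are assumptions for illustration, not the repo's exact implementation.

```python
def build_messages(article: str, n_pairs: int = 3) -> list:
    """Build a chat prompt (hypothetical wording) asking for QA pairs from an article."""
    prompt = (
        f"Generate {n_pairs} question-answer pairs from the news article below.\n"
        "Format each pair as:\nQ: <question>\nA: <answer>\n\n"
        f"Article:\n{article}"
    )
    return [{"role": "user", "content": prompt}]

def parse_qa_pairs(text: str) -> list:
    """Parse 'Q:/A:'-formatted model output into (question, answer) tuples."""
    pairs, question = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None
    return pairs

# Illustrative model reply (not a real API response):
reply = "Q: Who won the match?\nA: The home team.\nQ: Where was it played?\nA: Lahore."
print(parse_qa_pairs(reply))
```

The `messages` list produced by `build_messages` matches the shape expected by chat-completion style APIs such as GPT-3.5 Turbo's; the actual request/response handling is omitted here.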
This dataset is open-source and can be used for:
- Fine-tuning LLMs
- Evaluating model performance
Additionally, we have fine-tuned TinyLlama on this dataset.
Fig. QA generation samples using LLaMA2 and T5-small

Fig. QA generation samples using GPT-3.5-Turbo and GPT-4; both models generate the desired responses.

Fig. Gradio demo using T5-small
```bash
git clone https://github.com/faizan1234567/QALLM.git
cd QALLM
```
Create a virtual environment using Python's venv:
```bash
python3 -m venv qa_llm
source qa_llm/bin/activate
```
Alternatively, you can use the Anaconda package manager:
```bash
conda create -n qa_llm python=3.8.10 -y
conda activate qa_llm
```
Now install all the required dependencies:
```bash
pip install --upgrade pip
pip install -r requirements.txt
```
For QA generation, make sure to read and understand the config files and replace values as required. First, create the dataset in Alpaca format:
```bash
python create_alpaca_format_dataset.py --chunk_size 5000 --dataset <path>
```
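The `--chunk_size` flag suggests long articles are split into fixed-size chunks before QA generation, and the Alpaca format uses the standard `instruction`/`input`/`output` fields. The helper and sample record below are illustrative assumptions; the exact chunking strategy and field contents produced by `create_alpaca_format_dataset.py` may differ.

```python
import json

def chunk_text(text: str, chunk_size: int = 5000) -> list:
    """Split a long article into fixed-size character chunks (assumed strategy)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# A typical Alpaca-format record; the wording is illustrative only.
record = {
    "instruction": "Answer the question using the news article.",
    "input": "Who inaugurated the new stadium?",
    "output": "The chief minister inaugurated the stadium on Friday.",
}

print(len(chunk_text("a" * 12000)))  # 12000 chars with chunk_size 5000 -> 3 chunks
print(json.dumps(record, indent=2))
```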
Then run QA generation:
```bash
python qa_generator.py --model T5-small --cfg cfg/qa_generator.yaml
```
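Since T5-small often emits duplicate question-answer pairs, a simple post-processing step can drop the repeats. The helper below is a sketch of one possible deduplication approach, not necessarily what `qa_generator.py` implements.

```python
def dedup_qa_pairs(pairs):
    """Drop duplicate (question, answer) pairs, ignoring case and surrounding
    whitespace, keeping the first occurrence of each pair."""
    seen, unique = set(), []
    for q, a in pairs:
        key = (q.strip().lower(), a.strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append((q, a))
    return unique

pairs = [
    ("Who won?", "Pakistan."),
    ("Who won?", "Pakistan."),   # exact duplicate
    ("who won?", "pakistan."),   # case-only duplicate
    ("Where?", "Karachi."),
]
print(dedup_qa_pairs(pairs))  # -> [('Who won?', 'Pakistan.'), ('Where?', 'Karachi.')]
```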
There is also a run_qa_llm_repo.ipynb notebook under the notebooks directory to install and run QA generation on Google Colab, Kaggle, Gradient, or a local machine with a GPU.
If you find the dataset useful for fine-tuning, research, or development purposes, please star and cite the repo:
Muhammad Faizan and Sana Zafar
```bibtex
@misc{QALLM,
  title = {NewsQA: News Dataset for QA Generation},
  author = {Muhammad Faizan and Sana Zafar},
  howpublished = {\url{https://github.com/faizan1234567/QALLM}},
  year = {2024}
}
```
- QA dataset generation using Llama2 and T5-small
- QA dataset generation using GPT-3.5 Turbo and GPT-4
- Scraping news articles from Pakistan-based news channels
- Creating a large fine-tuning dataset in Alpaca format
- Add installation / virtual environment instructions
- Fine-tuning TinyLlama, Mistral, and Llama3 on the generated dataset
- Evaluation
- Complete chatbot for QA generation