
RoBERTa Training-Data

This repository contains instructions and utilities for gathering the data used to train the RoBERTa models (according to the paper).

All links and instructions can be found in this README.

The code and documentation in this repository are of low quality. This is more of a quick-and-dirty record of how I gathered the data, kept in case I need to redo it in the future.

NOTE: This is a work in progress and I will add/finish documentation for all datasets in the future.

The final text-files will have the following format:

3 empty lines = doc break
2 empty lines = section break
1 empty line = paragraph break
1 sentence per line
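
To make the format concrete, here is a minimal sketch (not part of this repository's scripts; the file name parse_format.py is made up) of how such a file could be read back into documents, sections, paragraphs and sentences:

# parse_format.py -- illustrative sketch only, not part of this repository.
# Reads a preprocessed file where blank lines encode structure:
# 3 empty lines = doc break, 2 = section break, 1 = paragraph break,
# and every non-empty line is one sentence.

def parse(path):
    docs = [[[[]]]]  # documents -> sections -> paragraphs -> sentences
    blanks = 0
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                blanks += 1
                continue
            if blanks >= 3:
                docs.append([[[]]])      # new document
            elif blanks == 2:
                docs[-1].append([[]])    # new section
            elif blanks == 1:
                docs[-1][-1].append([])  # new paragraph
            blanks = 0
            docs[-1][-1][-1].append(line)
    return docs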

BookCorpus

This dataset is no longer available, and the website that hosts the books has implemented some kind of measure to prevent scraping. A copy of the dataset (already preprocessed) exists. Download it here.

WARNING: This data is all lowercased and is missing paragraph, section, and document breaks.

Unpack the downloaded data.

Use bookscorpus.py to preprocess and split the data. Let's use 4,000,000 lines as validation data and the rest as training data.

NOTE: Preprocessing attempts to remove the tokenization already in place.

Example:

python bookscorpus.py -i /path/to/bookscorpus/books_large_p1.txt /path/to/bookscorpus/books_large_p2.txt -o /path/to/destination_dir --splits 4000000 -1

This creates split1.txt with 4M lines and puts the remaining lines of the input data in split2.txt.

mv /path/to/destination_dir/split1.txt /path/to/valid/bookscorpus.valid.txt
mv /path/to/destination_dir/split2.txt /path/to/train/bookscorpus.train.txt

Preprocessing creates files with the following format:
1 sentence per line
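
Regarding the note above about removing the existing tokenization: as a rough illustration of what such detokenization can look like (this is not the actual logic in bookscorpus.py), something along these lines undoes simple whitespace tokenization:

# detok_sketch.py -- illustrative only; see bookscorpus.py for the real logic.
import re

def detokenize(line):
    # e.g. "it 's over ." -> "it's over."
    line = re.sub(r"\s+([.,!?;:%)\]'])", r"\1", line)               # no space before closing punctuation
    line = re.sub(r"([(\[])\s+", r"\1", line)                       # no space after opening brackets
    line = re.sub(r"\s+(n't|'re|'ve|'ll|'d|'m|'s)\b", r"\1", line)  # re-attach contractions
    return line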

Wikipedia

Get the latest Wikipedia dump.

Clone the repository wikiextractor

python WikiExtractor.py /path/to/enwiki-latest-pages-articles.xml.bz2 -o /path/to/extracted --sections --filter_disambig_pages --min_text_length 200 --bytes 1G

--bytes 1G reduces the number of output files, as each file can then be up to 1 GB.

Preprocess and concatenate the Wikipedia text into a single file using wikipedia.py. Specify the split in the same way as for bookscorpus.py; the only difference is that split sizes are counted in articles, not lines.

In the Wikipedia dump from April 2020 I found 4,899,923 articles after filtering.

Example:

# Preprocess and split into 2 files with 300,000 articles in the first and the remaining in the second.
python wikipedia.py -i /path/to/wikipedia/extracted/ -o /path/to/destination/ --splits 300000 -1

Preprocessing creates files with the following format:
3 empty lines = doc break
2 empty lines = section break
1 empty line = paragraph break
1 sentence per line
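
To sanity-check the article count mentioned above, the <doc ...> markers that WikiExtractor writes can simply be counted. A quick sketch (it assumes the default non-JSON output layout with files named wiki_*; the script name is made up):

# count_articles.py -- quick sanity check, not part of the repository scripts.
import glob
import os
import sys

def count_articles(extracted_dir):
    total = 0
    for path in glob.glob(os.path.join(extracted_dir, '**', 'wiki_*'), recursive=True):
        with open(path, encoding='utf-8') as f:
            total += sum(1 for line in f if line.startswith('<doc '))
    return total

if __name__ == '__main__':
    print(count_articles(sys.argv[1]))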

CC-NEWS

Install the AWS CLI: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html

Clone the repository news-please

Install its requirements:

pip install -r requirements.txt

For me the hurry.filesize package was not installed properly, so I had to install it manually:

pip install hurry.filesize

Change newsplease/examples/commoncrawl.py to have a config that is somewhat similar to this:

############ YOUR CONFIG ############
# download dir for warc files
my_local_download_dir_warc = './cc_download_warc/'
# download dir for articles
my_local_download_dir_article = './cc_download_articles/'
# hosts (if None or empty list, any host is OK)
my_filter_valid_hosts = []  # example: ['elrancaguino.cl']
# start date (if None, any date is OK as start date), as datetime
my_filter_start_date = datetime.datetime(2016, 9, 1)
# end date (if None, any date is OK as end date), as datetime
my_filter_end_date = datetime.datetime(2019, 2, 28)
# if date filtering is strict and news-please could not detect the date of an article, the article will be discarded
my_filter_strict_date = True
# if True, the script checks whether a file has been downloaded already and uses that file instead of downloading
# again. Note that there is no check whether the file has been downloaded completely or is valid!
my_reuse_previously_downloaded_files = True
# continue after error
my_continue_after_error = True
# show the progress of downloading the WARC files
my_show_download_progress = True
# log_level
my_log_level = logging.INFO
# json export style
my_json_export_style = 0  # 0 (minimize), 1 (pretty)
# number of extraction processes
my_number_of_extraction_processes = 28
# if True, the WARC file will be deleted after all articles have been extracted from it
my_delete_warc_after_extraction = True
# if True, will continue extraction from the latest fully downloaded but not fully extracted WARC files and then
# crawling new WARC files. This assumes that the filter criteria have not been changed since the previous run!
my_continue_process = True
############ END YOUR CONFIG #########

Change the callback function to filter out any non-English articles:

def on_valid_article_extracted(article):
    """
    This function will be invoked for each article that was extracted successfully from the archived data and that
    satisfies the filter criteria.
    :param article:
    :return:
    """
    # do whatever you need to do with the article (e.g., save it to disk, store it in ElasticSearch, etc.)

    if article.__dict__.get('language', None) == 'en':
        with open(__get_pretty_filepath(my_local_download_dir_article, article), 'w', encoding='utf-8') as outfile:
            if my_json_export_style == 0:
                json.dump(article.__dict__, outfile, default=str, separators=(',', ':'), ensure_ascii=False)
            elif my_json_export_style == 1:
                json.dump(article.__dict__, outfile, default=str, indent=4, sort_keys=True, ensure_ascii=False)

Run the script (this takes multiple days to complete).

python -m newsplease.examples.commoncrawl
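
The extracted articles end up as one JSON file each under my_local_download_dir_article. As a rough sketch of how they could later be collected into a single text file (assuming the article body sits in the news-please 'maintext' field; sentence splitting and paragraph breaks are left out, and the script name is made up):

# ccnews_to_text.py -- illustrative sketch, not part of the repository.
# Writes one article per block, separated by the 3-empty-line doc break used above.
import glob
import json
import os
import sys

def collect(article_dir, out_path):
    with open(out_path, 'w', encoding='utf-8') as out:
        for path in glob.glob(os.path.join(article_dir, '**', '*.json'), recursive=True):
            with open(path, encoding='utf-8') as f:
                article = json.load(f)
            text = (article.get('maintext') or '').strip()
            if not text:
                continue
            out.write(text + '\n\n\n\n')  # 3 empty lines = doc break

if __name__ == '__main__':
    collect(sys.argv[1], sys.argv[2])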

WEBTEXT

After testing, this data needs some heavy cleaning before it can be used. The results contain multiple languages, include all kinds of paragraphs scraped from sidebars and similar page elements, and some sites are blocked unless cookies are used. I do not have time to fix this; feel free to contact me if you have a better way of using this data.

Download the URL files.
Use the script webtext.py to download and save the content of the URLs.
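
A minimal sketch of the kind of downloading involved (this is only an assumption about what webtext.py does; see the script itself for the actual logic, and note that cleaning the downloaded pages is a separate problem):

# fetch_sketch.py -- rough illustration only; webtext.py is the actual script.
# Downloads each URL from a list and saves the raw response body.
import hashlib
import os
import sys
import requests

def fetch_all(url_file, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    with open(url_file, encoding='utf-8') as f:
        for url in (line.strip() for line in f):
            if not url:
                continue
            name = hashlib.sha1(url.encode('utf-8')).hexdigest() + '.html'
            try:
                resp = requests.get(url, timeout=10)
                resp.raise_for_status()
            except requests.RequestException:
                continue  # skip unreachable or blocked pages
            with open(os.path.join(out_dir, name), 'w', encoding='utf-8') as out:
                out.write(resp.text)

if __name__ == '__main__':
    fetch_all(sys.argv[1], sys.argv[2])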

STORIES

Install gsutil

curl https://sdk.cloud.google.com | bash

Restart your shell.

Download the data:

gsutil cp -R gs://commonsense-reasoning/reproduce/stories_corpus/* /path/to/destination/

The data contains 947,260 docs and 404,265,586 lines.

Process and split using stories.py (same command line args as for wikipedia.py).

# Create two splits, one with 47,000 documents and one with the rest
python stories.py -i /path/to/stories_corpus/ -o /path/to/processed/ -s 47000 -1

Preprocessing creates files with the following format:
3 empty lines = doc break
1 sentence per line
