COVID-19 To Text Corpus

Kaggle has provided an excellent data source for COVID-19, courtesy of AI2. The purpose of this repo is to convert it from the given format into the normal text corpus format, i.e. one document per file, one sentence per line, with a blank line between paragraphs.
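To make the target layout concrete, here is a minimal sketch of writing and reading a single document in that format. The function names `format_document` and `parse_document` are hypothetical and are not part of this repo's scripts; they only illustrate the "one sentence per line, blank line between paragraphs" convention.

```python
from typing import List

Paragraph = List[str]  # one string per sentence


def format_document(paragraphs: List[Paragraph]) -> str:
    """Render paragraphs as one sentence per line, blank line between paragraphs."""
    blocks = ["\n".join(sentences) for sentences in paragraphs if sentences]
    return "\n\n".join(blocks) + "\n"


def parse_document(text: str) -> List[Paragraph]:
    """Read the same layout back into a list of paragraphs."""
    paragraphs: List[Paragraph] = []
    current: Paragraph = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            current.append(line)      # another sentence in the current paragraph
        elif current:
            paragraphs.append(current)  # a blank line closes the paragraph
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs
```

Because the two functions are inverses on well-formed input, a downstream tool can recover sentence and paragraph boundaries without re-tokenizing the text.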

Prerequisites

The following packages need to be installed. I recommend using Chocolatey.

if('Unrestricted' -ne (Get-ExecutionPolicy)) { Set-ExecutionPolicy Bypass -Scope Process -Force }
iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))
refreshenv

choco install 7zip.install -y
choco install python3 -y

Modules

All scripts have been tested on Python 3.8.2. The modules below are needed to run the scripts. The scripts were tested on the noted versions, so YMMV. Note: not all modules are required for all scripts. If this is the first time running the scripts, the modules will need to be installed. They can be installed by navigating to the ~/code folder, then running the commands below.

  • nltk 3.4.5
  • progressbar2 3.47.0

pip install -r requirments.txt
python -c "import nltk;nltk.download('punkt')"
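Since the scripts were only tested against the versions noted above, it can help to compare them with what is actually installed. The following is a rough sketch using the standard library's `importlib.metadata` (available from Python 3.8); the helper names `installed_version` and `compare_versions` are hypothetical and not part of this repo.

```python
from importlib import metadata  # standard library since Python 3.8
from typing import Dict, Optional, Tuple

# Versions this README says the scripts were tested with.
TESTED_VERSIONS = {"nltk": "3.4.5", "progressbar2": "3.47.0"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(tested: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package name to (tested version, installed version or None)."""
    return {name: (version, installed_version(name)) for name, version in tested.items()}


if __name__ == "__main__":
    for name, (tested, found) in compare_versions(TESTED_VERSIONS).items():
        status = "ok" if found == tested else "differs"
        print(f"{name}: tested {tested}, installed {found} ({status})")
```

A "differs" result does not necessarily mean the scripts will fail; it only flags that the environment is not the one they were tested on.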

Steps

The steps below describe how to recreate the text corpus. They assume a particular path structure, but the commands can be modified to target a different directory structure without changing the code. I am using the d:/covid19 directory because my D: drive is big enough to hold everything.

  1. Clone this repo then open a shell to the ~/code directory.
  2. Retrieve the dataset by hand. Click on the download link, saving the file to a known location.
  3. Extract the data in-place with no folder structure.
    • The e switch flattens the extract so the custom code does not need to recursively search the folder structure.
"C:/Program Files/7-Zip/7z.exe" e -od:/covid19/raw "d:/covid19/*.zip"
  4. Extract the metadata. This will create a single metadata.csv containing some useful information. In general this would be used as part of segmentation or as part of a MANOVA.
python extract_metadata.py -in d:/covid19/raw -out d:/covid19/metadata.csv
  5. Convert the raw JSON files into the normal folder corpus format. This will create a text corpus folder at the output location (e.g. ./corpus) containing 2 subfolders, one for the abstract and one for the body. Some of the files provided by Kaggle are not full-text articles, i.e. they have an empty abstract or body. These incomplete files are filtered out of the final folders and noted in error.csv.
python convert_to_corpus.py -in d:/covid19/raw -out d:/covid19/corpus
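The conversion and filtering described in the last step can be sketched roughly as follows. This is not the code in convert_to_corpus.py. It assumes each Kaggle JSON file has `abstract` and `body_text` lists whose entries carry a `text` field, which is an assumption about the dataset layout rather than something stated in this README. The regex sentence splitter is a naive stand-in for NLTK's punkt tokenizer, and all function names are hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple

Paragraph = List[str]  # one string per sentence


def split_sentences(text: str) -> Paragraph:
    """Naive stand-in for a real sentence tokenizer such as NLTK's punkt."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def to_paragraphs(blocks: List[Dict]) -> List[Paragraph]:
    """Turn a list of {"text": ...} blocks into non-empty paragraphs."""
    paragraphs = [split_sentences(block.get("text", "")) for block in blocks]
    return [p for p in paragraphs if p]


def convert_paper(paper: Dict) -> Optional[Tuple[List[Paragraph], List[Paragraph]]]:
    """Return (abstract, body) paragraphs, or None when either part is empty."""
    # Field names below are assumed, not confirmed by this README.
    abstract = to_paragraphs(paper.get("abstract", []))
    body = to_paragraphs(paper.get("body_text", []))
    if not abstract or not body:
        return None  # incomplete paper: a caller would log it instead of writing it
    return abstract, body
```

A caller would write the two returned parts to the abstract and body subfolders, and record the file name in an error log whenever `convert_paper` returns `None`.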