Supported Input Formats

Gus Hahn-Powell edited this page Oct 13, 2016 · 6 revisions

Input formats

Reach can read papers in .nxml, .txt, or .tsv format.

.nxml files

Nearly all of the papers in the Open Access subset of PubMed can be retrieved as .nxml.

As an example, we'll retrieve the .nxml files for PMC1234 and PMC1235 using a Python script.

wget https://gist.githubusercontent.com/myedibleenso/f233359445461a71ad37017393fe921f/raw/982275ad8d5070e8c0bc5c07edcfec1cd804c611/fetch_nxml.py

python fetch_nxml.py --pmcids PMC1234 PMC1235

.tsv files

The template for .tsv files can be retrieved with the following command:

wget https://gist.githubusercontent.com/myedibleenso/fb1f858a5664e12ff0448f4468b60842/raw/4eab1991eae4c89b1d5dffcb8c317bcd2f3cadd1/input-template.tsv

The first three columns of the .tsv files should be 1) the paper's name, 2) the name of the section, and 3) the text for that section.

NOTE: Include a header row in each .tsv file. By default, the system drops the first row of the file when reading, since it expects that row to be the header.
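
As a rough illustration of the layout described above, the following Python sketch builds a .tsv string with a header row followed by one row per section. The header labels (paper, section, text) and the function name to_tsv are placeholders chosen for this example and are not taken from the template, so check the downloaded template for the actual header. The sketch also assumes that tabs and line breaks inside a field should be collapsed to single spaces so that each section stays on one row.

```python
# Hypothetical header labels; the real template may use different names.
HEADER = ["paper", "section", "text"]


def to_tsv(rows):
    """Return tab-separated text: a header row, then one row per section."""
    lines = ["\t".join(HEADER)]
    for paper, section, text in rows:
        # Collapse tabs and line breaks so every record stays on a single row.
        fields = [" ".join(str(field).split()) for field in (paper, section, text)]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


rows = [
    ("PMC1234", "abstract", "Example abstract text."),
    ("PMC1234", "results", "Example results\ttext\nspanning lines."),
]
tsv_output = to_tsv(rows)
```

Writing tsv_output to a file with a .tsv extension (opened with encoding="utf-8") gives a file whose first row is the header that the reader is expected to drop.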

.txt files

You can simply dump the raw text that Reach should read into a .txt file. Note that Reach will not attempt extensive preprocessing of such files, so you are advised to perform your own cleanup of the file first (removing LaTeX, acknowledgements, references, etc.).
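
The kind of cleanup suggested above could be sketched in Python as follows. This is a minimal, regex-based illustration and not part of Reach: the function name clean_text is hypothetical, and the sketch assumes that the acknowledgements and references appear at the end of the paper under a heading on its own line, and that LaTeX markup is limited to simple commands such as \textbf{...} and \cite{...}. Real papers will likely need more careful handling.

```python
import re


def clean_text(raw):
    """Rough cleanup of raw paper text (illustrative only)."""
    # Keep only the text before the first "Acknowledgements" or "References"
    # heading that sits on a line of its own (assumes both come last).
    text = re.split(
        r"(?im)^\s*(?:references|acknowledge?ments?)\s*$", raw, maxsplit=1
    )[0]
    # Remove citation, reference, and label commands together with their arguments.
    text = re.sub(
        r"[ \t]*\\(?:cite|ref|label)\w*\*?(?:\[[^\]]*\])*\{[^}]*\}", "", text
    )
    # Remove the remaining command names but keep their arguments.
    text = re.sub(r"\\[a-zA-Z]+\*?", "", text)
    # Remove leftover braces and collapse runs of spaces and tabs.
    text = re.sub(r"[{}]", "", text)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


raw = (
    "MEK \\textbf{phosphorylates} ERK \\cite{smith2010}.\n\n"
    "Acknowledgements\nWe thank our colleagues.\n\n"
    "References\n1. A citation.\n"
)
cleaned = clean_text(raw)
```

The cleaned string can then be saved to a .txt file. Note that truncating at the first matching heading discards everything after it, so this approach is unsuitable for papers where such a heading appears before the main text ends.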