Skip to content

Latest commit

 

History

History
56 lines (40 loc) · 1.56 KB

README.md

File metadata and controls

56 lines (40 loc) · 1.56 KB

Unpaywall

This repository contains an unpaywall python wrapper that downloads metadata and raw_pdf for a given DOI as well as a bash wrapper that runs s2orc-doc2json utility to parse pdfs into jsons.

You need to have Python, Java, and Bash installed on your system in order to use it.

Installation

Begin by cloning the repo, so you can get the required files:

git clone https://github.com/hcss-utils/unpaywall.git
cd unpaywall
git submodule update --init --recursive

In your terminal, you should now be located in your unpaywall folder.

Let's install virtual environment:

Linux/MacOS:

python3 -m venv env
source env/bin/activate

Now let's install dependencies:

pip install -r requirements.txt
pip install -e .
pip install -r s2orc-doc2json/requirements.txt
pip install -e s2orc-doc2json

If this command runs without any error messages, you can then move onto the next step, which is installing Java as well as Grobid server.

Once you have Java installed (look it up in google), run the following scripts:

bash s2orc-doc2json/scripts/setup_grobid.sh 
bash s2orc-doc2json/scripts/run_grobid.sh # after 87% it's not stuck - you could use grobid already

See s2orc-doc2json for more information.

Usage

Update lens-scopus-wos.csv. Then execute run.sh to parse pdfs into json (make sure you still have grobid running in another terminal tab):

cd scripts
bash run.sh