Home
Titipat Achakulvisut edited this page Jan 21, 2020 · 18 revisions
This page describes how to set up PySpark with Pubmed Parser and how to download the PubMed Open-Access (PubMed OA) and MEDLINE datasets:
- Setup Spark 2.1
- Download and preprocess MEDLINE dataset
- Download and preprocess Pubmed Open-Access Subset
- Download PubMed OA figures
Here are links for downloading the PubMed OA and MEDLINE data:
- The PubMed Open-Access (OA) dataset is available at http://www.ncbi.nlm.nih.gov/pmc/tools/ftp/. Here is the FTP link for downloading the bulk of the dataset.
- The MEDLINE XMLs are available here: ftp://ftp.nlm.nih.gov/nlmdata/.medleasebaseline/gz/
- The weekly updates to the MEDLINE XMLs are available here: ftp://ftp.nlm.nih.gov/nlmdata/.medlease/gz/
- The MEDLINE Document Type Definition (DTD) file is available at this link. It can be used to see which tags are available in a given MEDLINE XML file.
- Please see the copyright notice here before scraping data from the website.
- MEDLINE Kung-Fu, which uses medic to parse MEDLINE into a database
- MEDLINEXMLToJSON, implemented in JavaScript
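
As a complement to reading the DTD, the tags that actually occur in a downloaded MEDLINE file can be listed directly. The following is a minimal sketch using only the Python standard library, not Pubmed Parser itself. The function name `count_tags` is hypothetical, and the sketch assumes the file is a well-formed XML file, optionally gzip-compressed with a `.gz` suffix.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(path):
    """Count how often each XML tag occurs in a plain or gzipped XML file.

    Hypothetical helper for illustration; assumes a `.gz` suffix means gzip.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    counts = Counter()
    with opener(path, "rb") as handle:
        # iterparse reads the file incrementally instead of loading it at once
        for _event, elem in ET.iterparse(handle, events=("end",)):
            counts[elem.tag] += 1
            # Drop the text and children of an element once it has been counted
            elem.clear()
    return counts
```

Calling `count_tags` on a downloaded file returns a `Counter` mapping tag names such as `PubmedArticle` or `PMID` to their number of occurrences, which can then be compared against the tags declared in the DTD.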