Merge pull request r-three#56 from r-three/pubmedcentral
PubMed Central
alon-albalak authored Apr 24, 2024
2 parents a71426f + 46df469 commit 5d37388
Showing 11 changed files with 1,178 additions and 0 deletions.
1 change: 1 addition & 0 deletions pubmedcentral/.gitignore
@@ -0,0 +1 @@
data/*
52 changes: 52 additions & 0 deletions pubmedcentral/README.md
@@ -0,0 +1,52 @@
# PubMed Central

A collection of journal articles from [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc/), "a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM)."

## Data Download and Processing

Downloading and processing are handled by a single script. Simply run `bash run.sh` (or `bash run.sh 1000` to debug on 1,000 samples).

<details>
<summary>Under the hood of run.sh</summary>
`run.sh` has 3 main steps:

1. Download the list of all articles with `bash get-filelist.sh`
2. Download the data and convert from nxml to markdown with `bash download-convert-to-md.sh`
3. Convert the data to the Dolma format with `python to-dolma.py`
</details>

Files converted to markdown live in `data/md`, per-article metadata (author lists and creation dates) lives in `data/metadata`, and the final Dolma-formatted files live in `data/pubmedcentral`.
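
A condensed sketch of what step 3 assembles per article, assuming `to-dolma.py` pairs each `data/md/<PMCID>.md` file with its `data/metadata/<PMCID>.json` sidecar; the helper below is illustrative only, and the `license`, `url`, and `journal` fields (which come from the filelist) are omitted:

```python
import datetime
import json
import os


def build_record(pmcid, md_dir="data/md", metadata_dir="data/metadata"):
    """Assemble one Dolma-style record like the example below (sketch only)."""
    with open(os.path.join(md_dir, f"{pmcid}.md"), encoding="utf-8") as f:
        text = f.read()
    with open(os.path.join(metadata_dir, f"{pmcid}.json"), encoding="utf-8") as f:
        meta = json.load(f)
    return {
        "id": pmcid,
        "text": text,
        "source": "PubMed Central",
        "added": datetime.datetime.utcnow().isoformat(),
        "created": meta["created"],
        # license/url/journal metadata comes from the filelist (not shown here)
        "metadata": {"authors": meta["authors"]},
    }
```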

## Data Stats

| # Articles | # Tokens |
| ---------: | -------: |
| 3,997,890 |          |

## Example

```json
{
"id": "PMC176545",
"text": "# Introduction {#s1}\n\nHuman malaria is caused by four species of the parasitic protozoan genus...",
"source": "PubMed Central",
"added": "2024-04-19T17:48:14.010842",
"created": "2003-8-18",
"metadata":
{
"license": "Creative Commons - Attribution - https://creativecommons.org/licenses/by/4.0/",
"url": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC176545/",
"journal": "PLoS Biol. 2003 Oct 18; 1(1):e5",
"authors": [{"first": "Zbynek", "last": "Bozdech"}, {"first": "Manuel", "last": "Llin\u00e1s"}, {"first": "Brian Lee", "last": "Pulliam"}, {"first": "Edith D", "last": "Wong"}, {"first": "Jingchun", "last": "Zhu"}, {"first": "Joseph L", "last": "DeRisi"}],
}
}
```
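
To consume the output, here is a minimal sketch that iterates over records like the one above, assuming `to-dolma.py` writes newline-delimited JSON shards (possibly gzip-compressed) under `data/pubmedcentral`; the exact shard naming is an assumption:

```python
import glob
import gzip
import json
import os


def iter_records(data_dir="data/pubmedcentral"):
    # Shard naming and compression are assumptions; adjust the glob
    # to match whatever to-dolma.py actually writes.
    for path in sorted(glob.glob(os.path.join(data_dir, "*"))):
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    yield json.loads(line)


for record in iter_records():
    print(record["id"], record["metadata"]["url"])
    break  # just inspect the first record
```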

## Notes
Converting documents from nxml to markdown requires the pandoc tool, which can be installed following the instructions on the [pandoc website](https://pandoc.org/installing.html). Under the hood, each article is converted with `pandoc --quiet --from jats <article>.nxml -o <article>.md --wrap=none`.


TODO:
- [ ] Confirm article and token #s, fill in example
- [ ] Handle references to figures and tables
- [ ] Handle citations
6 changes: 6 additions & 0 deletions pubmedcentral/download-convert-to-md.sh
@@ -0,0 +1,6 @@
#!/usr/bin/env sh

TOTAL_DOCS="${1:-0}"

# downloads each file from the filelist and converts the .nxml file to .md
python3 download_and_convert_to_md.py --filelist data/permissive_filelist.txt --total_docs "${TOTAL_DOCS}"
201 changes: 201 additions & 0 deletions pubmedcentral/download_and_convert_to_md.py
@@ -0,0 +1,201 @@
import argparse
import functools
import json
import multiprocessing as mp
import os
import re
import shutil
import subprocess
import tarfile
import traceback
import xml.etree.ElementTree as ET

from tqdm import tqdm

from licensed_pile import logs
from licensed_pile.scrape import get_page

parser = argparse.ArgumentParser(description="Convert xml documents to markdown.")
parser.add_argument("--filelist", help="The path to the filelist.txt file.")
parser.add_argument(
"--output_dir", default="data/md/", help="Where the markdown files go."
)
parser.add_argument(
"--total_docs",
default=0,
type=int,
help="Total number of documents to convert, for debugging.",
)
parser.add_argument(
"--metadata_dir", default="data/metadata/", help="Where the metadata files go."
)
parser.add_argument(
"--processes",
default=mp.cpu_count(),
type=int,
help="Number of processes to use for conversion.",
)
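
# Example invocation (mirrors download-convert-to-md.sh):
#   python3 download_and_convert_to_md.py \
#       --filelist data/permissive_filelist.txt --total_docs 1000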


def get_authors_and_date(nxml_file: str, pmcid: str):
# get authors from nxml file
authors = []
date_created = None

tree = ET.parse(nxml_file)

# search for author tags
for author in tree.findall(".//contrib[@contrib-type='author']"):
surname = author.find("name/surname")
given_names = author.find("name/given-names")
if surname is not None and given_names is not None:
authors.append({"first": given_names.text, "last": surname.text})

# get date
# date can be found under "epub" or "pmc-release" tags
pub_types = ["epub", "pmc-release"]
for pub_type in pub_types:
date = tree.find(f".//pub-date[@pub-type='{pub_type}']")
if date is not None:
year = date.find("year").text
month = date.find("month").text
day = date.find("day").text
# convert to YYYY-MM-DD format
date_created = f"{year}-{month}-{day}"
break

if date_created is None:
# haven't seen any examples without a date, but just in case
# not a fatal error, just log it
logger = logs.get_logger("pubmedcentral")
logger.error(f"Date not found for {pmcid}")

return authors, date_created
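
# For reference, a minimal (hypothetical) JATS fragment containing the pieces
# get_authors_and_date() looks for; real PMC records hold much more:
#
#   <article><front><article-meta>
#     <contrib-group>
#       <contrib contrib-type="author">
#         <name><surname>Bozdech</surname><given-names>Zbynek</given-names></name>
#       </contrib>
#     </contrib-group>
#     <pub-date pub-type="epub">
#       <day>18</day><month>8</month><year>2003</year>
#     </pub-date>
#   </article-meta></front></article>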


def download(f_url: str, output_dir: str):
# download file from f_url to output_dir
try:
# get the tarball
r = get_page(f_url)

# write tarball to disk
with open(os.path.join(output_dir, f_url.split("/")[-1]), "wb") as fh:
fh.write(r.content)
    except Exception:
        logger = logs.get_logger("pubmedcentral")
        logger.error(f"Error downloading {f_url}")
        logger.error(traceback.format_exc())


def extract_and_convert_tarball(t: str, output_dir: str):
if not os.path.exists(t):
return
try:
with tarfile.open(t) as tar:
nxml = [f for f in tar.getnames() if f.endswith(".nxml")]

# make sure there's only one nxml file
if len(nxml) > 1:
# haven't seen an example with more than one nxml file, but just in case
error_message = f"More than one nxml file in {t}"
logger = logs.get_logger("pubmedcentral")
logger.error(error_message)
raise ValueError(error_message)
nxml = nxml[0]

# extract nxml file
tar.extract(nxml)

# get pmcid
pmcid = nxml.split("/")[0]

# get metadata from nxml file
authors, date_created = get_authors_and_date(nxml, pmcid)
metadata = {"authors": authors, "created": date_created}
# write to file
with open(
f"{os.path.join(args.metadata_dir, pmcid)}.json", "w", encoding="utf-8"
) as f:
json.dump(metadata, f, ensure_ascii=False)

# convert nxml to markdown
# pandoc options:
# --quiet is to suppress messages
# --from jats specifies the input format as Journal Article Tag Suite (https://jats.nlm.nih.gov/)
# -o is the output file
# --wrap=none is to prevent pandoc from wrapping lines
options = [
"pandoc",
"--quiet",
"--from",
"jats",
nxml,
"-o",
f"{pmcid}.md",
"--wrap=none",
]
            subprocess.run(options, check=True)

# remove extracted files
os.rename(f"{pmcid}.md", f"{output_dir}/{pmcid}.md")
shutil.rmtree(nxml.split("/")[0], ignore_errors=True)

    except Exception:
        logger = logs.get_logger("pubmedcentral")
        logger.error(f"Error extracting {t}")
        logger.error(traceback.format_exc())


def download_and_convert(
line: str, output_dir: str, base_url="https://ftp.ncbi.nlm.nih.gov/pub/pmc/"
):
# split line into parts
partial_path = line.split("\t")[0]

# create paths for the url and the destination of the markdown file
f_url = os.path.join(base_url, partial_path)
f_dest = os.path.join(output_dir, partial_path.split("/")[-1])

try:
download(f_url, output_dir)
extract_and_convert_tarball(f_dest, output_dir)

# delete the tarball
os.remove(f_dest)

except Exception as e:
logger = logs.get_logger("pubmedcentral")
logger.error(e)


def main(args):
os.makedirs(args.output_dir, exist_ok=True)
os.makedirs(args.metadata_dir, exist_ok=True)

with open(args.filelist) as fh:
files = fh.read().split("\n")

    # drop the header row and any empty lines
    files = [f for f in files[1:] if f.strip()]

if args.total_docs > 0:
files = files[: args.total_docs]

with mp.Pool(args.processes) as p:
# use list to force the execution of the imap iterable within the context of the multiprocessing pool
_ = list(
tqdm(
p.imap(
functools.partial(download_and_convert, output_dir=args.output_dir),
files,
),
total=len(files),
)
)


if __name__ == "__main__":
args = parser.parse_args()
logs.configure_logging("pubmedcentral")
main(args)
37 changes: 37 additions & 0 deletions pubmedcentral/example/PMC1149498.md
@@ -0,0 +1,37 @@
Almost exactly five years ago, in early June 2000, BMC Bioinformatics received its first submission. Five years on, it has received over a thousand submissions, and the journal is continuing to grow rapidly (Figure [1](#F1){ref-type="fig"}).

In the past few months, developments have included a refreshed international [editorial board](http://www.biomedcentral.com/bmcbioinformatics/edboard), which now consists of over 50 leaders in the field, and a [Bioinformatics and Genomics gateway](http://www.biomedcentral.com/gateways/bioinformaticsgenomics/) that brings together relevant content from across BioMed Central\'s 130+ Open Access journals. And by the time you read this, *BMC Bioinformatics* should have its first official ISI Impact Factor. Impact factors certainly have their problems -- a previous editorial in this journal\[[@B1]\] discussed the arbitrariness of the process by which ISI selects journals for tracking, and the resulting unnecessary time delay before Impact Factors become available. One thing is clear though -- with *BMC Bioinformatics* having an Impact Factor, there are more reasons than ever to make it the first choice for your research.

# Five years in bioinformatics

Looking back over the first 5 years of the journal, are any significant trends evident? One thing that is noticeable is the prevalence of the open-source model of software development. In fact more than 10% of all BMC Bioinformatics articles include the term \"open-source\". Hundreds of open-source bioinformatics projects are now hosted on sites such as [bioinformatics.org](http://bioinformatics.org) and [sourceforge.net](http://sourceforge.net/softwaremap/trove_list.php?form_cat=252). No doubt the similar philosophies of open-source software and Open Access publishing have been a factor in making *BMC Bioinformatics* one of BioMed Central\'s most successful journals. Two other emerging trends are, firstly, an increasing use of web service technology to connect disparate tools into analysis pipelines, and secondly, the development of systems to allow biological knowledge to be modelled and expressed in structured form. The linking factor between both these trends is that increasingly, as the data deluge continues, the \'users\' of bioinformatics tools and the \'readers\' of the biological literature, are likely to be computer systems rather than human beings.

# Web services and data analysis pipelines

As bioinformatics tools have proliferated, the complexity of data analysis has increased. Often, a sequence of analysis steps each using different tools must be carried out one after the other. This might be done manually or by using a monolithic system that is capable of carrying out multiple analyses, or, more flexibly, by writing special \'glue code\', often in Perl, to connect together multiple tools into a pipeline. The problem with the latter approach, though, is that in the absence of defined standards for the input and output of different tools, lots of glue code has to be written in order to create each new pipeline. Worse, systems built in this way tend to be fragile, since at any time one of the tools in the pipeline may change the format of its input or output (breaking the system), because there is no explicit \'contract\' between the various tools as to what input and output formats each will support. Web services \[[@B2]\], and more generally, \'Service Oriented Architectures\' \[[@B3]\] promise to provide a solution by providing a means for codifying standard interfaces that can be used to expose bioinformatics tools over the web. Projects such as MyGrid \[[@B4]\] have then built on these standards to provide biologists with graphical user interfaces that can be used to build new analysis pipelines interactively, without needing to write code. *BMC Bioinformatics* has published several articles on the use of Web Service technologies such as the Simple Object Access Protocol (SOAP) - if you are interested, try searching the journal for: [SOAP OR \"web services\"](http://www.biomedcentral.com/search/results.asp?drpField1=&txtSearch1=SOAP+OR+%22web+services%22&drpPhrase1=and&drpField2=%5BTI%5D&txtSearch2=&drpPhrase2=and&drpField3=%5BAU%5D&txtSearch3=&drpPhrase3=and&drpField4=%5BTIAB%5D&txtSearch4=&drpPhrase4)

# Text mining and biological semantics

Another growth area in bioinformatics has been the structured representation and modelling of biological knowledge. The Gene Ontology project \[[@B5]\] has provided an important foundation for much of this work, defining a set of controlled vocabularies that allow biological concepts and relationships to be expressed in a standard way.

Much of the initial work on modelling biological knowledge has explored the use of text-mining techniques to automatically derive structured semantic information from the relatively unstructured text of scientific research articles. [BioMed Central\'s Open Access corpus](http://www.biomedcentral.com/info/about/datamining)\[[@B6]\] is now rapidly approaching 10,000 articles and provides ideal raw material for such research. It is already being used by many researchers, both in industry and academia.

*BMC Bioinformatics* publishes many papers on text-mining topics, including the recently published supplement \[[@B7]\], which consists of papers presented at last year\'s BioCreAtIvE text-mining workshop in Granada, Spain. Text mining has its limits, however. Imagine what could be achieved if articles, rather than consisting entirely of free-form natural language, contained explicit assertions about biological knowledge in unambiguous, machine-readable form. This is the oft-vaunted promise of the 'Semantic Web' \[[@B8]\], but it has proved to be very difficult to realize in practice.

Some recent developments, however, suggest that progress is being made. For example, this editorial was created using [Publicon](http://www.biomedcentral.com/info/ifora/publicon)\[[@B9]\]- a new breed of scientific authoring tool developed by Wolfram Research with input from BioMed Central. Publicon is easy to use, but it is also a highly structured authoring environment. It can not only output BioMed Central\'s native article XML format, but also embed mathematical equations as \'islands\' of semantically-rich MathML \[[@B10]\]. This structured mathematical information is then preserved throughout the publication process, from the author\'s computer right through to the reader\'s desktop with no intermediate unstructured version along the way that might cause information to be lost.

So, for example, if you are accessing this editorial online using a suitable browser, you should be able to cut and paste the equation below into any MathML-aware application, as a mathematically meaningful equation rather than an image.

$${\left( {i~{\nabla{- m}}} \right)~{\Phi_{e^{2}}\left\lbrack {B,x} \right\rbrack}} = {{{B(x)}~{\Phi_{e^{2}}\left\lbrack {B,x} \right\rbrack}} + {i~e^{2}~\gamma_{\mu}~\left. \int{{\delta_{+}\left( s_{x~1}^{2} \right)}~\left( {{{{\delta\Phi}_{e^{2}}\left\lbrack {B,x} \right\rbrack}/\delta}~{B_{\mu}(1)}} \right)~{\mathbb{d}\tau_{1}}} \right.}}$$

In two accompanying Commentaries, the issues associated with capturing and representing biological knowledge are discussed further. Murray-Rust *et al.*\[[@B11]\] consider how chemical information can best be represented within scientific articles, and what bioinformaticists and chemists can learn from one another. Meanwhile, Mons \[[@B12]\] explores in more detail how smart authoring tools can enrich the scientific literature by allowing authors to express themselves unambiguously, avoiding the \'data burying\' that makes text mining necessary in the first place.

::: {#refs}
:::

## Figures and Tables

<figure id="F1">
<p><img src="1471-2105-6-140-1" /></p>
<figcaption>Number of submissions to BMC Bioinformatics. The figure for 2005 represents a conservative projection based on the rate of growth of submissions during the first half of the year.</figcaption>
</figure>
