PubMed Central #56

alon-albalak · 2024-02-06T01:08:01Z

This PR contains the initial code for downloading/extracting/converting PubMed Central articles.

It introduces 3 main files:

get-filelist.sh (downloads the metadata for all pubmed articles and filters out data with non-permissive licenses)
download-convert-to-md.sh (downloads individual pubmed articles as nxml files and converts them to md)
to-dolma.py (formats the articles in the dolma style)
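The license-filtering step in the first file can be illustrated with a small Python stand-in. This is only a sketch: the actual step lives in get-filelist.sh, and the column names, the sample rows, and the set of licenses treated as permissive are all assumptions made here for illustration, not taken from the PR.

```python
import csv
import io

# Assumed set of permissive licenses; the PR does not state its exact list.
PERMISSIVE = {"CC0", "CC BY", "CC BY-SA"}


def filter_permissive(rows, license_field="License"):
    """Yield only the rows whose license value is in PERMISSIVE."""
    for row in rows:
        # DictReader returns None for missing trailing fields, hence the `or ""`.
        if (row.get(license_field) or "").strip() in PERMISSIVE:
            yield row


# Made-up file list in CSV form; the column names are hypothetical.
sample = io.StringIO(
    "File,Accession ID,License\n"
    "pkg/PMC1.tar.gz,PMC1,CC BY\n"
    "pkg/PMC2.tar.gz,PMC2,CC BY-NC\n"
    "pkg/PMC3.tar.gz,PMC3,CC0\n"
)
kept = [row["Accession ID"] for row in filter_permissive(csv.DictReader(sample))]
print(kept)
```

Keeping the allowed licenses in one explicit set makes it easy to audit which articles end up in the permissive file list.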

To test the code without running over the full dataset, change line 4 in download-convert-to-md.sh to: python3 download_and_convert_to_md.py --filelist data/permissive_filelist.txt --total_docs 1000

craffel · 2024-02-06T15:30:21Z

Thanks a lot, @alon-albalak. Can you modify download-convert-to-md.sh to take a flag that specifies the total number of docs, so that it can be tested without modifying the script? Also, do you mind posting an example .md (ideally of an article that has tables and equations) to inspect?

alon-albalak · 2024-02-06T18:13:06Z

Thanks for taking a look, @craffel! I modified the download-convert-to-md.sh script as you requested.
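A document-count flag of this kind could look roughly like the following on the Python side. The --filelist and --total_docs names come from the command quoted earlier in this thread; the defaults, the "0 means no limit" convention, and the select_files helper are assumptions for illustration only.

```python
import argparse


def build_parser():
    """Build a CLI parser with a debugging cap on the number of documents."""
    parser = argparse.ArgumentParser(
        description="Download PMC articles and convert them to markdown."
    )
    parser.add_argument("--filelist", default="data/permissive_filelist.txt")
    # Assumed convention: 0 (the default) means "process every document".
    parser.add_argument("--total_docs", type=int, default=0)
    return parser


def select_files(paths, total_docs):
    """Return the first total_docs paths, or all of them when total_docs <= 0."""
    return paths[:total_docs] if total_docs > 0 else paths


args = build_parser().parse_args(["--total_docs", "1000"])
print(args.filelist, args.total_docs)
```

The shell wrapper can then forward its own arguments to the Python script, so a short test run needs no edits to either file.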

For some example .md files, check out PMC553995.md and PMC544568.md for tables that were formatted nicely (you'll have to look at the raw files because GitHub hides the table).
See PMC509239.md for a table that is pretty poorly formatted because it is meant to have many figures inside the table. Tables 1 and 2 look pretty rough, but Tables 3 and 4 came out okay.
See PMC1149498.md for a fairly complex equation that couldn't be formatted as native markdown and was left as LaTeX, and the modelling section of PMC544568.md for some equations formatted in HTML/markdown.

As a side note, some files also include supplemental tables as attachments, which I didn't integrate into the text because they are pretty infrequent (only 13 out of 10000 files have a supplemental table) and very large when they do exist (on average 1.5M characters). We definitely could include them inline, but I'm not sure it would be useful.

craffel · 2024-02-06T19:16:35Z

Huh, yeah, those seem pretty reasonable. It's a little weird that it includes image tags - I'm not sure that's something that would be helpful to model, maybe we should try to strip them out? Also, regarding the markdown formatting, I think roughly speaking we should aim to have unmarked text. Is there a nice pipeline to go from markdown to unmarked text (with the exception of, perhaps, equations and tables)? Thanks.
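One possible way to strip image tags is a pair of regular expressions over the converted markdown. This is a rough sketch, not the approach the PR takes: it assumes images appear either as markdown image syntax (optionally followed by a {...} attribute block) or as HTML img tags, and it does not handle alt text that itself contains square brackets.

```python
import re

# Markdown image: ![alt](target), optionally followed by {attributes}.
MD_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)(\{[^}]*\})?")
# HTML image tag, self-closing or not.
HTML_IMAGE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def strip_images(markdown):
    """Remove markdown and HTML image tags, leaving all other text untouched."""
    text = MD_IMAGE.sub("", markdown)
    return HTML_IMAGE.sub("", text)


print(strip_images("Figure: ![plot](fig1.jpg){width=2in} and [a link](ref.html)"))
```

Ordinary links are preserved because the markdown pattern requires the leading exclamation mark.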

blester125

This is a great start thanks!

I made some comments that should hopefully improve the reliability of the code and reduce the number of subprocess calls.

Did you explore using pandoc as part of a dolma parallel processor? IIUC dolma has functionality to automatically parallelize/get the most out of hardware/scale to multiple worker settings. As it is now, if the pandoc part gets to be a bottleneck it seems like it could be hard to optimize.

pubmedcentral/PMC1149498.md

pubmedcentral/download-convert-to-md.sh

pubmedcentral/download_and_convert_to_md.py

pubmedcentral/to-dolma.py

blester125 · 2024-02-08T17:46:36Z

Huh, yeah, those seem pretty reasonable. It's a little weird that it includes image tags - I'm not sure that's something that would be helpful to model, maybe we should try to strip them out? Also, regarding the markdown formatting, I think roughly speaking we should aim to have unmarked text. Is there a nice pipeline to go from markdown to unmarked text (with the exception of, perhaps, equations and tables)? Thanks.

I think things like this would probably be best implemented as a second step in the pipeline, one that uses the dolma parallel processors and whose whole job would be converting the markdown into an even plainer text format.

alon-albalak · 2024-04-17T15:10:18Z

Did you explore using pandoc as part of a dolma parallel processor? IIUC dolma has functionality to automatically parallelize/get the most out of hardware/scale to multiple worker settings. As it is now, if the pandoc part gets to be a bottleneck it seems like it could be hard to optimize.

I did not consider this. Do you have an example of how that would look?
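As a rough illustration of the general idea (one conversion job per file, fanned out over a worker pool), here is a standard-library sketch. It is not dolma's parallel-processor API, which is not reproduced here; the function names, the pandoc arguments, and the output layout are assumptions for illustration. Threads are enough in this sketch because each worker mostly waits on a pandoc subprocess.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def pandoc_convert(nxml_path, md_path):
    """Convert one JATS .nxml file to markdown by shelling out to pandoc."""
    # Assumed invocation; the PR's actual pandoc flags may differ.
    subprocess.run(
        ["pandoc", "-f", "jats", "-t", "markdown", str(nxml_path), "-o", str(md_path)],
        check=True,
    )


def convert_all(nxml_paths, out_dir, workers=4, convert=pandoc_convert):
    """Run convert(src, dst) for every input file; return (src, error) pairs."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    jobs = [(Path(p), out_dir / (Path(p).stem + ".md")) for p in nxml_paths]
    failures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(convert, src, dst): src for src, dst in jobs}
        for future, src in futures.items():
            try:
                future.result()
            except Exception as exc:  # record the failure and keep going
                failures.append((src, exc))
    return failures
```

Passing the converter in as a parameter keeps the pool logic testable without pandoc installed, and collecting failures instead of raising lets one malformed article be logged without stopping the whole run.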

blester125

One tiny tweak and then I think it's ready to merge! Great job on this!

pubmedcentral/download_and_convert_to_md.py

alon-albalak · 2024-04-19T18:02:46Z

Updated the author list as List[{"first": str, "last": str}]
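For illustration, an author list in that shape could be pulled from an article's JATS XML roughly as follows. This is a sketch rather than the code in the PR: it assumes authors appear as contrib elements with contrib-type="author" containing a name element with surname and given-names children, and the extract_authors name and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_authors(nxml_text):
    """Return authors as a list of {"first": str, "last": str} dicts."""
    root = ET.fromstring(nxml_text)
    authors = []
    for contrib in root.iter("contrib"):
        if contrib.get("contrib-type") != "author":
            continue
        name = contrib.find("name")
        if name is None:
            continue  # e.g. a group author with no personal name
        authors.append(
            {
                "first": (name.findtext("given-names") or "").strip(),
                "last": (name.findtext("surname") or "").strip(),
            }
        )
    return authors


# Made-up minimal article used only to exercise the function.
sample = """<article><front><article-meta><contrib-group>
<contrib contrib-type="author"><name><surname>Doe</surname><given-names>Jane</given-names></name></contrib>
<contrib contrib-type="editor"><name><surname>Roe</surname><given-names>Rick</given-names></name></contrib>
</contrib-group></article-meta></front></article>"""
print(extract_authors(sample))
```

Storing first and last names separately avoids guessing where to split a single name string later on.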

alon-albalak added 7 commits January 30, 2024 08:51

code for downloading and converting pubmedcentral to dolma

2b381f5

add a check for aggregated filelist

061979a

get data for all dumps

3aca009

fix check for aggregating filelists

33afcb7

fix style

9e98945

add debugging option, set download and extract to quiet by default

8b51578

remove old scripts

f655672

alon-albalak assigned blester125 and nkandpa2 and unassigned blester125 and nkandpa2 Feb 6, 2024

alon-albalak requested review from blester125 and nkandpa2 February 6, 2024 01:16

fix style

331d392

alon-albalak added 4 commits February 6, 2024 09:07

add cli option for total_docs when debugging

d521114

add processed examples

544b1cd

add processed example

8a2dd7f

add equation example

e03d932

blester125 requested changes Feb 8, 2024

alon-albalak added 7 commits March 17, 2024 17:33

move example files to example folder

1f631a1

wrap interpolations in quotes

aa8f561

by default send markdown files to md folder

5201fb0

Merge branch 'main' into pubmedcentral

484d49e

rename output dir, rename variables for ease of understanding

9c32116

Merge branch 'main' into pubmedcentral

eae0254

fix black formatting

cbb92b4

alon-albalak added 7 commits April 16, 2024 11:55

improve logging for nxml error

90f0827

replace os.system calls with native python functions

ea1edac

simplify variables

00f604d

add flag to specify number of cpus for parallel processing

77abda3

improve readability

a8b36fd

minor improvements

23f00da

add authors to metadata

a9009b9

alon-albalak added 2 commits April 17, 2024 08:23

add simple run.sh file

de22649

add readme

ca3b6ce

blester125 requested changes Apr 19, 2024

store authors as dict of first and last names

48f1468

alon-albalak added 2 commits April 22, 2024 12:21

add 'created' field for each article

f277e28

update readme with created

46df469

blester125 approved these changes Apr 23, 2024

alon-albalak merged commit 5d37388 into main Apr 24, 2024
2 checks passed

alon-albalak linked an issue Apr 25, 2024 that may be closed by this pull request

PubMed #8

Closed

alon-albalak deleted the pubmedcentral branch April 25, 2024 15:06
