PubMed Central #56

Merged
merged 31 commits into from
Apr 24, 2024


alon-albalak
Collaborator

@alon-albalak alon-albalak commented Feb 6, 2024

This PR contains the initial code for downloading/extracting/converting PubMed Central articles.

It introduces 3 main files:

  • get-filelist.sh (downloads the metadata for all pubmed articles and filters out data with non-permissive licenses)
  • download-convert-to-md.sh (downloads individual pubmed articles as nxml files and converts them to md)
  • to-dolma.py (formats the articles in the dolma style)
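
As a rough illustration of the license-filtering step in get-filelist.sh, here is a minimal Python sketch. The column names (`File`, `Accession ID`, `License`) and the set of labels treated as permissive are assumptions made for this example; the PR description does not spell them out, and the actual script is a shell script.

```python
import csv
import io

# Assumed set of permissive license labels; the PR does not list them.
PERMISSIVE = {"CC BY", "CC0", "CC BY-SA"}


def filter_permissive(rows, license_field="License"):
    """Keep only the rows whose license label is in PERMISSIVE."""
    return [row for row in rows if row.get(license_field, "").strip() in PERMISSIVE]


# Tiny in-memory stand-in for the downloaded metadata file (hypothetical rows).
sample = io.StringIO(
    "File,Accession ID,License\n"
    "a.tar.gz,PMC1,CC BY\n"
    "b.tar.gz,PMC2,CC BY-NC\n"
    "c.tar.gz,PMC3,CC0\n"
)
kept = filter_permissive(csv.DictReader(sample))
print([row["Accession ID"] for row in kept])  # → ['PMC1', 'PMC3']
```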

To test the code without running over the full dataset, change line 4 in download-convert-to-md.sh to python3 download_and_convert_to_md.py --filelist data/permissive_filelist.txt --total_docs 1000
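
For readers who want to see how the two flags in that command could be handled, here is a minimal sketch using `argparse`. Only the flag names `--filelist` and `--total_docs` come from the command above; the helper names `build_parser` and `select_files`, the default values, and the truncation behavior are assumptions, not the code in this PR.

```python
import argparse


def build_parser():
    """Build a parser for the two flags mentioned in the PR description."""
    parser = argparse.ArgumentParser(description="Download and convert PMC articles.")
    parser.add_argument("--filelist", default="data/permissive_filelist.txt",
                        help="Path to the list of permissively licensed articles.")
    parser.add_argument("--total_docs", type=int, default=None,
                        help="Process only the first N entries (default: all).")
    return parser


def select_files(lines, total_docs=None):
    """Return every entry, or only the first total_docs entries when a limit is set."""
    return lines if total_docs is None else lines[:total_docs]


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--filelist", "data/permissive_filelist.txt", "--total_docs", "1000"]
)
print(args.filelist, args.total_docs)
```

Keeping the limit optional means the same entry point serves both a quick smoke test and a full run.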

@craffel
Collaborator

craffel commented Feb 6, 2024

Thanks a lot, @alon-albalak . Can you modify download-convert-to-md.sh to take in a flag that specifies the total docs so that it can be tested without modifying the script? Also, do you mind posting an example .md (ideally of an article that has tables and equations) to inspect?

@alon-albalak
Collaborator Author

Thanks for taking a look @craffel ! I modified the download-convert-to-md.sh script as you requested.

For some example .md files, check out PMC553995.md and PMC544568.md for tables that were formatted nicely (you'll have to look at the raw files because GitHub hides the table).
See PMC509239.md for a table that is pretty poorly formatted because it is meant to have many figures inside the table. Tables 1 and 2 look pretty rough, but Tables 3 and 4 came out okay.
See PMC1149498.md for a fairly complex equation that couldn't be formatted as native markdown, and was left as LaTeX, and the modelling section of PMC544568.md for some equations formatted in html/markdown.

As a side note, some files also include supplemental tables as attachments which I didn't integrate into the text because they are pretty infrequent (only 13 out of 10000 files have a supplemental table) and very large when they do exist (on average 1.5M characters). We definitely could include them inline but not sure it would be useful.

@craffel
Collaborator

craffel commented Feb 6, 2024

Huh, yeah, those seem pretty reasonable. It's a little weird that it includes image tags - I'm not sure that's something that would be helpful to model, maybe we should try to strip them out? Also, regarding the markdown formatting, I think roughly speaking we should aim to have unmarked text. Is there a nice pipeline to go from markdown to unmarked text (with the exception of, perhaps, equations and tables)? Thanks.
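
One possible way to strip image tags, sketched with regular expressions. This is an illustration only: the patterns assume images appear either as markdown `![alt](path)` (optionally followed by a `{...}` attribute block) or as HTML `<img ...>` tags, and the function name `strip_images` is made up for this example. It does not address the broader markdown-to-plain-text question.

```python
import re

# Markdown image, optionally followed by an attribute block such as {width="3in"}.
MD_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)(\{[^}]*\})?")
# Raw HTML image tag.
HTML_IMAGE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def strip_images(markdown: str) -> str:
    """Remove markdown and HTML image tags, leaving all other text untouched."""
    text = MD_IMAGE.sub("", markdown)
    return HTML_IMAGE.sub("", text)


sample = (
    "Results are shown below.\n\n"
    '![Figure 1](fig1.jpg){width="3in"}\n\n'
    'See <img src="eq.gif"> for details.'
)
cleaned = strip_images(sample)
print(cleaned)
```

A regex pass like this is simple but can miss nested brackets in alt text; a markdown parser would be more robust if that turns out to matter.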

Collaborator

@blester125 blester125 left a comment

This is a great start thanks!

I made some comments that should hopefully improve the reliability of the code and reduce the number of subprocess calls.

Did you explore using pandoc as part of a dolma parallel processor? IIUC dolma has functionality to automatically parallelize/get the most out of hardware/scale to multiple worker settings. As it is now, if the pandoc part gets to be a bottleneck it seems like it could be hard to optimize.

@blester125
Collaborator

Huh, yeah, those seem pretty reasonable. It's a little weird that it includes image tags - I'm not sure that's something that would be helpful to model, maybe we should try to strip them out? Also, regarding the markdown formatting, I think roughly speaking we should aim to have unmarked text. Is there a nice pipeline to go from markdown to unmarked text (with the exception of, perhaps, equations and tables)? Thanks.

I think things like this would probably be best implemented as a second step in the pipeline, one that uses the dolma parallel processors and whose whole job would be converting the markdown into an even plainer text format.

@alon-albalak
Collaborator Author

Did you explore using pandoc as part of a dolma parallel processor? IIUC dolma has functionality to automatically parallelize/get the most out of hardware/scale to multiple worker settings. As it is now, if the pandoc part gets to be a bottleneck it seems like it could be hard to optimize.

I did not consider this. Do you have an example of how that would look?
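
To make the idea concrete, here is a generic sketch of fanning pandoc conversions out across workers. It deliberately uses only the Python standard library rather than dolma's parallel processor, whose API is not shown in this thread. The function names, the worker count, and the pandoc invocation (`-f jats -t markdown`) are assumptions for illustration, not the code in this PR.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def pandoc_convert(src: Path, dst: Path) -> None:
    """Convert one JATS .nxml file to markdown by shelling out to pandoc."""
    subprocess.run(
        ["pandoc", "-f", "jats", "-t", "markdown", str(src), "-o", str(dst)],
        check=True,  # raise if pandoc exits with a non-zero status
    )


def convert_all(sources, out_dir, convert=pandoc_convert, workers=4):
    """Run convert(src, dst) for every source file using a pool of worker threads."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    jobs = [(Path(s), out_dir / (Path(s).stem + ".md")) for s in sources]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for all jobs and re-raises the first worker exception.
        list(pool.map(lambda job: convert(*job), jobs))
    return [dst for _, dst in jobs]
```

Threads are enough here because each job spends its time waiting on an external pandoc process. Passing the converter in as an argument also makes it easy to swap in a stub when testing without pandoc installed.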

Collaborator

@blester125 blester125 left a comment

One tiny tweak and then I think it's ready to merge! Great job on this!

@alon-albalak
Collaborator Author

Updated the author list as List[{"first": str, "last": str}]
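
For reference, the author-list shape described above could be typed roughly as follows. The `Author` class, the `make_authors` helper, the placeholder names, and the surrounding record fields are hypothetical and only illustrate the `List[{"first": str, "last": str}]` structure; they are not taken from to-dolma.py.

```python
from typing import List, TypedDict


class Author(TypedDict):
    """One author entry, matching the {"first": str, "last": str} shape."""
    first: str
    last: str


def make_authors(pairs) -> List[Author]:
    """Turn (first, last) name pairs into a list of author dictionaries."""
    return [{"first": first, "last": last} for first, last in pairs]


# Placeholder names, not drawn from any real article.
authors = make_authors([("Ada", "Lovelace"), ("Alan", "Turing")])

# Hypothetical record layout; only the authors structure comes from the comment above.
record = {"id": "PMC0000000", "text": "...", "metadata": {"authors": authors}}
print(record["metadata"]["authors"][0]["last"])  # → Lovelace
```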

@alon-albalak alon-albalak merged commit 5d37388 into main Apr 24, 2024
2 checks passed
@alon-albalak alon-albalak linked an issue Apr 25, 2024 that may be closed by this pull request
@alon-albalak alon-albalak deleted the pubmedcentral branch April 25, 2024 15:06