PubMed Central #56
Thanks a lot, @alon-albalak . Can you modify download-convert-to-md.sh to take in a flag that specifies the total docs so that it can be tested without modifying the script? Also, do you mind posting an example .md (ideally of an article that has tables and equations) to inspect?
Thanks for taking a look @craffel ! I modified the download-convert-to-md.sh script as you requested. For some example .md files, check out As a side note, some files also include supplemental tables as attachments which I didn't integrate into the text because they are pretty infrequent (only 13 out of 10000 files have a supplemental table) and very large when they do exist (on average 1.5M characters). We definitely could include them inline but not sure it would be useful.
Huh, yeah, those seem pretty reasonable. It's a little weird that it includes image tags. I'm not sure that's something that would be helpful to model, so maybe we should try to strip them out? Also, regarding the markdown formatting, I think roughly speaking we should aim to have unmarked text. Is there a nice pipeline to go from markdown to unmarked text (with the exception of, perhaps, equations and tables)? Thanks.
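For illustration, stripping image tags could look roughly like the sketch below. It assumes the converted files use the inline markdown form `![alt](target)` and possibly raw `<img>` tags; the actual pandoc output in this PR may differ (for example, trailing attribute blocks or nested brackets in alt text are not handled). `strip_images` is a hypothetical helper name, not something in the PR.

```python
import re

# Inline markdown images: ![alt text](target "optional title")
MD_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
# Raw HTML image tags that a converter may pass through unchanged
HTML_IMAGE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def strip_images(markdown_text: str) -> str:
    """Remove image references while leaving ordinary links and text alone.

    Hypothetical helper: assumes simple, non-nested image syntax.
    """
    without_md_images = MD_IMAGE.sub("", markdown_text)
    return HTML_IMAGE.sub("", without_md_images)
```

Ordinary links such as `[text](url)` are left untouched because the pattern requires the leading `!`.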
This is a great start, thanks!
I made some comments that should hopefully help the reliability of the code and reduce the number of subprocess calls.
Did you explore using pandoc as part of a dolma parallel processor? IIUC dolma has functionality to automatically parallelize/get the most out of hardware/scale to multiple worker settings. As it is now, if the pandoc part gets to be a bottleneck it seems like it could be hard to optimize.
I think things like this would probably be best implemented as a second step in the pipeline, one that uses the dolma parallel processors, whose whole job would be converting the markdown into an even plainer text format.
I did not consider this. Do you have an example of how that would look?
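As a rough sketch of what such a second step might look like: the code below walks a directory of .md files and converts each one to plain text in parallel. It deliberately uses the standard library's `ThreadPoolExecutor` rather than dolma's parallel processors, whose API is not shown in this thread, so treat it as a stand-in for the idea and not as dolma code. `pandoc_to_plain`, `convert_file`, `convert_tree`, and the `.txt` output suffix are all hypothetical choices, and the pandoc call assumes the `pandoc` CLI is installed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def pandoc_to_plain(markdown_text: str) -> str:
    """Convert markdown to plain text by piping it through the pandoc CLI."""
    result = subprocess.run(
        ["pandoc", "--from", "markdown", "--to", "plain", "--wrap=none"],
        input=markdown_text,
        capture_output=True,
        text=True,
        check=True,  # raise if pandoc exits with an error
    )
    return result.stdout


def convert_file(src: Path, dst: Path, convert=pandoc_to_plain) -> Path:
    """Read one markdown file, convert it, and write the result to dst."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_text(convert(src.read_text(encoding="utf-8")), encoding="utf-8")
    return dst


def convert_tree(src_dir, dst_dir, convert=pandoc_to_plain, workers=4):
    """Convert every .md file under src_dir, mirroring the layout in dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    sources = sorted(src_dir.rglob("*.md"))
    targets = [dst_dir / s.relative_to(src_dir).with_suffix(".txt") for s in sources]
    # Threads are used here because each worker mostly waits on a subprocess.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(
            pool.map(lambda pair: convert_file(pair[0], pair[1], convert),
                     zip(sources, targets))
        )
```

Keeping the converter as a parameter means the markdown-to-text logic can be swapped or tested without invoking pandoc, and the per-file function is the piece that would move into a dolma processor if that route is taken.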
One tiny tweak and then I think it's ready to merge! Great job on this!
Updated the author list as |
This PR contains the initial code for downloading/extracting/converting PubMed Central articles.
It introduces 3 main files:
- get-filelist.sh (downloads the metadata for all pubmed articles and filters out data with non-permissive licenses)
- download-convert-to-md.sh (downloads individual pubmed articles as nxml files and converts them to md)
- to-dolma.py (formats the articles in the dolma style)

To test the code without running over the full dataset, change line 4 in download-convert-to-md.sh to:
python3 download_and_convert_to_md.py --filelist data/permissive_filelist.txt --total_docs 1000
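
For readers who want to see how a document cap like this can be wired up, here is a minimal sketch of argument handling for the two flags shown in the command above. Only the flag names `--filelist` and `--total_docs` come from the PR; the default of 0 meaning "process everything", and the helpers `parse_args` and `select_entries`, are assumptions made for illustration and may not match the actual script.

```python
import argparse
from itertools import islice


def parse_args(argv=None):
    """Parse the two flags used in the test command (defaults are assumed)."""
    parser = argparse.ArgumentParser(
        description="Download PubMed Central articles and convert them to markdown."
    )
    parser.add_argument("--filelist", required=True,
                        help="Path to the filtered file list.")
    parser.add_argument("--total_docs", type=int, default=0,
                        help="Maximum number of documents; 0 means no limit (assumption).")
    return parser.parse_args(argv)


def select_entries(lines, total_docs):
    """Return non-empty file-list entries, capped at total_docs when positive."""
    entries = (line.strip() for line in lines if line.strip())
    if total_docs > 0:
        return list(islice(entries, total_docs))
    return list(entries)
```

With this shape, a quick test run only needs a small `--total_docs` value and no edits to the script itself.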