More recent dumps? #1
Hi, I did indeed just download Crossref data. I have the raw responses; they look the same as the previous dump: https://doi.org/10.6084/m9.figshare.6170414 You're probably more interested in the JSON of the works only and not the response body.

I am aware the data contains duplicates; I am removing them in the post-processed data rather than the raw data, i.e. from the DOI-to-DOI links as well as the "summary" CSV that I create with key information for the individual works (which I then use to calculate further summary CSVs for the statistics used in the notebook).

It would be interesting to look at a more "efficient" way of storing and sharing the data that also allows for updates, in particular by partitioning it, e.g. by published date or DOI prefix (neither of which should change). But it may not be worth putting too much effort into this, as it doesn't seem unlikely that Crossref will eventually provide dumps for free.

I'd be interested to hear about your use case for the data dump.
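To make the de-duplication step concrete, here is a minimal sketch (not the repository's actual post-processing code) that keeps only the most recently indexed record per DOI. It assumes the works have already been extracted into line-delimited JSON; `DOI` and `indexed.timestamp` are standard Crossref work fields, but the file name and layout here are assumptions.

```python
import json

def latest_per_doi(lines):
    """Keep only the most recently indexed work for each DOI."""
    latest = {}
    for line in lines:
        work = json.loads(line)
        doi = work.get("DOI", "").lower()
        # Crossref works carry an "indexed" timestamp; treat missing values as 0.
        ts = work.get("indexed", {}).get("timestamp", 0)
        if doi and ts >= latest.get(doi, (0, None))[0]:
            latest[doi] = (ts, work)
    return {doi: work for doi, (ts, work) in latest.items()}

if __name__ == "__main__":
    # "works.jsonl" is a hypothetical one-work-per-line extract of the dump.
    with open("works.jsonl", "r", encoding="utf-8") as f:
        deduped = latest_per_doi(f)
    print(f"{len(deduped)} unique DOIs")
```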
Thanks for these details! Would it be possible to update the repo with links to the dumps on figshare, either in the README or in a separate file?

A "baseline" and "deltas" scheme for updates, similar to how MEDLINE/PubMed releases metadata, would be nice. Or, even better, an append-only log of changes. I haven't worked out the details, but I plan on looking into API queries of the style "updated in the past 24 hours" and recording blocks of updates in that fashion at some point. From my brief reading it looks like Crossref now offers weekly database snapshots as a premium (paid) service.

My use case is building out an archival (file-level) catalog of works, particularly "long tail" open access works without an existing preservation strategy, as part of my work at the Internet Archive. I use metadata from these dumps in a variety of contexts, like matching crawled PDFs to DOIs (by title/author fuzzy match), tracking "completeness", etc. I'll be sharing more of this work in the coming months, but some is available now at:
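For the "updated in the past 24 hours" polling idea, a rough sketch against the public Crossref REST API could look like the following. The `filter` parameter and cursor-based deep paging are documented API features, but whether `from-index-date` or `from-update-date` better matches the intended "updated" semantics, and the output handling shown here, are assumptions.

```python
import datetime
import json
import requests

def fetch_recent_updates(since_date, mailto="you@example.org"):
    """Yield works indexed on or after since_date, using cursor-based deep paging."""
    cursor = "*"
    while True:
        resp = requests.get(
            "https://api.crossref.org/works",
            params={
                "filter": f"from-index-date:{since_date}",
                "rows": 1000,
                "cursor": cursor,
                "mailto": mailto,  # polite-pool contact address (replace with your own)
            },
            timeout=60,
        )
        resp.raise_for_status()
        message = resp.json()["message"]
        if not message["items"]:
            break
        for work in message["items"]:
            yield work
        cursor = message["next-cursor"]

if __name__ == "__main__":
    since = (datetime.date.today() - datetime.timedelta(days=1)).isoformat()
    # Append each updated work to a local log file (hypothetical output format).
    with open("updates.jsonl", "a", encoding="utf-8") as out:
        for work in fetch_recent_updates(since):
            out.write(json.dumps(work) + "\n")
```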
All of the links are in the notebook, which is currently in a separate branch (probably not the best idea): https://elifesci.org/crossref-data-notebook

I could also mention that I did experiment with a parallel download, which fetches data by published date (or date range for much older works). It can be run in Google Dataflow via Apache Beam. But for some reason it didn't include all of the works; that is either because the Crossref facets aren't complete (the numbers don't add up, and I raised it with them) or because of the way I store the works. That is why it is parked for now. Maybe it's still worth looking at for your project.

The problem with a zip containing everything (as it is now) is that extracting from the zip can't be parallelised well (perhaps not the main issue). And if the data is not partitioned, it becomes slow to merge/sort it (e.g. to find duplicates or replace old records with new ones). I considered a few ways of partitioning the files, mainly by published date or DOI prefix (as mentioned above).
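As an illustration of the partition-by-prefix idea (not the parked Beam pipeline), a sketch like the following routes each work into a gzip file keyed by its DOI prefix, so partitions can later be merged or re-sorted independently. The input file and output naming are made-up conventions.

```python
import gzip
import json
import os

def partition_by_prefix(lines, out_dir="partitions"):
    """Append each work to a gzip file named after its DOI prefix."""
    os.makedirs(out_dir, exist_ok=True)
    handles = {}
    try:
        for line in lines:
            work = json.loads(line)
            prefix = work.get("DOI", "unknown/unknown").split("/", 1)[0]
            if prefix not in handles:
                # In practice you would cap the number of simultaneously open
                # handles, since there are many thousands of DOI prefixes.
                path = os.path.join(out_dir, prefix.replace(".", "_") + ".jsonl.gz")
                handles[prefix] = gzip.open(path, "at", encoding="utf-8")
            handles[prefix].write(line.rstrip("\n") + "\n")
    finally:
        for handle in handles.values():
            handle.close()

if __name__ == "__main__":
    # "works.jsonl" is a hypothetical one-work-per-line extract of the dump.
    with open("works.jsonl", "r", encoding="utf-8") as f:
        partition_by_prefix(f)
```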
Not sure if any of that is relevant to your use case if you are putting everything in a database. An append-only log should definitely work; I am assuming you'll then want to be able to find the last version of a particular work. It certainly seems better to keep the data up to date using the API. I was talking to someone from Crossref this week and got the impression that the pricing for the snapshots is likely going to be reviewed.
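On the "find the last version of a particular work" point, one hypothetical approach for an append-only log of line-delimited works is a single scan that records the byte offset of the newest entry per DOI, so individual works can be re-read without another full pass. The one-JSON-work-per-line log format is assumed here.

```python
import json

def build_latest_offset_index(log_path):
    """Map each DOI to the byte offset of its newest entry in the log."""
    index = {}
    offset = 0
    with open(log_path, "rb") as log:
        for raw in log:
            work = json.loads(raw)
            # Later entries overwrite earlier ones, so the index ends up
            # pointing at the most recent record for each DOI.
            index[work.get("DOI", "").lower()] = offset
            offset += len(raw)
    return index

def read_latest(log_path, index, doi):
    """Fetch the latest version of a single work without re-scanning the log."""
    with open(log_path, "rb") as log:
        log.seek(index[doi.lower()])
        return json.loads(log.readline())
```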
I've made the dumps and updated stats public now. (All links are in the notebook, which is now also linked from the README.)
Thanks @de-code! I ended up wanting to take advantage of de-duplication while retaining the exact same form factor (so my existing scripts would work), so I re-ran the greenelab script and posted the result to: https://archive.org/details/crossref_doi_dump_201809
Hi!
I just started a bulk Crossref API dump/scrape using the code at: https://github.com/greenelab/crossref
... but then I noticed you just made a recent commit to this repo, which might be better maintained at this point. My interest is in full API output (not just the derived citation graph), but de-duplication of works would be nice.
Do you have more recent dumps, or would you like to collaborate on generating them? My intent is to upload to archive.org, which can easily accommodate files up to 100GB or so without splitting/sharding.
I would be interested in experimenting with alternative compression algorithms (brotli, xz, zstd) if they would yield faster decompression without too much of a sacrifice in compression ratio... but I could recompress on my own.
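To compare those codecs on a sample of the dump, a quick benchmarking sketch (assuming the `xz`, `zstd`, and `brotli` command-line tools are installed, and using illustrative compression levels rather than tuned ones) could look like this:

```python
import shutil
import subprocess
import time

SAMPLE = "sample.jsonl"  # hypothetical slice of the dump to benchmark against

# (compress command, decompress command) per codec; flags are starting points only.
CODECS = {
    "xz": (["xz", "-9", "--keep", "--force", SAMPLE],
           ["xz", "-d", "--keep", "--force", SAMPLE + ".xz"]),
    "zstd": (["zstd", "-19", "--force", SAMPLE, "-o", SAMPLE + ".zst"],
             ["zstd", "-d", "--force", SAMPLE + ".zst", "-o", SAMPLE + ".out"]),
    "brotli": (["brotli", "-q", "11", "--force", "-o", SAMPLE + ".br", SAMPLE],
               ["brotli", "-d", "--force", "-o", SAMPLE + ".out", SAMPLE + ".br"]),
}

for name, (compress_cmd, decompress_cmd) in CODECS.items():
    if shutil.which(compress_cmd[0]) is None:
        print(f"{name}: not installed, skipping")
        continue
    subprocess.run(compress_cmd, check=True)
    start = time.perf_counter()
    subprocess.run(decompress_cmd, check=True)
    print(f"{name}: decompression took {time.perf_counter() - start:.1f}s")
```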