This repository has been archived by the owner on Mar 23, 2021. It is now read-only.

More recent dumps? #1

Closed
bnewbold opened this issue Sep 6, 2018 · 5 comments

Comments


bnewbold commented Sep 6, 2018

Hi!

I just started a bulk Crossref API dump/scrape using the code at: https://github.com/greenelab/crossref

... but then I noticed you just made a recent commit to this repo, which might be better maintained at this point. My interest is in full API output (not just the derived citation graph), but de-duplication of works would be nice.

Do you have more recent dumps, or would you like to collaborate on generating them? My intent is to upload to archive.org, which can easily accommodate files up to 100GB or so without splitting/sharding.

I would be interested in experimenting with alternative compression algorithms (brotli, xz, zstd) if they would yield faster decompression without too much of a sacrifice in compression ratio... but I could recompress on my own.
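
For what it's worth, a quick way to compare those trade-offs on a slice of the dump would be something like the sketch below (assuming the third-party `zstandard` and `brotli` packages, and a hypothetical `sample.jsonl` slice of the dump; just the shape of the experiment, not measured results):

```python
import lzma
import time

import brotli      # third-party: pip install brotli
import zstandard   # third-party: pip install zstandard

# Hypothetical representative slice of the dump to benchmark against.
data = open("sample.jsonl", "rb").read()

codecs = {
    "xz":     (lambda b: lzma.compress(b, preset=6), lzma.decompress),
    "brotli": (lambda b: brotli.compress(b, quality=6), brotli.decompress),
    "zstd":   (lambda b: zstandard.ZstdCompressor(level=6).compress(b),
               lambda b: zstandard.ZstdDecompressor().decompress(b)),
}

for name, (compress, decompress) in codecs.items():
    blob = compress(data)
    start = time.perf_counter()
    decompress(blob)
    elapsed = time.perf_counter() - start
    print(f"{name}: ratio={len(blob) / len(data):.3f}, decompress={elapsed:.2f}s")
```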


de-code commented Sep 6, 2018

Hi, I did indeed just download Crossref data.
I was going to share it once the post-processing is complete, as that is also a way of confirming that the data is complete (it has happened before that the download stopped early), although it does look complete.

The raw responses I have look the same as the previous dump: https://doi.org/10.6084/m9.figshare.6170414
(Figshare also allows for bigger files, and they increase your limit for public files if you ask them nicely)

You're probably more interested in the JSON of the works only and not the response body.
It should be using LZMA compression, which is xz if I am not mistaken.
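
If it is indeed xz, the works JSON can be streamed straight from the compressed file with the standard library, along these lines (a sketch assuming one JSON work per line; the file name is a placeholder and the actual layout may differ):

```python
import json
import lzma

# Stream works out of the xz/LZMA-compressed file without decompressing
# it to disk first. Assumes one JSON work per line.
with lzma.open("crossref-works.jsonl.xz", "rt", encoding="utf-8") as handle:
    for line in handle:
        work = json.loads(line)
        print(work.get("DOI"), work.get("title"))
        break  # just peek at the first record
```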

I am aware the data contains duplicates, and I am removing them from the post-processed data rather than the raw data, i.e. from the DOI-to-DOI links as well as the "summary" CSV that I create for key information about the individual works (which I then use to calculate further summary CSVs for the statistics used in the notebook).
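
The de-duplication itself is conceptually simple; a minimal sketch, assuming each record carries a DOI and an `indexed.date-time` field as in the Crossref works schema (worth verifying against the actual dump):

```python
def deduplicate(works):
    """Keep one record per DOI, preferring the most recently indexed one.

    Field names ('DOI', 'indexed' -> 'date-time') follow my reading of the
    Crossref works schema and should be checked against the dump itself.
    """
    latest = {}
    for work in works:
        doi = work.get("DOI", "").lower()
        stamp = work.get("indexed", {}).get("date-time", "")
        if doi and (doi not in latest or stamp > latest[doi][0]):
            latest[doi] = (stamp, work)
    return [work for _, work in latest.values()]
```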

It would be interesting to look at a more "efficient" way of storing and sharing the data that also allows for updates, in particular by partitioning the data, e.g. by published date or prefix (neither of which should change). But it may not be worth putting too much effort into it, as it doesn't seem unlikely that Crossref will eventually provide dumps for free.

I'd be interested to hear about your use case for the data dump.


bnewbold commented Sep 6, 2018

Thanks for these details!

Would it be possible to update the repo with links to dumps on figshare, either in the README or a separate file?

A "baseline" and "deltas" scheme for updates similar to how MEDLINE/PubMed releases metadata would be nice. Or, even better, an append-only-log of changes. I haven't worked out the details, but plan on looking in to using API queries of the style "updated in the past 24 hours" and recording blocks of updates in that fashion at some point.

From my brief reading it looks like Crossref now offers weekly database snapshots as a premium (paid) service.

My use case is building out an archival (file-level) catalog of works, particularly "long tail" open access works without an existing preservation strategy, as part of my work at the Internet Archive. I use metadata from these dumps in a variety of contexts, like matching crawled PDFs to DOI numbers (by title/author fuzzy match), tracking "completeness", etc. I'll be sharing more of this work in the coming months, but some is available now at:


de-code commented Sep 7, 2018

All of the links are in the notebook, which is currently in a separate branch (probably not the best idea): https://elifesci.org/crossref-data-notebook
(I could add a link to that in the README.)

I should also mention that I did experiment with a parallel download, which downloads the data by published date (or date period for much older works). It can be run in Google Dataflow via Apache Beam. But for some reason it didn't include all of the works; that is either because the Crossref facets aren't complete (the numbers don't add up, and I raised it with them) or because of the way I store the works. That is why it is parked for now. Maybe it's still valuable to look at for your project.
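
(Not the actual code of that experiment, but for illustration, a Beam pipeline of that shape could look roughly like the sketch below; the `from-pub-date`/`until-pub-date` filter names and cursor paging are assumptions to verify, and `apache-beam` and `requests` are third-party dependencies.)

```python
import json

import apache_beam as beam  # third-party: pip install apache-beam
import requests             # third-party: pip install requests

def fetch_day(day):
    """Yield all works with the given published date, as JSON lines.
    Filter names are assumptions and should be checked against the API docs."""
    cursor = "*"
    while True:
        resp = requests.get(
            "https://api.crossref.org/works",
            params={"filter": f"from-pub-date:{day},until-pub-date:{day}",
                    "rows": 1000, "cursor": cursor},
            timeout=60)
        resp.raise_for_status()
        message = resp.json()["message"]
        if not message.get("items"):
            return
        for work in message["items"]:
            yield json.dumps(work)
        cursor = message["next-cursor"]

days = ["2018-09-01", "2018-09-02"]  # in practice, generate the full date range

with beam.Pipeline() as pipeline:  # add Dataflow options to run on Google Cloud
    (pipeline
     | beam.Create(days)        # one element per published date
     | beam.FlatMap(fetch_day)  # fetched in parallel across workers
     | beam.io.WriteToText("works", file_name_suffix=".jsonl"))
```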

The problem with a zip containing everything (as it is now) is that extracting from the zip can't be parallelised well (perhaps not the main issue). And if the data is not partitioned, it will become slow to merge/sort the data (e.g. to find duplicates or replace old records with new ones).

I considered partitioning by the following (mainly for files; a rough sketch of deriving either key follows the list):

  • prefix:
    • Advantages:
      • Definitely not going to change (prefix is part of the DOI)
      • Can easily filter by publisher
    • Disadvantages:
      • There are a lot of prefixes
      • Prefixes will receive updates all the time
      • API did not seem to allow filtering by prefix? (not sure whether that is actually true)
  • published date:
    • Advantages:
      • Old published dates should rarely receive updates
      • There are not that many
      • Can easily just look at the works from last year
    • Disadvantage:
      • Relies on the published date not being updated (Crossref confirmed it shouldn't)
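
To make the comparison concrete, deriving either partition key from a work record could look like this (a rough sketch; the date field names follow my reading of the Crossref works schema and should be verified against the dump):

```python
def partition_key(work, scheme="published-date"):
    """Return the partition (file) a work would be written to.

    'prefix' uses the DOI registrant prefix; 'published-date' uses the
    year-month of the published date-parts, if present.
    """
    if scheme == "prefix":
        return work.get("DOI", "unknown").split("/", 1)[0]  # e.g. "10.7554"
    published = work.get("published-print") or work.get("published-online") or {}
    parts = published.get("date-parts", [[None]])[0]
    if not parts or parts[0] is None:
        return "unknown"
    year = parts[0]
    month = parts[1] if len(parts) > 1 else 1
    return f"{year}-{month:02d}"                            # e.g. "2018-09"
```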

Not sure if any of that is relevant to your use case if you are putting everything in a database. An append-only log should definitely work; I am assuming you'll then likely want to be able to find the latest version of a particular work. It certainly seems better to keep the data up to date using the API.

I was talking to someone from Crossref this week and got the impression that the pricing for the snapshots is likely going to be reviewed.


de-code commented Sep 12, 2018

I've made the dumps and updated stats public now. (All links are in the notebook, which is now also linked from the README.)

@bnewbold

Thanks @de-code!

I ended up wanting de-duplication while retaining the exact same form factor (so my existing scripts would work), so I re-ran the greenelab script and posted the result to: https://archive.org/details/crossref_doi_dump_201809

de-code closed this as completed Aug 30, 2019