This repository has been archived by the owner on Mar 23, 2021. It is now read-only.

More recent dumps? #1

Closed
bnewbold opened this issue Sep 6, 2018 · 5 comments

Comments


bnewbold commented Sep 6, 2018

Hi!

I just started a bulk Crossref API dump/scrape using the code at: https://github.com/greenelab/crossref

... but then I noticed you just made a recent commit to this repo, which might be better maintained at this point. My interest is in full API output (not just the derived citation graph), but de-duplication of works would be nice.

Do you have more recent dumps, or would you like to collaborate on generating them? My intent is to upload to archive.org, which can easily accommodate files up to 100GB or so without splitting/sharding.

I would be interested in experimenting with alternative compression algorithms (brotli, xz, zstd) if they would yield faster decompression without too much of a sacrifice in compression ratio... but I could recompress on my own.
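
For what it's worth, a quick way to compare those trade-offs on a slice of the dump would be something like the sketch below (assuming the third-party `zstandard` and `brotli` packages, and a hypothetical `sample.jsonl` slice of the dump; just the shape of the experiment, not measured results):

```python
import lzma
import time

import brotli      # third-party: pip install brotli
import zstandard   # third-party: pip install zstandard

# Hypothetical representative slice of the dump to benchmark against.
data = open("sample.jsonl", "rb").read()

codecs = {
    "xz":     (lambda b: lzma.compress(b, preset=6), lzma.decompress),
    "brotli": (lambda b: brotli.compress(b, quality=6), brotli.decompress),
    "zstd":   (lambda b: zstandard.ZstdCompressor(level=6).compress(b),
               lambda b: zstandard.ZstdDecompressor().decompress(b)),
}

for name, (compress, decompress) in codecs.items():
    blob = compress(data)
    start = time.perf_counter()
    decompress(blob)
    elapsed = time.perf_counter() - start
    print(f"{name}: ratio={len(blob) / len(data):.3f}, decompress={elapsed:.2f}s")
```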


de-code commented Sep 6, 2018

Hi, I did indeed just download Crossref data.
I was going to share it once the post-processing is complete, as that is also a way of confirming that the data is complete (it has happened before that the download stopped early), although it does look complete.

The raw responses I have look the same as the previous dump: https://doi.org/10.6084/m9.figshare.6170414
(Figshare also allows for bigger files, and they increase your limit for public files if you ask them nicely)

You're probably more interested in the JSON of the works only and not the response body.
It should be using LZMA compression, which is xz if I am not mistaken.
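
If it is indeed xz, the works JSON can be streamed straight from the compressed file with the standard library, along these lines (a sketch assuming one JSON work per line; the file name is a placeholder and the actual layout may differ):

```python
import json
import lzma

# Stream works out of the xz/LZMA-compressed file without decompressing
# it to disk first. Assumes one JSON work per line.
with lzma.open("crossref-works.jsonl.xz", "rt", encoding="utf-8") as handle:
    for line in handle:
        work = json.loads(line)
        print(work.get("DOI"), work.get("title"))
        break  # just peek at the first record
```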

I am aware the data contains duplicates, and I am removing them from the post-processed data rather than the raw data, i.e. from the DOI-to-DOI links as well as the "summary" CSV that I create for key information about the individual works (which I then use to calculate further summary CSVs for the statistics used in the notebook).
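
The de-duplication itself is conceptually simple; a minimal sketch, assuming each record carries a DOI and an `indexed.date-time` field as in the Crossref works schema (worth verifying against the actual dump):

```python
def deduplicate(works):
    """Keep one record per DOI, preferring the most recently indexed one.

    Field names ('DOI', 'indexed' -> 'date-time') follow my reading of the
    Crossref works schema and should be checked against the dump itself.
    """
    latest = {}
    for work in works:
        doi = work.get("DOI", "").lower()
        stamp = work.get("indexed", {}).get("date-time", "")
        if doi and (doi not in latest or stamp > latest[doi][0]):
            latest[doi] = (stamp, work)
    return [work for _, work in latest.values()]
```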

It would be interesting to look at a more "efficient" way of storing and sharing the data that also allows for updates, in particular by partitioning the data, e.g. by published date or prefix (neither of which should change). But it may not be worth putting too much effort into it, as it doesn't seem unlikely that Crossref will eventually provide dumps for free.

I'd be interested to hear about your use case for the data dump.


bnewbold commented Sep 6, 2018

Thanks for these details!

Would it be possible to update the repo with links to dumps on figshare, either in the README or a separate file?

A "baseline" and "deltas" scheme for updates similar to how MEDLINE/PubMed releases metadata would be nice. Or, even better, an append-only-log of changes. I haven't worked out the details, but plan on looking in to using API queries of the style "updated in the past 24 hours" and recording blocks of updates in that fashion at some point.

From my brief reading it looks like Crossref now offers weekly database snapshots as a premium (paid) service.

My use case is building out an archival (file-level) catalog of works, particularly "long tail" open access works without an existing preservation strategy, as part of my work at the Internet Archive. I use metadata from these dumps in a variety of contexts, like matching crawled PDFs to DOI numbers (by title/author fuzzy match), tracking "completeness", etc. I'll be sharing more of this work in the coming months, but some is available now at:


de-code commented Sep 7, 2018

All of the links are in the notebook, which is currently in a separate branch (probably not the best idea): https://elifesci.org/crossref-data-notebook
(I could add a link to that in the README.)

I should also mention that I did experiment with a parallel download, which downloads the data by published date (or date period for much older works). It can be run in Google Dataflow via Apache Beam. But for some reason it didn't include all of the works; that is either because the Crossref facets aren't complete (the numbers don't add up, and I raised it with them) or because of the way I store the works. That is why it is parked for now. Maybe it's still valuable to look at for your project.
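
(Not the actual code of that experiment, but for illustration, a Beam pipeline of that shape could look roughly like the sketch below; the `from-pub-date`/`until-pub-date` filter names and cursor paging are assumptions to verify, and `apache-beam` and `requests` are third-party dependencies.)

```python
import json

import apache_beam as beam  # third-party: pip install apache-beam
import requests             # third-party: pip install requests

def fetch_day(day):
    """Yield all works with the given published date, as JSON lines.
    Filter names are assumptions and should be checked against the API docs."""
    cursor = "*"
    while True:
        resp = requests.get(
            "https://api.crossref.org/works",
            params={"filter": f"from-pub-date:{day},until-pub-date:{day}",
                    "rows": 1000, "cursor": cursor},
            timeout=60)
        resp.raise_for_status()
        message = resp.json()["message"]
        if not message.get("items"):
            return
        for work in message["items"]:
            yield json.dumps(work)
        cursor = message["next-cursor"]

days = ["2018-09-01", "2018-09-02"]  # in practice, generate the full date range

with beam.Pipeline() as pipeline:  # add Dataflow options to run on Google Cloud
    (pipeline
     | beam.Create(days)        # one element per published date
     | beam.FlatMap(fetch_day)  # fetched in parallel across workers
     | beam.io.WriteToText("works", file_name_suffix=".jsonl"))
```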

The problem with a zip containing everything (as it is now) is that extracting from the zip can't be parallelised well (perhaps not the main issue). And if the data is not partitioned, it will become slow to merge/sort the data (e.g. to find duplicates or replace old records with new ones).

I considered partitioning by the following (mainly for files; a rough sketch of deriving either key follows the list):

  • prefix:
    • Advantages:
      • Definitely not going to change (prefix is part of the DOI)
      • Can easily filter by publisher
    • Disadvantages:
      • There are a lot of prefixes
      • Prefixes will receive updates all the time
      • API did not seem to allow filtering by prefix? (not sure whether that is actually true)
  • published date:
    • Advantages:
      • Old published dates should rarely receive updates
      • There are not that many
      • Can easily just look at the works from last year
    • Disadvantage:
      • Relies on the published date not being updated (Crossref confirmed it shouldn't)
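
To make the comparison concrete, deriving either partition key from a work record could look like this (a rough sketch; the date field names follow my reading of the Crossref works schema and should be verified against the dump):

```python
def partition_key(work, scheme="published-date"):
    """Return the partition (file) a work would be written to.

    'prefix' uses the DOI registrant prefix; 'published-date' uses the
    year-month of the published date-parts, if present.
    """
    if scheme == "prefix":
        return work.get("DOI", "unknown").split("/", 1)[0]  # e.g. "10.7554"
    published = work.get("published-print") or work.get("published-online") or {}
    parts = published.get("date-parts", [[None]])[0]
    if not parts or parts[0] is None:
        return "unknown"
    year = parts[0]
    month = parts[1] if len(parts) > 1 else 1
    return f"{year}-{month:02d}"                            # e.g. "2018-09"
```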

Not sure if any of that is relevant to your use case if you are putting everything in a database. An append-only log should definitely work; I am assuming you'll then likely want to be able to find the latest version of a particular work. It certainly seems better to keep the data up to date using the API.

I was talking to someone from Crossref this week and got the impression that the pricing for the snapshots is likely going to be reviewed.


de-code commented Sep 12, 2018

I've made the dumps and updated stats public now. (All links are in the notebook, which is now also linked from the README.)

@bnewbold

Thanks @de-code!

I ended up wanting de-duplication while retaining the exact same form factor (so my existing scripts would work), so I re-ran the greenelab script and posted the result to: https://archive.org/details/crossref_doi_dump_201809

de-code closed this as completed Aug 30, 2019