Future updates of bulk Crossref metadata corpus #5
Comments
@bnewbold thanks for your interest!
Indeed! I'm hoping Crossref starts releasing their database dumps, so we don't have to keep going through the laborious steps of recreating them with millions of API calls 😸. However, until then, it'd be nice to update this repo and data release. Shortly after we did our extraction in April 2017, Crossref's API began returning citation information. This is really useful but makes the API responses much larger. It also opens up the possibility that we could produce a DOI-to-DOI citation table, which I'm sure would appeal to many users.

Anyways, I wasn't planning on updating these records until I needed newer data for my research (which could be never). Rerunning the API queries will probably take several weeks, and you may run into issues. You'll need an internet connection with good uptime! @bnewbold, if this is something you're interested in, we'd love it if you could open a PR with the updates. We'd also love to add your revised DB dump to the figshare. You'd become an author on the figshare dataset and potentially other work we do in the future that makes use of this data. What do you think? I'd also likely be interested in extracting the citation graph from the enlarged dump.
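For reference, the heavy lifting here is the Crossref REST API's deep-paging cursor. A minimal sketch of that loop in Python (the endpoint and parameters are Crossref's documented API; the function and the mailto address are illustrative, not this repository's actual script):

```python
import requests

def crossref_works(rows=1000, mailto="you@example.com"):
    """Yield work records from the Crossref REST API via deep-paging cursors."""
    url = "https://api.crossref.org/works"
    cursor = "*"  # "*" asks the API to open a fresh cursor
    while True:
        params = {"rows": rows, "cursor": cursor, "mailto": mailto}
        response = requests.get(url, params=params, timeout=60)
        response.raise_for_status()
        message = response.json()["message"]
        if not message["items"]:
            break  # an empty page signals the end of the result set
        yield from message["items"]
        cursor = message["next-cursor"]  # server-side cursor for the next page

# Example: peek at the first few DOIs
for i, work in enumerate(crossref_works()):
    print(work["DOI"])
    if i >= 4:
        break
```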
I have
Nice! IIRC, the order of the works is somewhat chronological. The newer works tend to have more metadata, so things slow down and become more error-prone closer to the end. Happy to help if anything breaks.
Nice! I believe these will be the I4OC citations (more accurately "references"), which should be much more prevalent than the OpenCitations corpus we processed in
The script is about 80% complete. It halted last week when the local hard disk ran out of space (from a bug in an unrelated script), but has been restarted with the most recent cursor position just now.
I ran into two problems: the other script misbehaved again (disk filled up), and it seemed like the dump hadn't continued where it had left off when I restarted back on Jan 5th (even though I specified a cursor). I'm not sure if the cursor is local (mongodb) or remote (crossref API), but the mongo container had been restarted and it had been more than a few days between failure and restart, either of which could have caused problems.
@bnewbold thanks for the update. My understanding of the cursor is that it's remote. I'm not sure how long cursors are retained... the cursor could have been retired after some days of inactivity. Ideally, specifying an invalid cursor would trigger an error and not proceed silently.
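Since the cursor apparently lives on the Crossref side and can go stale without an explicit error, one defensive pattern would be to persist the latest cursor to disk after each page and sanity-check the first page returned on resume. A rough sketch, with a made-up state file name:

```python
import json
from pathlib import Path

import requests

API_URL = "https://api.crossref.org/works"
STATE_FILE = Path("cursor_state.json")  # hypothetical location for the saved cursor

def fetch_page(cursor, rows=1000):
    """Fetch one page of works for the given deep-paging cursor."""
    response = requests.get(API_URL, params={"cursor": cursor, "rows": rows}, timeout=60)
    response.raise_for_status()
    return response.json()["message"]

# Resume from the saved cursor if one exists, otherwise start fresh with "*".
cursor = json.loads(STATE_FILE.read_text())["cursor"] if STATE_FILE.exists() else "*"
page = fetch_page(cursor)

if cursor != "*" and not page["items"]:
    # An immediately empty page on resume may mean the remote cursor expired,
    # not that the traversal is genuinely complete.
    raise RuntimeError("Saved cursor returned no items; it may have expired.")

# Persist the newest cursor so a crash or restart can pick up from here.
STATE_FILE.write_text(json.dumps({"cursor": page["next-cursor"]}))
```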
Nice! Nearing 100 million.
It'll be nice to have metadata through all of 2017. There may be a few articles published in 2017 that haven't yet been deposited in Crossref, but I hope not too many.
The script completed successfully yesterday (2018-01-21) after about 11 days:

94035712/94035712 [270:33:11<00:00, 32.70it/s]
Finished queries with 93,585,242 works in db

I'm not sure what the discrepancy is between 93,585,242 and 94,035,712; do some works get skipped intentionally? I'm dumping to
What immediately comes to mind is whether Crossref had multiple records for the same DOI. That could make the query number larger than the MongoDB number (see line 37 in 768a49b).
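One quick check of that hypothesis would be to look for repeated DOIs in the Mongo collection itself: if the insertion step de-duplicates on DOI, the aggregation below should come back empty, and the gap would then be explained by duplicates in the API responses rather than in the database. A sketch assuming pymongo and guessed database/collection/field names:

```python
from pymongo import MongoClient

# Database, collection, and field names here are guesses for illustration.
client = MongoClient("localhost", 27017)
works = client["crossref"]["works"]

# Count how many DOIs appear in more than one document.
pipeline = [
    {"$group": {"_id": "$DOI", "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": 1}}},
    {"$count": "duplicated_dois"},
]
result = list(works.aggregate(pipeline, allowDiskUse=True))
print(result or "no duplicate DOIs in the collection")
```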
Do you still have the log? I wonder if we should preserve this as well? It could probably help us diagnose the discrepancy.
I do have the complete log (including my first failed attempt). Skimming through it, it doesn't look like it will answer this question, but I'll include it when I upload. The mongoexport dump is still running, about 3/4 complete now, 22 hours in. The new corpus is significantly larger, presumably because of citation and maybe other new metadata being included. I estimate it will be 250 GB of uncompressed JSON, or about 25 GB compressed (xz, default settings).
Uploaded here: https://archive.org/download/crossref_doi_dump_201801/crossref-works.2018-01-21.json.xz

File is 30980612708 bytes (~29 GB); sha256 is

Logs are uploaded to the same item, but might take a few minutes to appear (while the main file is still being hashed and replicated). Running the
My interest was in getting the
I'll hold off on tearing down the mongo database for a few days in case it ends up being useful. Two other infrastructure notes I had from setting up this run:
I don't think
Do you think you could open a PR with at least the update to:
I'd like for you to be in the commit history. If you get the notebooks to run, great. Otherwise I can try to do it by importing the dump.
Here's what I get trying to use the above jupyter line:
Ah, I've hit that annoying bug as well in anaconda/nb_conda_kernels#34 (comment). If you add

Also, I just came across https://github.com/elifesciences/datacapsule-crossref by @de-code, which seems to have also downloaded the works data from Crossref.
Yes, I've just updated the download recently. I will try to share the dump, but the whole works dump is about 32 GB; I'm looking for an easy way to get that into Figshare (from a headless server). (I also have just the citation links, which are a more manageable <3 GB.)
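For anyone who only needs the DOI-to-DOI pairs, they can in principle also be pulled out of the full works dump by walking each record's reference list. A rough sketch, assuming the dump is mongoexport-style newline-delimited JSON compressed with xz; the field names follow the Crossref schema, but the paths and the script itself are illustrative:

```python
import csv
import json
import lzma

# Paths are placeholders; the dump is assumed to be newline-delimited JSON,
# one work per line, compressed with xz.
DUMP_PATH = "crossref-works.2018-01-21.json.xz"
OUT_PATH = "citation-links.tsv"

with lzma.open(DUMP_PATH, mode="rt", encoding="utf-8") as dump, \
        open(OUT_PATH, "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["citing_doi", "cited_doi"])
    for line in dump:
        work = json.loads(line)
        citing = work.get("DOI")
        # Each entry in "reference" may or may not resolve to a DOI.
        for reference in work.get("reference", []):
            cited = reference.get("DOI")
            if citing and cited:
                writer.writerow([citing.lower(), cited.lower()])
```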
Data downloaded January 2018 now available in Figshare:

And just citation links:
There is also an open issue / request for Crossref to provide something similar: CrossRef/rest-api-doc#271
This file, along with logs from its creation, is available at https://archive.org/download/crossref_doi_dump_201801

Refs #5
Reference recent Crossref dumps discussed in #5 Refs CrossRef/rest-api-doc#271
Reference recent Crossref dumps discussed in #5. Closes #5 Refs CrossRef/rest-api-doc#271
In case anybody is interested, I've started another dump using the exact same code path today. Not sure if I'll continue updating dumps in the future, but I wanted a fresher one and might as well share. I cross-posted at elifesciences/datacapsule-crossref#1 as well. Notes since last time:
Nice!
Hmm. I thought this repository should be replacing duplicate DOI entries in the Mongo database. In other words, the Mongo DB should only contain the most recently added metadata for a DOI (see lines 33 to 37 in 1dc4171).
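For context, the de-duplicating behavior in question is an upsert keyed on the DOI, roughly like the sketch below (a paraphrase for illustration, with guessed database/collection/field names, not the exact code at those lines):

```python
from pymongo import MongoClient

# Illustrative names; the actual script's database/collection/field names may differ.
client = MongoClient("localhost", 27017)
works = client["crossref"]["works"]
works.create_index("DOI", unique=True)

def upsert_work(work):
    """Insert or replace a work so the collection keeps one document per DOI."""
    works.replace_one({"DOI": work["DOI"]}, work, upsert=True)
```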
It is unfortunate that the iterative queries return duplicate DOIs (I thought the cursor was supposed to prevent that; I hope other DOIs aren't missing). However,
@dhimmel I hadn't noticed this behavior (DOI upsert) of these scripts (which I haven't read, just run blindly). My most recent dump completed and is available here: https://archive.org/download/crossref_doi_dump_201809

SHA256 available here (feel free to PR/merge): bnewbold@9f99032
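To double-check a download of this size against the published SHA256 without loading ~30 GB into memory, a streaming hash along these lines works (the file name is a placeholder):

```python
import hashlib

def sha256sum(path, chunk_size=1 << 20):
    """Compute the SHA256 of a file by streaming it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare the result against the checksum published alongside the dump.
print(sha256sum("crossref-works.json.xz"))  # placeholder file name
```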
Merges #11. Created by Bryan Newbold (@bnewbold). Queries initiated on 2018-09-05. Refs #5 (comment). Full dataset and logs online at https://archive.org/download/crossref_doi_dump_201809
Repo updated with Newbold's September 2018 dump
Awesome, added checksum in #11 / 48a8589, updated README in cc79bd0, and tweeted:
Looks like the queries took 16 days (2018-09-05 to 2018-09-20). File size is 33.2 GB, up from 28.9 GB for the January 2018 release. @bnewbold I have a slight preference that, if you make another update in the future, you open a new issue.
Thanks @bnewbold for the effort to upload newer versions of the dataset to the Internet Archive pages.

"Official" Crossref dump 2020-04: https://archive.org/details/crossref-doi-metadata-20200408
Possible to get a single file?
In April 2017 @dhimmel uploaded a bulk snapshot of Crossref metadata to figshare (where it was assigned DOI 10.6084/m9.figshare.4816720.v1). While this metadata can be scraped from the Crossref API by anybody (eg, using the scripts in this repository), I found it really helpful to grab in bulk form.

I'm curious whether this dump could be updated on an annual or quarterly basis. I don't have a particular need for the data to be versioned (eg, assigned sequential .v2, .v3 DOIs at figshare), but that would probably help with discovery for other folks and generally be a best practice. If nobody has time to do such an update I will probably run the scripts from this repository and push to archive.org at: https://archive.org/details/ia_biblio_metadata.