Compressed data dumps for works #271

Closed
several27 opened this issue Aug 23, 2017 · 33 comments

@several27

Hi, I was trying to get all of the works using the works API endpoint, paging with the cursor. However, I noticed it would take weeks at the current response time with the rows parameter set to 1000. Is there a way to speed it up, or do you perhaps provide data dumps for cases when someone needs all of the works metadata?

Thank you :)
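For reference, a minimal sketch of the cursor-based deep paging described above; the output filename and the lack of retry handling are simplifications, not the exact script used:

```python
import json
import requests

BASE_URL = "https://api.crossref.org/works"  # Crossref works endpoint
ROWS = 1000  # page size mentioned above

def dump_all_works(out_path="works.jsonl"):
    """Page through /works with a deep cursor and append each record as a JSON line."""
    cursor = "*"  # '*' starts a new deep-paging cursor
    with open(out_path, "w", encoding="utf-8") as out:
        while True:
            resp = requests.get(BASE_URL, params={"rows": ROWS, "cursor": cursor})
            resp.raise_for_status()
            message = resp.json()["message"]
            items = message.get("items", [])
            if not items:
                break  # an empty page means the cursor is exhausted
            for item in items:
                out.write(json.dumps(item) + "\n")
            cursor = message["next-cursor"]  # token for the next page

if __name__ == "__main__":
    dump_all_works()
```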

@moqri

moqri commented Aug 23, 2017

Currently it takes 70 hours to get all 90 million records with a good internet connection, but I totally agree that some kind of data dump would help both the users and the server.

@weirdf0x

I would be interested in a full dump too, even if it were only updated once a month. Once a week would be just perfect, but there's no reason to get greedy :)

You could compress it heavily and/or host it somewhere else to reduce the bandwidth.

Reasoning for updates: sometimes you add new fields or fix something across the whole dataset and it needs to be re-indexed, which changes the only reliable date field for updates and results in a full download.

@several27
Author

OK, so it took me about 90 hours to get all the "works". That's about 170GB of raw JSON data. I used a cheap AWS instance located in the eastern US. If anyone's interested: a very short gist to fetch it efficiently; I can also give "requester pays" type access to an S3 bucket with the zipped dataset, but can't promise any updates at the moment.

@moqri

moqri commented Sep 1, 2017

Please add some wait time between failed attempts so you don't overload the server:
https://github.com/CrossRef/rest-api-doc#rate-limits
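For example, a rough sketch of retrying failed requests with an increasing wait; the delay values are arbitrary and not taken from the rate-limit docs:

```python
import time
import requests

def get_with_backoff(url, params=None, max_retries=5):
    """Retry a failed request, sleeping longer after each failure."""
    delay = 5  # seconds; illustrative starting value
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, params=params, timeout=60)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # back off exponentially between failed attempts
```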

@several27 several27 changed the title Slow response time for deeper cursors for works Compressed data dumps for works Sep 11, 2017
@moqri

moqri commented Sep 19, 2017

Regarding the current API outage/slowdown, I think providing a dump would help your API traffic too.

@renaud

renaud commented Oct 2, 2017

@several27 I'm interested in your dump on S3, could you give me access to it? Thanks, Renaud at apache dot org

@kjw
Contributor

kjw commented Oct 4, 2017

This is a feature we plan on providing, but I don't want to do "one-off" dumps; if we are to do this, I want an obvious and workable way for people to keep their copies up to date.

@kjw kjw added the enhancement label Oct 4, 2017
@renaud

renaud commented Oct 4, 2017

@kjw agreed. One way that e.g. PubMed solves this is by providing daily update files that can be downloaded separately. See this for more info: ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/README.txt

Another, more pragmatic, option would be for Crossref to release dumps for each year, with consumers getting fresh data from the API in the meantime.

@several27
Author

several27 commented Oct 4, 2017

@renaud It's a little bit outdated (1st Sept), but for everyone interested in a full works data dump, here it is: s3://researchably-shared/crossref/2017-09-01.zip. At least until the Crossref folks provide one, feel free to use it. It's only 30GB compressed, and it's a requester-pays file, so the download cost is paid by the person downloading (this is the same strategy arXiv uses - https://arxiv.org/help/bulk_data_s3). As arXiv mentions, the easiest way to download it is with s3cmd, e.g. s3cmd get --requester-pays s3://researchably-shared/crossref/2017-09-01.zip.

Later in the month we'll probably update it; if Crossref still doesn't have their own data dumps by then, we can share the newer version as well (if they're fine with that) :)
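If you prefer boto3 to s3cmd, a requester-pays download of the file above looks roughly like this (the local filename is arbitrary):

```python
import boto3

# Requester-pays download of the shared dump; your AWS account is billed for the transfer.
s3 = boto3.client("s3")
s3.download_file(
    Bucket="researchably-shared",
    Key="crossref/2017-09-01.zip",
    Filename="2017-09-01.zip",
    ExtraArgs={"RequestPayer": "requester"},
)
```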

@renaud

renaud commented Oct 4, 2017

thanks a lot @several27 !

@moqri

moqri commented Oct 7, 2017

Monthly dumps would be great!

@mjmehta15

@several27: I downloaded the works data dump from S3, which has 68,907,000 records, whereas the Crossref dashboard says it has 91,869,582 records. I understand that this zip file on S3 only has data up to Sept. 1st, but even after subtracting the numbers for Sep. and Oct., this dump is missing approximately 22,577,780 records. Can you please check and tell me how many records you have in your original 170GB of data, or whether something is wrong at my end?

@several27
Author

@mjmehta15 Nice catch! I have downloaded this file myself and you're right, there's something wrong, especially since the last line is:

{"status":"error","message-type":"exception","message-version":"1.0.0","message":{"name":"class java.lang.IllegalArgumentException","description":"java.lang.IllegalArgumentException: Vector arg to map conj must be a pair","message":"Vector arg to map conj must be a pair","stack":["clojure.lang.ATransientMap.conj(ATransientMap.java:37)","clojure.lang.ATransientMap.conj(ATransientMap.java:17)","clojure.core$conj_BANG_.invokeStatic(core.clj:3257)","clojure.core$conj_BANG_.invoke(core.clj:3249)","clojure.core.protocols$naive_seq_reduce.invokeStatic(protocols.clj:62)","clojure.core.protocols$interface_or_naive_red....

I doubt it was a compression error; maybe something on the Crossref side? Anyway, I'm trying to fetch the rest now, as the last cursor still seems to be working. Unfortunately, the API seems to be very slow at the moment (15s per request), so I'm not sure how much time it will take.
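In case it helps anyone sanity-check their copy, a rough way to count items and spot error responses, assuming the dump is stored as one JSON API response per line (adjust the reading logic if your file is structured differently):

```python
import json

def check_dump(path):
    """Count items and flag error responses in a line-delimited dump of API responses."""
    total_items = 0
    errors = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                errors += 1  # truncated or garbled line
                continue
            if record.get("status") == "error":
                errors += 1  # e.g. the IllegalArgumentException response quoted above
            else:
                total_items += len(record.get("message", {}).get("items", []))
    print(f"items: {total_items}, bad responses/lines: {errors}")
```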

@renaud

renaud commented Oct 9, 2017

Thanks @mjmehta15 and @several27

Regarding slow requests: are you using HTTPS AND specifying an appropriate contact? (https://github.com/CrossRef/rest-api-doc#good-manners--more-reliable-service) says:

As of September 18th 2017 any API queries that use HTTPS and have appropriate contact information will be directed to a special pool of API machines that are reserved for polite users.

Also: you might want to add a try/catch clause in your main loop...
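Concretely, something along these lines: a sketch that sends contact details in the User-Agent (the mailto address is a placeholder) and wraps each request in a try/except so one failure doesn't kill the loop:

```python
import requests

# A mailto in the User-Agent routes requests to the "polite" pool; the address below is a placeholder.
HEADERS = {"User-Agent": "works-downloader/0.1 (mailto:you@example.org)"}

def fetch_page(cursor, rows=1000):
    """Fetch one page over HTTPS with contact info; return None on failure instead of crashing the loop."""
    try:
        resp = requests.get(
            "https://api.crossref.org/works",
            params={"rows": rows, "cursor": cursor},
            headers=HEADERS,
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["message"]
    except requests.RequestException:
        return None  # caller can retry with the same cursor
```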

@mjmehta15

Thanks for replying, @several27 and @renaud. I am also trying to get the data using all the good manners mentioned in the link shared above, yet I'm still not able to get the complete data. How are things at your end, @several27? Were you able to get the rest of the data?

@several27
Author

@mjmehta15 I should have the rest by the end of today and will post the info here.
@renaud Thanks! That sped it up 3x :)

@several27
Author

several27 commented Oct 12, 2017

@mjmehta15 This dump includes the first one plus the rest that I fetched from the last valid token: s3://researchably-shared/crossref/2017-10-11.zip. I have not looked into its contents, as I don't have enough time right now, but I know there are more than a few empty requests, and the total number of items might be more than 100 million, so there are probably some repetitions. Feel free to let me know how it works for you :)

@rnnrght

rnnrght commented Dec 14, 2017

Hi @several27: Thanks for providing this. My colleague was trying to download the dump but seems to be getting a 400 error. His credentials work for other requester-pays downloads (like arXiv). Are these dumps still available? Is there a problem on your end?

Thanks for looking into this.

@moqri

moqri commented Dec 14, 2017

@rnnrght , I just put a copy of the data here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/7JIWXI

This data contains the following fields:
doi, year, citation count, issnp, issne, journal, publisher, licence

Please let me know if you need something else and I'll see if I can add it.
Code: https://github.com/moqri/Open-Science-Database/blob/master/notebooks/1_get_articles.ipynb

@jenniferlin15 jenniferlin15 self-assigned this Dec 22, 2017
@moqri

moqri commented Dec 27, 2017

The data from 1900 to 2010 is now available at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/7JIWXI

@de-code

de-code commented Feb 5, 2018

If anyone is interested, data downloaded January 2018 now available in Figshare:
https://doi.org/10.6084/m9.figshare.5845554

And just citation links:
https://doi.org/10.6084/m9.figshare.5849916

dhimmel added a commit to greenelab/crossref that referenced this issue Feb 6, 2018
Reference recent Crossref dumps discussed in #5

Refs CrossRef/rest-api-doc#271
dhimmel added a commit to greenelab/crossref that referenced this issue Feb 6, 2018
Reference recent Crossref dumps discussed in #5.
Closes #5

Refs CrossRef/rest-api-doc#271
dhimmel added a commit to greenelab/crossref that referenced this issue Feb 7, 2018
Reference recent Crossref dumps discussed in #5.
Closes #5

Refs CrossRef/rest-api-doc#271
@rufuspollock

I just wondered how come this got closed; the various dumps out there will obviously go out of date. I think what was requested here was a regularly updated, authoritative dump from Crossref itself (?).

@jenniferlin15
Contributor

This conversation went many places. Thanks for following up on this particular thread. We rolled out a "snapshots" feature where full data dumps, updated monthly, are now part of the Metadata Plus service. For more information, please see: https://www.crossref.org/services/metadata-delivery/plus-service/

@rufuspollock

@jenniferlin15 thanks for the rapid reply. To clarify: is the Plus service a paid service? If so, is there any other source of bulk data from Crossref that is freely available?

@jenniferlin15
Contributor

Yes, Plus is a paid service that builds on top of the public API to help Crossref continue to make the data freely available to the community over the long run.

@rufuspollock

@jenniferlin15 thanks, and there is no other bulk source of the data than the paid service?

@jenniferlin15
Contributor

Yes, at the moment that is correct.

@jinamshah

@de-code does your download give separate JSON files for all the pages? And if so, can they be stitched into one?

@de-code

de-code commented Jul 23, 2018

@jinamshah yes, a separate JSON file per page. That was mainly done to avoid any questions about where the data came from (as it's the unmodified response from the Crossref API). However, it might be more convenient to create a JSON Lines file with all of the responses. In my case I created CSV files for further processing instead. (If you do that yourself you would also need to handle duplicates due to #356.)

In any case, links to new dumps from April 2018 are available in my crossref-data-notebook.
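For anyone who does want to stitch the per-page files into a single JSON Lines file, a rough sketch that also drops duplicate DOIs; the directory layout is an assumption, and keeping all DOIs in memory is a simplification that needs a lot of RAM at this scale:

```python
import glob
import json

def stitch_pages(pages_glob="pages/*.json", out_path="works.jsonl"):
    """Merge per-page API responses into one JSON Lines file, skipping duplicate DOIs."""
    seen = set()
    with open(out_path, "w", encoding="utf-8") as out:
        for page_path in sorted(glob.glob(pages_glob)):
            with open(page_path, encoding="utf-8") as fh:
                page = json.load(fh)
            for item in page.get("message", {}).get("items", []):
                doi = item.get("DOI")
                if doi in seen:
                    continue  # duplicate item caused by cursor paging (see #356)
                seen.add(doi)
                out.write(json.dumps(item) + "\n")
```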

@jinamshah

@de-code how would you create a CSV, given that the JSON file is multi-tiered, i.e. items inside a message and a dict of authors inside each item? I'm fairly new at this.
Thanks in advance

@de-code

de-code commented Jul 24, 2018

@jinamshah I think this is getting a bit beyond this ticket. In summary, my CSV doesn't contain everything, but you could comma-separate the authors if you wanted to. For more questions regarding our dump, it might be best to raise an issue against our repo.
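To make that concrete, a minimal sketch (not de-code's actual pipeline) that flattens each item into a CSV row and joins the author names with commas; the chosen columns are just an example:

```python
import csv
import glob
import json

def items_to_csv(pages_glob="pages/*.json", out_path="works.csv"):
    """Write one CSV row per work: DOI, title, and a comma-joined author list."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["doi", "title", "authors"])
        for page_path in sorted(glob.glob(pages_glob)):
            with open(page_path, encoding="utf-8") as fh:
                page = json.load(fh)
            for item in page.get("message", {}).get("items", []):
                title = (item.get("title") or [""])[0]
                authors = ", ".join(
                    f"{a.get('given', '')} {a.get('family', '')}".strip()
                    for a in item.get("author", [])
                )
                writer.writerow([item.get("DOI", ""), title, authors])
```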

@bjrne

bjrne commented Oct 24, 2020

For anyone coming here from Google: @bnewbold took the time to upload newer versions of the dataset to the Internet Archive, as well as to link to the most recent and complete dumps from Crossref themselves (2020-04, 112M).

"Official": https://archive.org/details/crossref-doi-metadata-20200408
Self-crawled: https://archive.org/details/crossref_doi_dump_201909

@aplamada

aplamada commented Jul 15, 2021

Please see 2021-01-19 - New public data file: 120+ million metadata records.
