Compressed data dumps for works #271

Closed
several27 opened this issue Aug 23, 2017 · 33 comments

@several27

Hi, I was trying to get all of the works using the works API endpoint, paging with the cursor. However, I noticed it would take weeks at the current response time with the rows parameter set to 1000. Is there a way to speed it up, or do you perhaps provide data dumps for cases when someone needs all of the works metadata?

Thank you :)
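For reference, a minimal sketch of the cursor-based deep paging described above; the output filename and the lack of retry handling are simplifications, not the exact script used:

```python
import json
import requests

BASE_URL = "https://api.crossref.org/works"  # Crossref works endpoint
ROWS = 1000  # page size mentioned above

def dump_all_works(out_path="works.jsonl"):
    """Page through /works with a deep cursor and append each record as a JSON line."""
    cursor = "*"  # '*' starts a new deep-paging cursor
    with open(out_path, "w", encoding="utf-8") as out:
        while True:
            resp = requests.get(BASE_URL, params={"rows": ROWS, "cursor": cursor})
            resp.raise_for_status()
            message = resp.json()["message"]
            items = message.get("items", [])
            if not items:
                break  # an empty page means the cursor is exhausted
            for item in items:
                out.write(json.dumps(item) + "\n")
            cursor = message["next-cursor"]  # token for the next page

if __name__ == "__main__":
    dump_all_works()
```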

@moqri

moqri commented Aug 23, 2017

Currently it takes 70 hours to get all 90 million records with a good internet connection, but I totally agree that some kind of data dump would help both the users and the server.

@weirdf0x

I would be interested in a full dump too, even if it were only updated once a month. Once a week would be just perfect, but there's no reason to get greedy :)

You could compress it heavily and/or host it somewhere else to reduce the bandwidth.

Reasoning for updates: sometimes you add new fields or fix something across the whole dataset and it needs to be re-indexed, which changes the only reliable date field for updates and results in a full download.

@several27
Author

OK, so it took me about 90 hours to get all the "works". That's about 170GB of raw JSON data. I used a cheap AWS instance located in the eastern US. If anyone's interested: a very short gist to fetch it efficiently; I can also give "requester pays" type access to an S3 bucket with the zipped dataset, but can't promise any updates at the moment.

@moqri

moqri commented Sep 1, 2017

Please add some wait time between failed attempts so you don't overload the server:
https://github.com/CrossRef/rest-api-doc#rate-limits
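For example, a rough sketch of retrying failed requests with an increasing wait; the delay values are arbitrary and not taken from the rate-limit docs:

```python
import time
import requests

def get_with_backoff(url, params=None, max_retries=5):
    """Retry a failed request, sleeping longer after each failure."""
    delay = 5  # seconds; illustrative starting value
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, params=params, timeout=60)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # back off exponentially between failed attempts
```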

@several27 several27 changed the title Slow response time for deeper cursors for works Compressed data dumps for works Sep 11, 2017
@moqri

moqri commented Sep 19, 2017

Regarding the current API outage/slowdown, I think providing a dump would help your API traffic too.

@renaud

renaud commented Oct 2, 2017

@several27 I'm interested in your dump on S3, could you give me access to it? Thanks, Renaud at apache dot org

@kjw
Contributor

kjw commented Oct 4, 2017

This is a feature we plan on providing, but I don't want to do "one-off" dumps; if we are to do this, I want an obvious and workable way for people to keep their copies up to date.

@kjw kjw added the enhancement label Oct 4, 2017
@renaud

renaud commented Oct 4, 2017

@kjw agreed. One way that e.g. PubMed solves this is by providing daily update files that can be downloaded separately. See this for more info: ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/README.txt

Another, more pragmatic, option would be for Crossref to release dumps for each year, with consumers getting fresh data from the API in the meantime.

@several27
Author

several27 commented Oct 4, 2017

@renaud It's a little bit outdated (1st Sept), but for everyone interested in a full works data dump, here it is: s3://researchably-shared/crossref/2017-09-01.zip. At least until the Crossref folks provide one, feel free to use it. It's only 30GB compressed, and it's a requester-pays file, so the download cost is paid by the person downloading (this is the same strategy arXiv uses - https://arxiv.org/help/bulk_data_s3). As arXiv mentions, the easiest way to download it is with s3cmd, e.g. s3cmd get --requester-pays s3://researchably-shared/crossref/2017-09-01.zip.

Later in the month we'll probably update it; if Crossref still doesn't have their own data dumps by then, we can share the newer version as well (if they're fine with that) :)
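If you prefer boto3 to s3cmd, a requester-pays download of the file above looks roughly like this (the local filename is arbitrary):

```python
import boto3

# Requester-pays download of the shared dump; your AWS account is billed for the transfer.
s3 = boto3.client("s3")
s3.download_file(
    Bucket="researchably-shared",
    Key="crossref/2017-09-01.zip",
    Filename="2017-09-01.zip",
    ExtraArgs={"RequestPayer": "requester"},
)
```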

@renaud

renaud commented Oct 4, 2017

thanks a lot @several27 !

@moqri

moqri commented Oct 7, 2017

Monthly dumps would be great!

@mjmehta15

@several27: I downloaded the works data dump from S3, which has 68,907,000 records, whereas the Crossref dashboard says it has 91,869,582 records. I understand that this zip file on S3 only has data up to Sept. 1st, but even after subtracting the numbers for Sep. and Oct., this dump is missing approximately 22,577,780 records. Can you please check and tell me how many records you have in your original 170GB of data, or whether something is wrong at my end?

@several27
Author

@mjmehta15 Nice catch! I have downloaded this file myself and you're right, there's something wrong, especially since the last line is:

{"status":"error","message-type":"exception","message-version":"1.0.0","message":{"name":"class java.lang.IllegalArgumentException","description":"java.lang.IllegalArgumentException: Vector arg to map conj must be a pair","message":"Vector arg to map conj must be a pair","stack":["clojure.lang.ATransientMap.conj(ATransientMap.java:37)","clojure.lang.ATransientMap.conj(ATransientMap.java:17)","clojure.core$conj_BANG_.invokeStatic(core.clj:3257)","clojure.core$conj_BANG_.invoke(core.clj:3249)","clojure.core.protocols$naive_seq_reduce.invokeStatic(protocols.clj:62)","clojure.core.protocols$interface_or_naive_red....

I doubt it was a compression error; maybe something on the Crossref side? Anyway, I'm trying to fetch the rest now, as the last cursor still seems to be working. Unfortunately, the API seems to be very slow at the moment (15s per request), so I'm not sure how much time it will take.
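In case it helps anyone sanity-check their copy, a rough way to count items and spot error responses, assuming the dump is stored as one JSON API response per line (adjust the reading logic if your file is structured differently):

```python
import json

def check_dump(path):
    """Count items and flag error responses in a line-delimited dump of API responses."""
    total_items = 0
    errors = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                errors += 1  # truncated or garbled line
                continue
            if record.get("status") == "error":
                errors += 1  # e.g. the IllegalArgumentException response quoted above
            else:
                total_items += len(record.get("message", {}).get("items", []))
    print(f"items: {total_items}, bad responses/lines: {errors}")
```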

@renaud

renaud commented Oct 9, 2017

Thanks @mjmehta15 and @several27

Regarding slow requests: are you using HTTPS AND specifying an appropriate contact? (https://github.com/CrossRef/rest-api-doc#good-manners--more-reliable-service) says:

As of September 18th 2017 any API queries that use HTTPS and have appropriate contact information will be directed to a special pool of API machines that are reserved for polite users.

Also: you might want to add a try/catch clause in your main loop...
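Concretely, something along these lines: a sketch that sends contact details in the User-Agent (the mailto address is a placeholder) and wraps each request in a try/except so one failure doesn't kill the loop:

```python
import requests

# A mailto in the User-Agent routes requests to the "polite" pool; the address below is a placeholder.
HEADERS = {"User-Agent": "works-downloader/0.1 (mailto:you@example.org)"}

def fetch_page(cursor, rows=1000):
    """Fetch one page over HTTPS with contact info; return None on failure instead of crashing the loop."""
    try:
        resp = requests.get(
            "https://api.crossref.org/works",
            params={"rows": rows, "cursor": cursor},
            headers=HEADERS,
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["message"]
    except requests.RequestException:
        return None  # caller can retry with the same cursor
```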

@mjmehta15

Thanks for replying, @several27 and @renaud. I am also trying to get the data using all the good manners mentioned in the link shared above, yet I'm still not able to get the complete data. How are things at your end, @several27? Were you able to get the rest of the data?

@several27
Author

@mjmehta15 I should have the rest by the end of today and will post the info here.
@renaud Thanks! That sped it up 3x :)

@several27
Author

several27 commented Oct 12, 2017

@mjmehta15 This dump includes the first one plus the rest that I fetched from the last valid token: s3://researchably-shared/crossref/2017-10-11.zip. I have not looked into its contents, as I don't have enough time right now, but I know there are more than a few empty requests, and the total number of items might be more than 100 million, so there are probably some repetitions. Feel free to let me know how it works for you :)

@rnnrght

rnnrght commented Dec 14, 2017

Hi @several27: Thanks for providing this. My colleague was trying to download the dump but seems to be getting a 400 error. His credentials work for other requester-pays downloads (like arXiv). Are these dumps still available? Is there a problem on your end?

Thanks for looking into this.

@moqri

moqri commented Dec 14, 2017

@rnnrght , I just put a copy of the data here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/7JIWXI

This data contains the following fields:
doi, year, citation count, issnp, issne, journal, publisher, licence

Please let me know if you need something else and I'll see if I can add it.
Code: https://github.com/moqri/Open-Science-Database/blob/master/notebooks/1_get_articles.ipynb

@jenniferlin15 jenniferlin15 self-assigned this Dec 22, 2017
@moqri

moqri commented Dec 27, 2017

The data from 1900 to 2010 is now available at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/7JIWXI

@de-code

de-code commented Feb 5, 2018

If anyone is interested, data downloaded January 2018 now available in Figshare:
https://doi.org/10.6084/m9.figshare.5845554

And just citation links:
https://doi.org/10.6084/m9.figshare.5849916

dhimmel added a commit to greenelab/crossref that referenced this issue Feb 6, 2018
Reference recent Crossref dumps discussed in #5

Refs CrossRef/rest-api-doc#271
dhimmel added a commit to greenelab/crossref that referenced this issue Feb 6, 2018
Reference recent Crossref dumps discussed in #5.
Closes #5

Refs CrossRef/rest-api-doc#271
dhimmel added a commit to greenelab/crossref that referenced this issue Feb 7, 2018
Reference recent Crossref dumps discussed in #5.
Closes #5

Refs CrossRef/rest-api-doc#271
@rufuspollock

I just wondered how come this got closed; the various dumps out there will obviously go out of date. I think what was requested here was a regularly updated, authoritative dump from Crossref itself (?).

@jenniferlin15
Contributor

This conversation went many places. Thanks for following up on this particular thread. We rolled out a "snapshots" feature where full data dumps, updated monthly, are now part of the Metadata Plus service. For more information, please see: https://www.crossref.org/services/metadata-delivery/plus-service/

@rufuspollock

@jenniferlin15 thanks for the rapid reply. To clarify: is the Plus service a paid service? If so, is there any other source of bulk data from Crossref that is freely available?

@jenniferlin15
Contributor

Yes, Plus is a paid service that builds on top of the public API to help Crossref continue to make the data freely available to the community over the long run.

@rufuspollock

@jenniferlin15 thanks, and there is no other bulk source of the data than the paid service?

@jenniferlin15
Contributor

Yes, at the moment that is correct.

@jinamshah

@de-code does your download give separate JSON files for all the pages? And if so, can they be stitched into one?

@de-code

de-code commented Jul 23, 2018

@jinamshah yes, a separate JSON file per page. That was mainly done to avoid any questions about where the data came from (as it's the unmodified response from the Crossref API). However, it might be more convenient to create a JSON Lines file with all of the responses. In my case I created CSV files for further processing instead. (If you do that yourself you would also need to handle duplicates due to #356.)

In any case, links to new dumps from April 2018 are available in my crossref-data-notebook.
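For anyone who does want to stitch the per-page files into a single JSON Lines file, a rough sketch that also drops duplicate DOIs; the directory layout is an assumption, and keeping all DOIs in memory is a simplification that needs a lot of RAM at this scale:

```python
import glob
import json

def stitch_pages(pages_glob="pages/*.json", out_path="works.jsonl"):
    """Merge per-page API responses into one JSON Lines file, skipping duplicate DOIs."""
    seen = set()
    with open(out_path, "w", encoding="utf-8") as out:
        for page_path in sorted(glob.glob(pages_glob)):
            with open(page_path, encoding="utf-8") as fh:
                page = json.load(fh)
            for item in page.get("message", {}).get("items", []):
                doi = item.get("DOI")
                if doi in seen:
                    continue  # duplicate item caused by cursor paging (see #356)
                seen.add(doi)
                out.write(json.dumps(item) + "\n")
```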

@jinamshah

@de-code how would you create a CSV, given that the JSON file is multi-tiered, i.e. items inside a message and a dict of authors inside each item? I'm fairly new at this.
Thanks in advance

@de-code

de-code commented Jul 24, 2018

@jinamshah I think this is getting a bit beyond this ticket. In summary, my CSV doesn't contain everything, but you could comma-separate the authors if you wanted to. For more questions regarding our dump, it might be best to raise an issue against our repo.
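To make that concrete, a minimal sketch (not de-code's actual pipeline) that flattens each item into a CSV row and joins the author names with commas; the chosen columns are just an example:

```python
import csv
import glob
import json

def items_to_csv(pages_glob="pages/*.json", out_path="works.csv"):
    """Write one CSV row per work: DOI, title, and a comma-joined author list."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["doi", "title", "authors"])
        for page_path in sorted(glob.glob(pages_glob)):
            with open(page_path, encoding="utf-8") as fh:
                page = json.load(fh)
            for item in page.get("message", {}).get("items", []):
                title = (item.get("title") or [""])[0]
                authors = ", ".join(
                    f"{a.get('given', '')} {a.get('family', '')}".strip()
                    for a in item.get("author", [])
                )
                writer.writerow([item.get("DOI", ""), title, authors])
```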

@bjrne

bjrne commented Oct 24, 2020

For anyone coming here from Google: @bnewbold took the time to upload newer versions of the dataset to the Internet Archive, as well as to link to the most recent and complete dumps from Crossref themselves (2020-04, 112M).

"Official": https://archive.org/details/crossref-doi-metadata-20200408
Self-crawled: https://archive.org/details/crossref_doi_dump_201909

@aplamada

aplamada commented Jul 15, 2021

Please see 2021-01-19 - New public data file: 120+ million metadata records.
