Compressed data dumps for works #271
Comments
Currently it takes 70 hours to get all 90 million records with a good internet connection, but I totally agree that some kind of data dump would help both the users and the server.
I would be interested in a full dump too, even if it were updated only once a month. Once a week would be just perfect, but there's no reason to get greedy :) You could compress it heavily and/or host it somewhere else to reduce the bandwidth. Reasoning for updates: sometimes you add new fields or fix something in the whole dataset and it needs to be re-indexed, changing the only reliable date field for updates and resulting in a full download.
OK, so it took me about 90 hours to get all the "works". That's about 170 GB of pure JSON data. I used a cheap AWS instance located in the US East region. If anyone's interested: very short gist to get it efficiently; I can also give "requester pays"-type access to S3 with the zipped dataset, but can't promise any updates at the moment.
Please add some wait time between failed attempts so you don't overload the server.
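A sketch of what that might look like (not the snippet originally posted here), using simple exponential backoff with the `requests` library; the retry counts and delays are arbitrary:

```python
import time
import requests

def fetch_page(cursor, rows=1000, max_retries=5):
    """Fetch one /works page, backing off after each failed attempt."""
    delay = 5  # seconds to wait after the first failure
    for attempt in range(max_retries):
        try:
            resp = requests.get(
                "https://api.crossref.org/works",
                params={"rows": rows, "cursor": cursor},
                timeout=60,
            )
            resp.raise_for_status()
            return resp.json()
        except (requests.RequestException, ValueError):
            time.sleep(delay)  # wait before retrying so the server gets a break
            delay *= 2         # double the wait after each failure
    raise RuntimeError(f"giving up after {max_retries} failed attempts")
```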
Regarding the current API outage/slowdown, I think providing a dump would help your API traffic too.
@several27 I'm interested in your dump on S3, could you give me access to it? Thanks, Renaud at apache dot org
This is a feature we plan on providing, but I don't want to do "one-off" dumps; if we are to do this, I want an obvious and workable way for people to keep their copies up to date.
@kjw agreed. One way that e.g. PubMed solves it is by providing daily update files that can be downloaded separately. See this for more info: ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/README.txt Another, more pragmatic way would be for CrossRef to release dumps for each year, and in the meantime consumers would get fresh data from the API.
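(For illustration only, not an official plan: the REST API's from-index-date filter can already be used to pick up recently indexed records, so a consumer of a yearly dump could top it up with something like the request below; the date is just an example.)

```python
import requests

# Request only records indexed since a given date (date is illustrative),
# which is roughly the "dump + fresh data from the API" pattern described above.
resp = requests.get(
    "https://api.crossref.org/works",
    params={
        "filter": "from-index-date:2017-09-01",
        "rows": 1000,
        "cursor": "*",
    },
    timeout=60,
)
recent_items = resp.json()["message"]["items"]
```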
@renaud It's a little bit outdated (1st Sept), but for everyone interested in the full dump. Later in the month we'll probably be updating it, and if Crossref still doesn't have their data dumps, we can share the newer version as well (if they are fine with that) :)
Thanks a lot @several27!
Monthly dumps would be great!
@several27: I downloaded the works data dump from S3, which has 68,907,000 records, whereas the Crossref dashboard says there are 91,869,582 records. I understand that this zip file on S3 has data only up to Sept. 1st, but even after subtracting the numbers for Sep. & Oct., this dump is missing approximately 22,577,780 records. Can you please check and tell me how many records you have in your actual 170 GB of data, or whether something is wrong at my end?
@mjmehta15 Nice catch! I have downloaded this file myself and you're right, there's something wrong, especially given that the last line is:
I doubt it was a compression error; maybe something on the Crossref side? Anyway, I'm trying to get the rest now, as the last cursor still seems to be working. Unfortunately, the API seems to be very slow now (15 s per request), so I'm not sure how much time it will take.
Thanks @mjmehta15 and @several27. Regarding slow requests: are you using HTTPS AND specifying an appropriate contact? (https://github.com/CrossRef/rest-api-doc#good-manners--more-reliable-service) says:
Also: you might want to add a
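For illustration (a sketch, not the exact snippet from the comment above): a "polite" request that uses HTTPS and attaches contact details via a mailto parameter and the User-Agent header, as the etiquette section linked above recommends. The email address is a placeholder.

```python
import requests

CONTACT = "you@example.org"  # placeholder; use your real contact address

def polite_get(params):
    """Send a /works request over HTTPS with contact info attached."""
    return requests.get(
        "https://api.crossref.org/works",
        params={**params, "mailto": CONTACT},
        headers={"User-Agent": f"my-harvester/0.1 (mailto:{CONTACT})"},
        timeout=60,
    )

# Example: first page of works with 1000 rows, identified by a contact address.
resp = polite_get({"rows": 1000, "cursor": "*"})
```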
Thanks for replying back, @several27 and @renaud. I am also trying to get the data using all the good manners mentioned in the link shared above, yet I'm not able to get the complete data. How are things at your end, @several27? Were you able to get the rest of the data?
@mjmehta15 I should have the rest by the end of today; will post info here.
@mjmehta15 This dump includes the first part plus the rest that I fetched from the last valid token.
Hi @several27: Thanks for providing this. My colleague was trying to download the dump but seems to be getting a 400 error. His credentials work for other requester-pays downloads (like arXiv). Are these dumps still available? Is there a problem on your end? Thanks for looking into this.
@rnnrght, I just put a copy of the data here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/7JIWXI This data contains the following fields: Please let me know if you need something else and I'll see if I can add it.
The data from 1900 to 2010 is now available at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/7JIWXI
If anyone is interested, data downloaded in January 2018 is now available on Figshare: And just the citation links:
Reference recent Crossref dumps discussed in #5. Closes #5 Refs CrossRef/rest-api-doc#271
I just wondered how this got closed; the various dumps out there will obviously go out of date. I think what was requested here was a regularly updated, authoritative dump from Crossref itself (?).
This conversation went many places. Thanks for following up on this particular thread. We rolled out a "snapshots" feature where full data dumps, updated monthly, are now part of the Metadata Plus service. For more information, please see: https://www.crossref.org/services/metadata-delivery/plus-service/
@jenniferlin15 thanks for the rapid reply. To clarify: is the Plus service a paid service? If so, is there any other source of bulk data from Crossref that is freely available?
Yes, Plus is a paid service that builds on top of the public API to help Crossref continue to make the data freely available to the community over the long run.
@jenniferlin15 thanks, and there is no other bulk source of data than the paid service?
Yes, at the moment that's correct.
@de-code does your download give separate JSON files for all the pages? And if so, can they be stitched into one?
@jinamshah yes, a separate JSON file per page. That was mainly done to avoid any questions about where the data came from (as it's the unmodified response from the Crossref API). However, it might be more convenient to create a JSON Lines file with all of the responses. In my case I created CSV files for further processing instead. (If you do that yourself you would also need to handle duplicates due to #356.) In any case, links to new dumps from April 2018 are available in my crossref-data-notebook.
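A minimal sketch of that stitching step, assuming the per-page responses were saved as files named page_*.json (the filenames are an assumption, not how the dump above is actually laid out); it writes one work per line to a JSON Lines file:

```python
import glob
import json

# Merge per-page /works responses (assumed to be saved as page_*.json, each the
# raw API response) into a single JSON Lines file with one work record per line.
with open("works.jsonl", "w") as out:
    for path in sorted(glob.glob("page_*.json")):
        with open(path) as f:
            page = json.load(f)
        for item in page["message"]["items"]:
            out.write(json.dumps(item) + "\n")
```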
@de-code how would you create a CSV, given that the JSON file is multi-tiered, i.e. items inside a message and a dict of authors inside each item? I'm fairly new at this.
@jinamshah I think this is getting a bit beyond this ticket. In summary, my CSV wouldn't contain everything, but you could comma-separate the authors if you wanted to. For more questions regarding our dump, it might be best to raise an issue against our repo.
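For anyone attempting this themselves, a rough sketch of the flattening idea (not @de-code's actual script; the chosen columns are illustrative, and authors are joined with semicolons so they don't clash with the CSV delimiter):

```python
import csv
import json

# Flatten works records (one JSON object per line, e.g. the works.jsonl file
# from the earlier sketch) into a CSV, joining the nested author list into a
# single column.
with open("works.jsonl") as src, open("works.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    writer.writerow(["doi", "title", "authors"])
    for line in src:
        item = json.loads(line)
        title = item["title"][0] if item.get("title") else ""
        authors = "; ".join(
            " ".join(filter(None, [a.get("given"), a.get("family")]))
            for a in item.get("author", [])
        )
        writer.writerow([item.get("DOI", ""), title, authors])
```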
For anyone coming here from Google: @bnewbold took the time to upload newer versions of the dataset to the Internet Archive, as well as to create links to the most recent and complete dumps from Crossref themselves (2020-04, 112M). "Official": https://archive.org/details/crossref-doi-metadata-20200408
Hi, I was trying to get all of the works using the works API endpoint and paging using the cursor. However, I noticed it would take weeks with the current response time with the `rows` parameter set to 1000. Is there a way to speed it up, or do you maybe provide data dumps, for cases when someone needs all of the works metadata? Thank you :)
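For reference, a minimal sketch of the deep-paging loop described above (start with cursor=* and follow next-cursor until a page comes back empty); error handling and the polite-pool contact details from earlier in the thread are omitted for brevity:

```python
import requests

def iter_works(rows=1000):
    """Yield all works by following the cursor until an empty page is returned."""
    cursor = "*"
    while True:
        resp = requests.get(
            "https://api.crossref.org/works",
            params={"rows": rows, "cursor": cursor},
            timeout=60,
        )
        message = resp.json()["message"]
        if not message["items"]:
            break
        yield from message["items"]
        cursor = message["next-cursor"]
```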