Long running works API data retrieval using cursor inconsistent #356

Open
de-code opened this issue Apr 19, 2018 · 1 comment


de-code commented Apr 19, 2018

Hi,

I started my latest Crossref download through the works API, using a cursor, on 12 April, and it finished on the 18th. There were no interruptions this time, but I received more pages than I should have: 101441 pages instead of the expected 96102 or so (total results were 96101447 at the start of the download and are now 96251307; the page size was 1000).

In the results, the same DOI is returned more than once. In some, but not all, of these cases the is-referenced-by-count has changed.
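For anyone wanting to collapse such duplicates in saved responses, here is a minimal sketch. It assumes each saved page is the usual works response (records under `message.items`, each with a `DOI` field and an `indexed.timestamp` value) and it keeps the most recently indexed copy of each DOI. The function name `dedupe_by_doi` is made up for this example; it is not part of any Crossref tooling.

```python
from typing import Dict, Iterable


def dedupe_by_doi(pages: Iterable[dict]) -> Dict[str, dict]:
    """Keep one record per DOI, preferring the most recently indexed copy."""
    latest: Dict[str, dict] = {}
    for page in pages:
        # Assumed response shape: {"message": {"items": [...]}}
        for item in page.get("message", {}).get("items", []):
            doi = item["DOI"].lower()  # DOIs are case-insensitive
            stamp = item.get("indexed", {}).get("timestamp", 0)
            kept = latest.get(doi)
            kept_stamp = kept.get("indexed", {}).get("timestamp", 0) if kept else -1
            if stamp >= kept_stamp:
                latest[doi] = item
    return latest
```

For a dataset of this size the dictionary would not fit comfortably in memory, so in practice the same idea would probably need an on-disk key-value store or a sort-and-merge pass.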

The number of unique DOIs across all results is 96220498, which lies between the total of 96101447 at the start and 96251307 at the end.
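To make the numbers easier to compare, here is the arithmetic as a small script. It uses only the figures quoted above; the one added assumption is that every page was full (1000 records), which makes the duplicate count an upper bound rather than an exact value.

```python
import math

PAGE_SIZE = 1000
total_at_start = 96_101_447
total_at_end = 96_251_307
pages_received = 101_441
unique_dois = 96_220_498

# Pages expected for the totals reported at the start and at the end.
expected_pages_start = math.ceil(total_at_start / PAGE_SIZE)
expected_pages_end = math.ceil(total_at_end / PAGE_SIZE)

# Pages beyond what even the final total would explain.
extra_pages = pages_received - expected_pages_end

# Upper bound on duplicate records, assuming every page was full.
max_duplicates = pages_received * PAGE_SIZE - unique_dois

# Gap between the final total and the unique DOIs actually seen.
gap_to_final_total = total_at_end - unique_dois

print(expected_pages_start, expected_pages_end, extra_pages)
print(max_duplicates, gap_to_final_total)
```

The gap to the final total does not by itself show that works were omitted, since the total kept growing while the download was running.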

My questions would be:

  • Is that expected or a bug?
  • Can it omit works, or will it at worst return some of them multiple times?

(I have all of the JSON responses saved if that helps)

@weirdf0x

I think this is a Solr thing: the result set can change during iteration. Otherwise, the contents of the result set would have to be saved somewhere, which would need a huge amount of RAM or disk space. I downloaded an initial set with the until-index-date filter and now use until-index-date in combination with from-index-date to get updates. You will only get the whole set in one pass if you are lucky, but after a few updates most of the data should have been downloaded.
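A rough sketch of how the request URLs for that approach might be built is below. The helper `build_works_url` and the dates are made up for illustration. The `rows`, `cursor` and `filter` parameters and the `from-index-date` / `until-index-date` filter names are the ones the works API uses, but the exact boundary behaviour of the date filters is worth checking before relying on it.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.crossref.org/works"


def build_works_url(from_index_date=None, until_index_date=None,
                    cursor="*", rows=1000):
    """Build a works URL restricted to an index-date window (hypothetical helper)."""
    filters = []
    if from_index_date:
        filters.append(f"from-index-date:{from_index_date}")
    if until_index_date:
        filters.append(f"until-index-date:{until_index_date}")
    params = {"rows": rows, "cursor": cursor}
    if filters:
        # Multiple filters are combined with commas in a single parameter.
        params["filter"] = ",".join(filters)
    return f"{BASE_URL}?{urlencode(params)}"


# Initial snapshot: everything indexed up to a fixed date (illustrative date).
initial_url = build_works_url(until_index_date="2018-04-12")

# Later update: only records indexed since the snapshot, up to a new cut-off.
update_url = build_works_url(from_index_date="2018-04-12",
                             until_index_date="2018-04-19")
```

For each window, the first request uses `cursor=*` and later requests pass the `next-cursor` value from the previous response.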
