Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fetching citations from Crossref using REST API #139

Conversation

Dominic-DallOsto
Copy link
Collaborator

@Dominic-DallOsto Dominic-DallOsto commented Oct 24, 2021

Fixes #43

Uses CrossRef's REST API to get citations for an item. Mostly this works well because the cited items have DOIs which we can resolve. For items without identifiers, I attempt to roughly parse as much information as I can from them. Currently I just discard unstructured citations, but often some data (title, extra authors) is present in here that isn't in the strucutured data, so parsing them would be nice once #113 is resolved. For some items like this the data is CrossRef is still quite messy.

todo:

  • translate strings
  • provide user agent / email for CrossRef API

@Dominic-DallOsto
Copy link
Collaborator Author

Dominic-DallOsto commented Nov 3, 2021

Seems to be working nicely now. A couple of questions though:

  • should we include an email address in the CrossRef API call explained here?
  • should there be more detailed progress message updates in crossref.js (eg. item not found on crossref, item found but no references available)?
  • anything else?

@Dominic-DallOsto Dominic-DallOsto marked this pull request as ready for review November 7, 2021 14:12
@urspx
Copy link

urspx commented Feb 6, 2022

Hi, incredible work on the extension. Thank you so much for it!

I was wondering if this particular feature has seen any progress? It would, in my opinion (or, tbh, for my work), be the most useful addition, since CrossRef is such a big part of the infrastructure and has so much citation data.

Again, thank you, I know I'm just clamoring for features on pro bono work 🙈.

@diegodlh
Copy link
Owner

diegodlh commented Feb 7, 2022

Hi, @urspx! @Dominic-DallOsto has probably done a great work to implement this, but unfortunately I haven't had the time to check it yet. I'm slowly starting to catch up on all his contributions. Hopefully I'll manage to review this one soon. In the meantime, maybe he has a beta XPI for you to give it a try?

@urspx
Copy link

urspx commented Feb 7, 2022

Amazing, thank you! Looking forward :)

I'm happy to test / report bugs where possible to at least contribute something useful (and, once my schedule clears up a bit, perhaps even help with documentation – I know this is off topic, though).

@diegodlh
Copy link
Owner

diegodlh commented Feb 7, 2022

Thank you, @urspx! All bug reports and enhancement requests are welcome. We'll try to address them as fast as we can.

Regarding testing, we have in mind opening a parallel beta channel to help us test our latest changes. We'll post news about this here: #161

And better documentation would definitely be AMAZING. You can contribute whatever (text, screenshots, video) here: https://www.wikidata.org/wiki/Wikidata:Zotero/Cita#Documentation

@dpriskorn
Copy link
Contributor

dpriskorn commented Feb 10, 2022

I just checked out the branch, built an xpi and tested it and got "undefined"
bild
bild
Also I tried following https://www.zotero.org/support/dev/client_coding/plugin_development#setting_up_a_plugin_development_environment but got no error nor a working plugin :/ I'm on windows and it was unclear about how to exactly format the filename and contents.

@dpriskorn
Copy link
Contributor

dpriskorn commented Feb 11, 2022

bild
bild

getFromCrossref: function(menuName) {
        // get items selected
        // filter items with doi
        // generate batch call to crossref
        // only add items not available locally yet
        let items = Zotero.getActiveZoteroPane().getSelectedItems();
        Zotero.debug("getFromCrossref: Got this many items: " + 
        items.length);
        for (let item in items) {
            Zotero.debug("getFromCrossref: started the loop");
            let wrapped_item = SourceItemWrapper(item);
            Services.prompt.alert(
                window,
                "DOI found",
                wrapped_item.doi
            );
            Zotero.debug("getFromCrossref: "+
            "Running item.getFromCrossref now with this.doi: " + 
            wrapped_item.doi);
            item.getFromCrossref(wrapped_item.doi);
        }
    },

I tried improving this pull request with the code above, but for some reason I neither get an error nor an alert with the DOI. 🤷‍♂️
My changes in this branch https://github.com/dpriskorn/zotero-cita/tree/priskorn-crossref

@Dominic-DallOsto
Copy link
Collaborator Author

Thanks, I found the issue there. I'm in the middle of fixing it at the moment.

But it raises the more general question of how to deal with batch actions in the overlay. I guess we should have a different progress message for the batch actions? And would it be best to just call the source item wrapper functions to avoid code duplication? Or to reimplement it for a batch at once?

@dpriskorn
Copy link
Contributor

dpriskorn commented Feb 15, 2022

Thanks, I found the issue there. I'm in the middle of fixing it at the moment.

Nice! I look forward to seeing the solution.

@dpriskorn
Copy link
Contributor

But it raises the more general question of how to deal with batch actions in the overlay. I guess we should have a different progress message for the batch actions? And would it be best to just call the source item wrapper functions to avoid code duplication? Or to reimplement it for a batch at once?

Yes, the other functions seem to take "items" as argument and process them. I think it would be a good idea to be consistent. I don't know how the notification system works currently, but we could limit the functionality for now to only support fetching 1 item from crossref at a time. That would work around the issue, but probably lead to user frustration.

@Futur3r
Copy link
Contributor

Futur3r commented Oct 1, 2022

Hi, why the PR is not merged on master ?
I've tested the code and apart from a pop-up "couldn't get item citations from CrossRef" even though it did get it, everything works fine.

@Futur3r
Copy link
Contributor

Futur3r commented Oct 6, 2022

@Dominic-DallOsto Is there a way to easily debug with an IDE like vscode ?
I would like to fix the problem

@Dominic-DallOsto
Copy link
Collaborator Author

Did you find a way to get the debugging in VSCode to work? I tried once before but didn't manage.

Hi, why the PR is not merged on master ? I've tested the code and apart from a pop-up "couldn't get item citations from CrossRef" even though it did get it, everything works fine.

The only issue is here

await Promise.allSettled(sourceItemsWithDOI.forEach(async (sourceItem) => {
- if an error occurs getting one of the items, all the others are aborted. It's some odd asynchronous Promise and error handling issue I haven't figured out yet.

@Futur3r
Copy link
Contributor

Futur3r commented Oct 13, 2022

No, I didn't find a way.
I can't contribute to cita because I find it too difficult to debug anything.

I debugged this part of your code (manually with the JS console of Zotero) and I found (but I might be wrong) that

https://github.com/Dominic-DallOsto/zotero-cita/blob/128fa832c484c960432bac585f9846e3b7a36952/src/crossref.js#L48

isn't defined anywhere.

@Dominic-DallOsto Dominic-DallOsto force-pushed the Dominic-DallOsto/issue43 branch from 128fa83 to cd7290e Compare October 27, 2022 21:59
@Dominic-DallOsto
Copy link
Collaborator Author

Ok - I finally got things to a more or less working level. The popups aren't yet properly translated, but I hope the functionality is error-free. You can add citations from Crossref by right clicking the item(s), or from the citations menu for an item. The performance isn't amazing for large numbers of citations, because it currently gets each individually by DOI from Crossref.

Any feedback is more than welcome, thanks! @urspx @dpriskorn @Futur3r

Here's a debug build (you'll have to unzip it - Github won't let me upload an xpi here). zotero-cita-v0.6.0-beta-crossref.xpi.zip

@dbaynard
Copy link

Hello, great to see this is working! I've been following this PR (from a distance) for a while, so it's great to see some progress.

I had a question and some comments.

  1. Does the Zotero.HTTP.request function handle rate limits? (I haven't been able to find any API documentation, nor where this function is defined — though I haven't spent that long looking).

    The documentation has some etiquette and rate limits documentation, which aren't implemented in your code but might be implemented upstream (in Zotero.HTTP.request).
    From the other examples (see below) the User Agent has to be passed manually.

zotero-cita/src/wikidata.js

Lines 166 to 180 in 0508d84

const req = await Zotero.HTTP.request(
'POST',
RECONCILE_API.replace(
'$lng',
Services.locale.getRequestedLocale().split('-')[0]
),
{
body: `queries=${encodeURIComponent(JSON.stringify(queries))}`,
headers: {
'User-Agent': `${Wikicite.getUserAgent()} zotero/${Zotero.version}`
},
// Fixme: split large requests instead of disabling timeout #78
timeout: 0
}
);

  1. I noticed some code quality issues, such as formatting issues (a mix of tabs and spaces, for example) which makes it tricky to review on github, with default settings, and inconsistencies (missing syntax).

  2. Some of the code could be cleaner, making it easier to understand what is going on, and review.


For one example of the latter two points, this method,

	static async getCitations(doi) {
        const JSONResponse = await Crossref.getDOI(doi);

        if (JSONResponse){
            const references = JSONResponse.message.reference
            if (references && references.length > 0){
                const parsedReferences = await Promise.all(references.map(async (reference) => {
                    const parsedReference = await Crossref.parseReferenceItem(reference);
                    return parsedReference;
                }));
                return parsedReferences.filter(Boolean);
            }
            else{
                debug("Item found in Crossref but doesn't contain any references");
            }
        }
        return [];
	}

could be something like

    static async getCitations(doi) {
        const JSONResponse = await Crossref.getDOI(doi);
        if (!JSONResponse) return [];

        const references = JSONResponse.message.reference;
        if (!references.length) {
            debug("Item found in Crossref but doesn't contain any references");
            return [];
        }

        const parsedReferences = await Promise.all(
            references.map((reference) => Crossref.parseReferenceItem(reference))
        );
        return parsedReferences.filter(Boolean);
    }

1. Get list of references from Crossref for items - show how many to the user
2. Parse the references source item by source item, indicating progress because it can take a long time
3. Add references for all source items
@Dominic-DallOsto
Copy link
Collaborator Author

Thanks for the detailed feedback!

  1. Does the Zotero.HTTP.request function handle rate limits? (I haven't been able to find any API documentation, nor where this function is defined — though I haven't spent that long looking).

It's here. From what I can tell it handles retrying requests if they fail, but not Crossref "you got rate limited" responses. But getting the reference list from Crossref occurs at most once per item. Most of the calls to Crossref actually happen when parsing the references, which is handled by Zotero's Crossref translator. If we start to experience rate limiting errors, I think the solution would have to be implemented there.

I finally got around to adding the user agent string, thanks!

  1. I noticed some code quality issues, such as formatting issues (a mix of tabs and spaces, for example) which makes it tricky to review on github, with default settings, and inconsistencies (missing syntax).

Sorry about the tabs/space mixture - not sure how that slipped through. The main logic is a bit longer / more verbose now (particularly with all the progress messages), but I hope at least clear enough to follow?

In particular, I added a progress message saying how many items have successfully had their references parsed, because this tends to be quite slow if items have 50+ references.

zotero-cita-v0.6.0-beta-crossref.xpi.zip

Copy link

@dbaynard dbaynard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! With that, I've taken a look at this, and it's great! I do understand what is happening, though I haven't studied all the underlying zotero APIs. So, for example, I presume the Services.prompt.confirm function is synchronous (this would explain some annoying Zotero behaviour).

I figured this PR has been waiting long enough that a review with some feedback might help get this merged… so feel free to ignore what I have to say, or do whatever you want, but I'm hoping with some improvements we can get this merged and then the feature will be available.

OK from a high level, this seems like a very sensible approach (and it is very annoying that crossref doesn't have a batch API). The division of methods mostly makes sense (though it's clearer if you merge some of them). I didn't spot any minor correctness issues, but I also didn't run the code (need to figure out how to do that as the zotero javascript console did not work as I'd thought).

There is one major issue I spotted, and it's the implicit idea that Promises will always be fulfilled. Given the delays that you have identified, and the potential for transient network failures (made more likely due to crossref rate limiting), it does not seem right to assume all promises will be fulfilled. I've assumed this is not an explicit design decision, as you have a code path that assumes otherwise and thus cannot be reached (I've highlighted this in the review).

The fix needs a design decision: what happens when certain Promises reject?

  1. Fetching references from upstream (you do handle this, but in a more complicated way than necessary)
  2. Parsing references
  3. Adding citations to the database

There is value to failing early; however I think the user experience would be so much better to process citations that are successfully identified, even if there are failures.

Additionally, I think this needs some rate-limiting code, but it should probably be in a subsequent PR (before the next release).

Comment on lines +33 to +46
let sourceItemReferences;
try{
sourceItemReferences = await Promise.all(
sourceItemsWithDOI.map(async (sourceItem) => await Crossref.getReferences(sourceItem.doi))
);
}
catch (error){
progress.updateLine(
'error',
Wikicite.getString('wikicite.crossref.get-citations.error-getting-references')
);
debug(error);
return;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Promise.all the right way to combine these promises?

When any one is rejected, the whole thing will fail, and we'll return.

This means that the confirmation message later will always say references have been found for 100% of items.

wikicite.crossref.get-citations.confirm-message =
  Found references for %s out of %s item(s).\n%s citation(s) will be parsed.

I suspect what you want here is Promise.allSettled; then you'll need to filter the resulting (settled) promises to include only the fulfilled promises, for further processing. I suspect you'll want to print warnings for rejected promises.

However, if there's no rate-limiting in Zotero.HTTP.request (which seems to be the case), Promise.allSettled will fire all the requests at once, which we don't want to do. We should rate limit, appropriately. The simplest way is to use support for limiting concurrent requests from a library, but it isn't that complicated to write rate limit code. I don't think this should block this PR from being merged but it should block a new release.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention in this case was that getReferences would handle all errors, so these Promises should always be fulfilled, either with a list of references or []. An error here would be unexpected, and in this case we should stop and return early?

This means that the confirmation message later will always say references have been found for 100% of items.

I don't quite get what you mean here? For the items where we didn't find any references or an error occurred getting the list, the entry in sourceItemReferences will be [], so it will be filtered out of itemsToBeUpdated. Or am I missing something?

Yeah, the rate limiting should be added, but pragmatically it doesn't seem the biggest issue at the moment. We only fire one request per item (which normally would just be 1 or a few). I tested with 130 items at once and didn't get any errors. Like you say, it'd probably make sense to focus on it after everything else is working.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means that the confirmation message later will always say references have been found for 100% of items.

I don't quite get what you mean here? For the items where we didn't find any references or an error occurred getting the list, the entry in sourceItemReferences will be [], so it will be filtered out of itemsToBeUpdated. Or am I missing something?

No, I made a mistake — I missed that it was a call to sourceItems.length, and you're filtering sourceItems at the top of the function. Here's some of why:

My intention in this case was that getReferences would handle all errors, so these Promises should always be fulfilled, either with a list of references or []. An error here would be unexpected, and in this case we should stop and return early?

That makes sense, though I don't agree with your design choice. Here's why: you're implementing your own error handling logic, rather than using the logic provided by the language, and I don't see how it adds value (whereas I do see how it adds complexity). There is already a way to return early from the method in an (unanticipated) error case — don't catch the error, here. And rather than returning [] for fetch failures, you can Promise.reject.

Here's an alternative: getReferences (now much clearer — thank you) should return a Promise with the (possibly empty) list of references. If the request fails it should Promise.reject. In the highlighted snippet, with Promise.allSettled, only the fulfilled promises are further processed. Any unanticipated error causes the whole method to fail; this needs handling somewhere (but not in the method itself).

As I've said before, my goal here is to make review as easy as possible for the maintainer so your work can be merged. Every added layer of indirection makes the code more complicated to review (and maintain). It's often worth it, and often not. I think adding an implicit error handling layer on top of the one in the language is not worth it, here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I agree with what you're saying. I just don't quite see how it would fit perfectly here, but please suggest something I might be missing.

  • because getReferences can innocently return an empty list if there are no references, it seems a bit weird to reject in this case? Or do you mean only for when an error occurs getting the references? But then I don't see the benefit - if getReferences would reject, we'd do the same as if it returned an empty list anyway?
  • If I understand you correctly, basically you're saying to remove the try/catch here and let an error naturally bubble up and stop execution? While that would make the code much clearer, I think the progress bar error message to a user is somewhat helpful?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • because getReferences can innocently return an empty list if there are no references, it seems a bit weird to reject in this case? Or do you mean only for when an error occurs getting the references?

The latter.

But then I don't see the benefit - if getReferences would reject, we'd do the same as if it returned an empty list anyway?

The benefit is that it is clear what is happening. It is obvious that there are error cases, and that the code is handling them.

  • If I understand you correctly, basically you're saying to remove the try/catch here and let an error naturally bubble up and stop execution?

That is the correct understanding, and it is just one option.

While that would make the code much clearer, I think the progress bar error message to a user is somewhat helpful?

OK, this is good. The code needs to be clear. It also ought to present users with helpful error messages. I think the complication comes because the logic for fetching references is interspersed with the logic for showing the UI.

I'll approach this in a separate message.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While that would make the code much clearer, I think the progress bar error message to a user is somewhat helpful?

Incidentally, are you missing a progress.close() before the return?

Also how does this whole plugin work: can I do anything else with zotero while the items are being fetched?

My less structured thoughts are below the break — I'll have to come back to this.


The simple change is to scrap the error message. Failed to get citations for item(s) from Crossref is not that helpful to users. Run the whole method in a try/catch/finally block and then progress.close() once, in the finally.

I've missed something, in my review, that I should have highlighted earlier: handling the data processing and the user interaction together complicates this code.

I fear the right solution might be uniodiomatic for Zotero: to separate the code that deals with the progress from the code doing the fetching. I'd normally solve this in javascript with addEventListener and dispatchEvent. I see examples of the former in this repository, but none of the latter — I presume they work with zotero but I don't know the conventions.

The real challenge I have concerns the need for user interaction. Services.prompt.confirm is called synchronously? So execution of the whole thread is just suspended until it returns? That… that is not how I would expect things to work. Is this the whole program thread? Plugin thread? I… don't like this, at all. Plausible that this dip into zotero development goes no further…

I'm going to have to come back to this, another time.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it'll be clearer if you are able to build and test the plugin. But in general yeah, everything is running independently of the main UI. So you can start fetching citations from Crossref then go about doing other stuff in Zotero and the progress box will keep you updated with what's going on. That's why I think it's fairly important - the progress bar is your only way as a user to know your command is still running / what it's doing / when it might finish.

I agree it's a bit ugly having all these progress messages in the code, but that's how the rest of the plugin is structured. Would replacing the progress bar updates with dispatchEvent make a big practical difference?

As you point out - Services.prompt.alert and Services.prompt.confirm are blocking - for when we can't proceeed without user interaction. Here both of these occur more or less towards the start of the method, so it shouldn't be too jarring for the user to need to provide some input given they've just clicked a button to do something (rather than this popping up 2 minutes later).

src/crossref.js Outdated Show resolved Hide resolved
src/crossref.js Outdated Show resolved Hide resolved
src/crossref.js Outdated Show resolved Hide resolved
src/crossref.js Outdated Show resolved Hide resolved
@Dominic-DallOsto
Copy link
Collaborator Author

Thanks again for the detailed feedback! Most of those changes make sense - otherwise I left a few comments.

Like you say, adding references despite an error would be a nicer user experience, particularly if running this on a large number of items.

@dbaynard
Copy link

dbaynard commented Nov 3, 2022

Glad you've found it helpful. Happy to keep up the pace of development on this to get this merged.

I'll figure out how to build the plugin and give it a spin, over the weekend.

(Edit: been delayed on this — I'll give this a go in the week.)

@dbaynard
Copy link

dbaynard commented Nov 4, 2022

Are you following dev:client coding:plugin development [Zotero Documentation] to test the plugin?

@Dominic-DallOsto
Copy link
Collaborator Author

Are you following dev:client coding:plugin development [Zotero Documentation] to test the plugin?

Yeah. The steps here are hopefully easy enough to follow to get it up and running

@Dominic-DallOsto
Copy link
Collaborator Author

I did a quick test using the bottleneck library to rate limit requests with a minimum of 20 ms between each request.

Actually this made the requests both slower and less reliable. I don't know if Zotero.HTTP has some internal rate limiting or how it works, but the requests seem to be sent out over a period of time instead of all at once, even when launching them all at once async.

@dbaynard
Copy link

dbaynard commented Nov 8, 2022

I did a quick test using the bottleneck library to rate limit requests with a minimum of 20 ms between each request.

Nice — good job.

the requests seem to be sent out over a period of time instead of all at once, even when launching them all at once async.

That's what I'd expect: it's how the library is designed, no? The docs for minTime say 'How long to wait after launching a job before launching another one.'

There's no rate limiting in Zotero.HTTP — this matches what you see (i.e. a change in behaviour when you add rate limiting). I also found no 429 error handling, which is how crossref requests that clients pause.


So let's not bother with rate limiting, but we should respond to 429 requests by backing off. Seems reasonable?

@Dominic-DallOsto
Copy link
Collaborator Author

That's what I'd expect: it's how the library is designed, no? The docs for minTime say 'How long to wait after launching a job before launching another one.'

Sorry, that wasn't super clear. With the requests being sent out over a period of time I was referring to even when using Zotero.HTTP, without any rate limiting. I'm not sure if this is some weird quirk with my network setup, but if I try to fetch crossref citations for 1400 items the requests don't all get fired off at once but spaced out over 10-15 seconds.

If I enable the rate limiting, the time for all the requests to be fulfilled increases (fair enough), but a number of requests also fail to get a response.

I guess in the ideal case Zotero would add some 429 handling upstream and we wouldn't have to worry about it? But I'm yet to be able to actually get a 429 response despite my best efforts...

@dbaynard
Copy link

@Dominic-DallOsto Had to step away from this for a bit — if you can add some handling for the 429 case then this will be fine (for now) and better than it was, and you should ping the maintainer to check it over.

We want to be good citizens, and that means handling 429s in every version that is released. Our counterparty might turn on 429 responses in the future. We can't force upgrades.

Thanks for your work on this; I think it will be a really useful feature, when merged.

I'm unlikely to get back to this for at least a month — hopefully this will be merged by then!

@Dominic-DallOsto
Copy link
Collaborator Author

I dug into this a bit more and it's not really trivial to handle 429 responses where they're more likely to occur (when we use Zotero translators to parse the reference lists we get). I'm not sure whether we can even identify this case, or whether it will just appear to us as if the translator failed to find any item.

I added 429 handling for when we directly access the Crossref API, but it just throws and error that is logged and caught.

@Dominic-DallOsto
Copy link
Collaborator Author

Replaced by #300

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for fetching citations from Crossref
6 participants