Commit
Extend ArXiv fetcher results by using data from related DOIs (#9170)
* Added new Fetcher that mimics ArXiv fetcher, but also gets new fields from ArXiv-issued DOI

The new Fetcher, 'ArXivWithDoi', implements the same interfaces as the 'ArXiv' fetcher and, for the most part, relies on calling the latter to comply with these interfaces. The actual innovation is that every BibEntry returned from 'ArXiv' is replaced by a merged version of it and a BibEntry from the DoiFetcher (with ArXiv's fields getting priority).

Additionally, two other changes were made:
- Added method 'merge()' to BibEntry, so that two BibEntries can be merged (with priority given to the current 'BibEntry' object)
- FOR NOW, all references to the 'ArXiv' class in other parts of the codebase ('WebFetchers', 'ImportHandler' and 'CompositeIdFetcher') were replaced with the new 'ArXivWithDoi' class, so that all UI paths lead to the use of this new version. Future commits may make its use optional with feature flags

A small number of manual tests were made. More to come. Next commits should also integrate automated tests.

* Added feature flag between new and old behavior in preferences menu (Import and Export)

One might want to toggle between a simpler response from ArXiv and a more complete one (with the additional information from the auto-assigned DOI), maybe for a more concise BibTeX entry, maybe for a lower number of requests to web APIs. Either way, a boolean feature flag has been added for toggling between the new and old behavior (the **latter** is selected by default). It can be found in "Options > Preferences > Import and Export > General > Use ArXiv-issued DOI for complementing ArXiv entry".

For this commit, the branching between new and old behavior was placed inside the new 'ArXivWithDoi' fetcher (which effectively envelops the 'ArXiv' fetcher), relying on information provided by an external 'Import preferences'. This decision avoids repetition in the multiple places that directly use the fetcher.
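The merge rule from the first item (fields of the current entry win, the other entry only fills in what is missing) can be sketched roughly as below. This is not the actual 'BibEntry' code: entries are modelled as plain maps from field name to value, and the class and method names ('EntryMergeSketch', 'merge') as well as the sample values are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for BibEntry: an entry is just a map of field -> value.
final class EntryMergeSketch {

    // Returns a new map with every field of `primary`, plus any field of
    // `secondary` that `primary` does not define. `primary` wins on conflicts.
    static Map<String, String> merge(Map<String, String> primary, Map<String, String> secondary) {
        Map<String, String> merged = new LinkedHashMap<>(primary);
        secondary.forEach(merged::putIfAbsent);
        return merged;
    }

    public static void main(String[] args) {
        // Made-up sample data, only to show the call shape.
        Map<String, String> fromArXiv = new LinkedHashMap<>();
        fromArXiv.put("title", "Title as returned by ArXiv");
        fromArXiv.put("eprint", "0000.00000");

        Map<String, String> fromDoi = new LinkedHashMap<>();
        fromDoi.put("title", "Title as returned by the DOI lookup");
        fromDoi.put("publisher", "arXiv");

        // Keeps the ArXiv title and eprint, and gains the publisher field.
        System.out.println(merge(fromArXiv, fromDoi));
    }
}
```

Returning a fresh map keeps both inputs untouched; an in-place variant would call `putIfAbsent` directly on the primary entry.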
* Merged both new and old ArXiv fetchers into one

Having two files marked as the 'ArXiv fetcher' is kind of bad, and if we adhere to getting the maximum information possible, it would be bad if any other file could directly instantiate the fetcher with less information. At the same time, a separation between the class that does the heavy lifting (ArXiv.java) and the one that just does post-processing (ArXivWithDoi.java), especially because the latter directly uses another fetcher, is required as good coding practice. So, with this commit, only one class is kept: ArXivFetcher, a copy of ArXivWithDoi with the previous ArXiv class as an internal, private class.

* Removed feature flag / button, implemented prioritized fields from DOI and contained an "External program change" bug

Because the last commit had been generating some unpredictable errors (some of which I assume were related to the many changes made for the feature flag implementation), I reset all changes (i.e. "git checkout main .") and just left the new unified ArXiv fetcher and the changes to BibEntry, theoretically leaving this commit in a state similar to the previous one, but without the feature flag button.

I also implemented a way to be selective about which fields from the DOI entry should overwrite the ones from ArXiv (for now, 'KEYWORDS' and 'AUTHOR') and changed the 'mergeWith()' function to apply changes to the current object instead of returning a copy as before.

Now, the thing that took most of my time since the last commit was this weird bug: when saving a database with entries imported by the new ArXiv fetcher, a prompt "The library has been modified by another program." would always appear, asking to accept some changes, which always included a modification to the newly added entry. This made no sense, as there was neither any involvement from an external program nor a modification since manually saving the database.
I seem to have found a possible, very weird cause: this would always happen when setting the 'KEYWORDS' field of the resulting BibEntry to the raw string from the DOI BibTeX (as discussed before, it contains more detailed information, so it was included in the "prioritized fields" from DOI). The thing is, this string contained a duplicated "keyword", the FOS (that I suppose stands for "Field Of Subject" or similar) of the entry. You can see this behavior by making a GET request to https://doi.org/[ArXiv-assigned DOI] with header "Accept: application/x-bibtex". When removing this duplication, the bug suddenly disappeared (it showed once, but not since). Maybe future commits will include a more resolute fix for this bug, but the current fix cannot really harm the end result (as unique keywords are what one would expect), so I leave it at that for now.

* Get even more info from user-assigned DOIs

Beyond the new information coming from the always-consistent ArXiv-issued DOI entries, sometimes the ArXiv fetcher returns a DOI manually assigned by the publishing user, usually leading to other repositories (like ScienceDirect) and containing even more fields. As such, this commit tweaks the ArXiv fetcher to also include these new fields in the final result.

As the actual structure can be, at first glance, unpredictable (since it comes from different services), the only thing that actually overwrites the overall BibTeX entry (after merging with the ArXiv-assigned DOI entry) is the DOI, as it represents a link to another repository. Future updates could revise this decision, including setting up a way of cataloging these different BibTeX structures and choosing which provider has the "best formatted field" (from ArXiv, ArXiv's DOI or external DOI).

* Made modifications to ArXivFetcher and its testing, passing all of them

This commit was set to fix the ArXivFetcherTest so that tests could run and pass.
This involved several tweaks to the test file, including adding and updating manual BibEntries with the new fields and mocking more behavior for DoiFetcher, among other things. Now, all ENABLED tests from this file should pass.

This process led to some modifications being needed in the ArXivFetcher, like:
- Ignoring ArXiv-issued DOIs when querying ArXiv's API for entries, as such a query doesn't return any entry (only user-issued DOIs do, from what I could piece together)
- Replaced the "JOURNALTITLE" field with simply "JOURNAL", as it was more standard across tests
- Added the "PUBLISHER" field as a priority for user-issued DOIs

Furthermore, I realized a certain problem with using the "KEYWORDS" field from ArXiv-issued DOIs: currently, two of the category taxonomies (https://arxiv.org/category_taxonomy) that are included in their expanded (full) form have commas as part of their names: "cs.CE (Computational Engineering, Finance, and Science)" and "cs.DC (Distributed, Parallel, and Cluster Computing)". As we use commas as the standard separator of keywords, this might be a problem... A solution could be enclosing them in curly brackets, but this should be discussed beforehand.

* Undone several String changes from "ArXivFetcher" back to "ArXiv"

When refactoring the name of the ArXiv fetcher, comments that included "ArXiv" were turned into "ArXivFetcher", even though some tests relied on the "ArXiv" string as the name of the fetcher (which it still is, according to the "getName()" method). This commit reverts this change.

* Some fixes to tests broken by changes from the new ArXiv fetcher

* Parallelize process of field infusion for search queries

Instead of sequentially calling the DoiFetcher (possibly twice) for every entry returned after querying the ArXiv API, make these calls happen in parallel, as they are completely unrelated to one another. This seems to significantly reduce processing time for larger paged results.
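One way to run such independent per-entry lookups in parallel in Java is with 'CompletableFuture'. The sketch below only illustrates the general pattern and is not the ArXivFetcher code: 'ParallelEnrichSketch' and 'enrichAll' are hypothetical names, and the 'enrich' function stands in for whatever per-entry lookup is performed (for example, a DOI fetch).

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.stream.Collectors;

final class ParallelEnrichSketch {

    // Applies `enrich` to every item concurrently and returns the results
    // in the same order as the input list.
    static <T, R> List<R> enrichAll(List<T> items, Function<T, R> enrich) {
        // First pass: start one asynchronous task per item (common ForkJoinPool).
        List<CompletableFuture<R>> futures = items.stream()
                .map(item -> CompletableFuture.supplyAsync(() -> enrich.apply(item)))
                .collect(Collectors.toList());

        // Second pass: wait for each task; join() rethrows failures as CompletionException.
        return futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Placeholder identifiers and a fake lookup, only to show the call shape.
        List<String> ids = Arrays.asList("id-1", "id-2", "id-3");
        List<String> enriched = enrichAll(ids, id -> "extra fields for " + id);
        enriched.forEach(System.out::println);
    }
}
```

Collecting all futures before joining any of them is what makes the calls overlap; joining inside the first stream would serialize them again. For blocking network calls, passing a dedicated executor to `supplyAsync` may be preferable to the common pool.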
* Parallelized all extended ArXiv fetching process to reduce processing time

In the previous commit, I only parallelized the batch processing of entries that are part of a search query. Now, individual searches for specific IDs are also sped up by parallel requests to the other sources that complement the original ArXiv entry (ArXiv-issued and user-issued DOIs).

In order for that to happen, most of the previous code had to be adapted for parallel processing, especially with the use of CompletableFutures. If these changes stay, more automated testing is expected, as well as an API retry system in case of API throttling.

* Added API Rate Limiting for async calls to DOI API (DOI Content Negotiation)

API throttling, a previous worry about using multithreading for the extended ArXiv fetching process, SHOULD now be mostly avoided with the integration of a limiter for the agencies "DataCite" (used by ArXiv-issued DOIs) and "Crossref". These two agencies define the maximum rate at which applications can perform API requests via DOI Content Negotiation (mEDRA does not seem to define one explicitly, so it's not being considered for now).

For that to be possible, other changes had to be made in URLDownload to reduce connection overhead.

* Replaced commas from ArXiv keywords (Category Taxonomy) by slashes

As mentioned in the PR, 2 resulting keywords would contain commas, which is the default keyword separator in JabRef. So, this commit fixes that by adjusting these two cases with forward slashes.

* Small changes to logging messages, exception handling, comments, etc.
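The comma replacement for the two Category Taxonomy names could look something like the sketch below. The two category names are the ones quoted in this commit message; the exact slash-separated replacement strings, as well as the names 'KeywordCommaSketch' and 'sanitize', are assumptions made for illustration and are not taken from the JabRef source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class KeywordCommaSketch {

    // Category names containing commas, mapped to an ASSUMED slash-separated form.
    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();

    static {
        REPLACEMENTS.put("Computational Engineering, Finance, and Science",
                "Computational Engineering / Finance / Science");
        REPLACEMENTS.put("Distributed, Parallel, and Cluster Computing",
                "Distributed / Parallel / Cluster Computing");
    }

    // Rewrites the known comma-containing names so that a later split on ','
    // no longer breaks them apart; all other text is left unchanged.
    static String sanitize(String rawKeywords) {
        String result = rawKeywords;
        for (Map.Entry<String, String> replacement : REPLACEMENTS.entrySet()) {
            result = result.replace(replacement.getKey(), replacement.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative keyword string, not a real API response.
        System.out.println(sanitize(
                "Machine Learning (cs.LG), Distributed, Parallel, and Cluster Computing (cs.DC)"));
    }
}
```

A fixed lookup table is enough here because only two taxonomy names are affected; it would need updating if new category names with commas appear.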
* Added some tests, comments and documentation

* QUICKFIX: avoiding potential NullPointerException on response header

* QUICKFIX: change in selected fields affected CompositeIdFetcher tests

* QUICKFIX: forgot to include entry on CHANGELOG

* Modified keyword duplication removal by calling existing method instead of manually doing it

* Update WebFetchersTest.java to include error on Logger

Co-authored-by: Christoph <siedlerkiller@gmail.com>

* Small coding style changes

* Forgot a comma while committing suggested modifications from GitHub UI

* Another set of small corrections

* QUICKFIX: fixed wrong output for CompositeIdFetcherTest on ArXiv calls

When calling the ArXiv fetcher, the 'CompositeIdFetcherTest' was returning a wrong result (more specifically, a wrong 'keywords' field, with duplicate keywords). This was caused by a missing return value on the 'importFormatPreferences.getKeywordSeparator()' mock, which is used by ArXivFetcher for removing duplicate keywords (which pretty much always happens during ArXivFetcher's processing).

Co-authored-by: Christoph <siedlerkiller@gmail.com>
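For reference, separator-based removal of duplicate keywords, the step that depends on the keyword separator mentioned above, can be sketched as follows. This is an illustration and not the existing JabRef method that the commit calls: 'KeywordDedupSketch' and 'removeDuplicates' are hypothetical names, and the sample keyword string is made up.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class KeywordDedupSketch {

    // Splits `keywords` on `separator`, trims each keyword, drops empty ones and
    // exact duplicates (first occurrence wins), then joins with "<separator> ".
    static String removeDuplicates(String keywords, char separator) {
        String separatorRegex = Pattern.quote(String.valueOf(separator));
        Set<String> unique = Arrays.stream(keywords.split(separatorRegex))
                .map(String::trim)
                .filter(keyword -> !keyword.isEmpty())
                .collect(Collectors.toCollection(LinkedHashSet::new));
        return String.join(separator + " ", unique);
    }

    public static void main(String[] args) {
        // Made-up keyword string with one repeated entry.
        String raw = "FOS: Example field, Machine Learning (cs.LG), FOS: Example field";
        System.out.println(removeDuplicates(raw, ','));
    }
}
```

A 'LinkedHashSet' keeps the original keyword order while discarding repeats, so the output stays stable between runs. Without a separator (as with the unmocked preference in the test), no splitting can take place, which matches the duplicate keywords seen in the wrong test output.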