Extend ArXiv fetcher results by using data from related DOIs #9170
Conversation
… from ArXiv-issued DOI The new fetcher, 'ArXivWithDoi', implements the same interfaces as the 'ArXiv' fetcher and, for the most part, relies on calling the latter to comply with them. The actual innovation is that every BibEntry returned from 'ArXiv' is replaced by a merged version of itself and a BibEntry from the DoiFetcher (with ArXiv's fields getting priority). Additionally, two other changes were made: - Added a 'merge()' method to BibEntry, so that two BibEntries can be merged (with priority given to the current 'BibEntry' object) - FOR NOW, all references to the 'ArXiv' class in other parts of the codebase (```WebFetchers```, ```ImportHandler``` and ```CompositeIdFetcher```) were replaced with the new 'ArXivWithDoi' class, so that all UI paths lead to this new version. Future commits may make its use optional with feature flags. A small number of manual tests were made; more to come. Next commits should also integrate automated tests.
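The merge described above can be sketched roughly as follows. This is a minimal standalone illustration with entries simplified to plain string maps; `EntryMerge` and its `merge()` signature are hypothetical stand-ins, not JabRef's actual `BibEntry` API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the merge idea: fields from the "other" entry are
// only kept when the current entry does not already define them, so the
// current entry's fields win on clashes (ArXiv over DOI in this PR).
public class EntryMerge {
    public static Map<String, String> merge(Map<String, String> current, Map<String, String> other) {
        Map<String, String> result = new HashMap<>(other);
        result.putAll(current); // current entry's fields overwrite clashing ones
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> arxiv = new HashMap<>();
        arxiv.put("title", "Some title");          // clashes: ArXiv version kept
        arxiv.put("eprint", "1811.10364");

        Map<String, String> doi = new HashMap<>();
        doi.put("title", "Other title");
        doi.put("doi", "10.48550/ARXIV.1811.10364"); // new field: copied over

        Map<String, String> merged = merge(arxiv, doi);
        System.out.println(merged.get("title") + " / " + merged.get("doi"));
    }
}
```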
Below, you can see the difference between imports with the old ArXiv fetcher, with only the ArXiv-issued DOI, and with the new ArXiv fetcher, respectively, for https://arxiv.org/abs/1811.10364. Note that when two fields clash, the one chosen is always the one from the old ArXiv fetcher. If you have opinions about this choice, feel free to discuss them.
@Comment{Import with the original 'ArXiv' fetcher}
@Article{Beel2018,
author = {Joeran Beel and Andrew Collins and Akiko Aizawa},
title = {The Architecture of Mr. DLib's Scientific Recommender-System API},
year = {2018},
month = nov,
abstract = {Recommender systems in academia are not widely available. This may be in part due to the difficulty and cost of developing and maintaining recommender systems. Many operators of academic products such as digital libraries and reference managers avoid this effort, although a recommender system could provide significant benefits to their users. In this paper, we introduce Mr. DLib's "Recommendations as-a-Service" (RaaS) API that allows operators of academic products to easily integrate a scientific recommender system into their products. Mr. DLib generates recommendations for research articles but in the future, recommendations may include call for papers, grants, etc. Operators of academic products can request recommendations from Mr. DLib and display these recommendations to their users. Mr. DLib can be integrated in just a few hours or days; creating an equivalent recommender system from scratch would require several months for an academic operator. Mr. DLib has been used by GESIS Sowiport and by the reference manager JabRef. Mr. DLib is open source and its goal is to facilitate the application of, and research on, scientific recommender systems. In this paper, we present the motivation for Mr. DLib, the architecture and details about the effectiveness. Mr. DLib has delivered 94m recommendations over a span of two years with an average click-through rate of 0.12%.},
archiveprefix = {arXiv},
eprint = {1811.10364},
file = {:http\://arxiv.org/pdf/1811.10364v1:PDF},
keywords = {cs.IR, cs.AI, cs.DL, cs.LG},
primaryclass = {cs.IR},
}
@Comment{Import with only the 'DOI' fetcher}
@Misc{Beel2018a,
author = {Beel, Joeran and Collins, Andrew and Aizawa, Akiko},
title = {The Architecture of Mr. DLib's Scientific Recommender-System API},
year = {2018},
copyright = {arXiv.org perpetual, non-exclusive license},
doi = {10.48550/ARXIV.1811.10364},
keywords = {Information Retrieval (cs.IR), Artificial Intelligence (cs.AI), Digital Libraries (cs.DL), Machine Learning (cs.LG), FOS: Computer and information sciences},
publisher = {arXiv},
}
@Comment{Import with the new 'ArXivWithDoi' fetcher}
@Article{Beel2018b,
author = {Joeran Beel and Andrew Collins and Akiko Aizawa},
title = {The Architecture of Mr. DLib's Scientific Recommender-System API},
year = {2018},
month = nov,
abstract = {Recommender systems in academia are not widely available. This may be in part due to the difficulty and cost of developing and maintaining recommender systems. Many operators of academic products such as digital libraries and reference managers avoid this effort, although a recommender system could provide significant benefits to their users. In this paper, we introduce Mr. DLib's "Recommendations as-a-Service" (RaaS) API that allows operators of academic products to easily integrate a scientific recommender system into their products. Mr. DLib generates recommendations for research articles but in the future, recommendations may include call for papers, grants, etc. Operators of academic products can request recommendations from Mr. DLib and display these recommendations to their users. Mr. DLib can be integrated in just a few hours or days; creating an equivalent recommender system from scratch would require several months for an academic operator. Mr. DLib has been used by GESIS Sowiport and by the reference manager JabRef. Mr. DLib is open source and its goal is to facilitate the application of, and research on, scientific recommender systems. In this paper, we present the motivation for Mr. DLib, the architecture and details about the effectiveness. Mr. DLib has delivered 94m recommendations over a span of two years with an average click-through rate of 0.12%.},
archiveprefix = {arXiv},
copyright = {arXiv.org perpetual, non-exclusive license},
doi = {10.48550/ARXIV.1811.10364},
eprint = {1811.10364},
file = {:http\://arxiv.org/pdf/1811.10364v1:PDF},
keywords = {cs.IR, cs.AI, cs.DL, cs.LG},
primaryclass = {cs.IR},
publisher = {arXiv},
}
@thiagocferr "Import with the new 'ArXivWithDoi' fetcher" contains more information about the reference. That is nice! I am not sure, though, about which field to keep:
…Import and Export) One might want to toggle between a simpler response from ArXiv and a more complete one (with the additional information from the auto-assigned DOI), maybe for a more concise BibTeX entry, maybe for a lower number of requests to web APIs. Either way, a boolean feature flag has been added for toggling between the new and old behavior (the **latter** is selected by default). It can be found under "Options > Preferences > Import and Export > General > Use ArXiv-issued DOI for complementing ArXiv entry". For this commit, the branching between new and old behavior was made inside the new 'ArXivWithDoi' fetcher (which effectively envelops the 'ArXiv' fetcher), relying on information provided by an external 'Import preferences'. This decision avoids repetition in the multiple places that directly use the fetcher.
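The branching-inside-the-fetcher idea could look roughly like this. All names here (`ImportPreferences`, `shouldComplementWithDoi()`, the string-based fetch) are illustrative stand-ins, not JabRef's real interfaces:

```java
// Hypothetical sketch: the wrapping fetcher consults an injected preference
// object and either returns the plain ArXiv result or the merged one.
public class ArXivWithDoiSketch {
    interface ImportPreferences { boolean shouldComplementWithDoi(); }

    private final ImportPreferences preferences;

    public ArXivWithDoiSketch(ImportPreferences preferences) {
        this.preferences = preferences;
    }

    public String fetch(String id) {
        String base = "arxiv:" + id; // stand-in for the plain ArXiv entry
        if (!preferences.shouldComplementWithDoi()) {
            return base; // old behavior: plain ArXiv result
        }
        return base + "+doi"; // new behavior: merged with DOI data
    }

    public static void main(String[] args) {
        ArXivWithDoiSketch off = new ArXivWithDoiSketch(() -> false);
        ArXivWithDoiSketch on = new ArXivWithDoiSketch(() -> true);
        System.out.println(off.fetch("1811.10364"));
        System.out.println(on.fetch("1811.10364"));
    }
}
```

Keeping the decision in one place means callers such as `WebFetchers` or `ImportHandler` never need to know the flag exists.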
About the
This argument only holds true if there aren't multiple files linked to the entry, which can happen quite easily. The reverse can also happen: multiple files are linked to an entry, but only one URL. So I think this argument does not really help in solving this dilemma.
@ThiloteE Do you mean "keeping the one from the DOI fetcher"? Thank you for the rationale about the field
@ThiloteE huh?
oooh lol. Sry I wanted to tag @mlep 🤣 |
OK... Let's get this straight...
@thiagocferr Do you mean "keeping the one from the DOI fetcher"? @ThiloteE Thank you for the rationale about the field
Yes, my bad.
What exactly do you mean by that? If you mean searching using the link itself, I guess it doesn't work, as this error is logged when you use the link itself in the ArXiv web search:
Using the ArXiv ID, it goes very fast for me...
Yes, I was wondering the same, but decided to just do it as both a learning exercise and a tool to help me debug. I may just remove this when putting this PR up for review.
Yes, I do think that having two files marked as the 'ArXiv fetcher' is kind of bad, and if we adhere to getting the maximum information possible, it would be bad if any other file could directly instantiate the fetcher with less information. However, I do think a separation between the class that does the heavy lifting (
This take is reasonable, indeed...
In a general setting, yes, this argument does not hold. However, the information imported from ArXiv seems to follow a standard here. Not that I've done extensive testing, but they all seem to include at least the Now, about the use of Because of all of this, I think we should really just leave it the way it is:
Having two files marked as the 'ArXiv fetcher' is kind of bad, and if we adhere to getting the maximum information possible, it would be bad if any other file could directly instantiate the fetcher with less information. At the same time, a separation between the class that does the heavy lifting (ArXiv.java) and the one that just does post-processing (ArXivWithDoi.java) is required as good coding practice (especially because it directly uses another fetcher). This way, with this commit, only one class is kept: ArXivFetcher, a copy of ArXivWithDoi with the previous ArXiv class as an internal, private class.
Codewise it looks good so far. I would leave the storing of the URL in linked files; in the importer, JabRef can automatically download them. Please have a look at the failing tests. I think for the architecture tests you just have to rename as well, or adapt to the new class names.
…I and contained an "External program change" bug Because the last commit had been generating some unpredictable errors (some of which I assume were related to the many changes made for the feature flag implementation), I "reset" all changes (i.e. "git checkout main .") and just left the new unified ArXiv fetcher and the changes to BibEntry, theoretically leaving this commit in a state similar to the previous one, but without the feature flag button. I also implemented a way to be selective about which fields from the DOI entry should overwrite the ones from ArXiv (for now, 'KEYWORDS' and 'AUTHOR') and changed the 'mergeWith()' function to apply changes to the current object instead of returning a copy as before. Now, the thing that mostly took my time since the last commit was this weird bug: when saving a database with entries imported from the new ArXiv fetcher, a prompt "The library has been modified by another program." would always appear, asking to accept some changes, which always included a modification to the newly added entry. This made no sense, as there was neither involvement from an external program, nor a modification since manually saving the database. I seem to have found a possible, very weird cause: this would always happen when setting the 'KEYWORD' field of the resulting BibEntry to the raw string from the DOI BibTeX (as discussed before, it contains more detailed information, so it was included in the "prioritized fields" from the DOI). The thing is, this string contained a duplicated "keyword": the FOS (which I suppose stands for "Field Of Subject" or similar) of the entry. You can see this behavior by making a GET request to https://doi.org/[ArXiv-assigned DOI] with the header "Accept=application/x-bibtex". When removing this duplication, the bug suddenly disappeared (it showed once, but not since).
Maybe future commits will include a more resolute fix for this bug, but the current fix cannot really affect the end result (as unique keywords are what one would expect), so I'll leave it at that for now.
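The deduplication fix boils down to something like the following standalone sketch. JabRef's actual code goes through its own keyword handling; here the separator is treated as a plain one-character regex for brevity:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.stream.Collectors;

// Illustrative sketch: split the raw keyword string on the separator,
// drop duplicates while preserving the original order, and re-join.
public class KeywordDedup {
    public static String dedupe(String keywords, char separator) {
        return Arrays.stream(keywords.split(String.valueOf(separator)))
                .map(String::trim)
                .collect(Collectors.toCollection(LinkedHashSet::new)) // order-preserving set
                .stream()
                .collect(Collectors.joining(separator + " "));
    }

    public static void main(String[] args) {
        // The DOI BibTeX repeated the FOS keyword, which is what triggered
        // the spurious "modified by another program" prompt.
        System.out.println(dedupe("cs.IR, cs.AI, cs.IR, FOS: Computer science, FOS: Computer science", ','));
    }
}
```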
So, I happened to get stuck on a weird bug since this commit (in reality, since this branch's first commit, according to my tests) and I made a change that seems to fix it, although I don't really get how the two could possibly be related. From my latest commit:
You can see the related fix here. From previous debugging sessions, I looked for reasons why there were differences between the internal and on-disk database (and there were quite a few when inspecting both entries), but the discrepancies made no sense to me, so I reached the previous conclusion by trial and error (i.e. seeing the results from gradually dropping the overwrite of each field for one specific ArXiv example). If you still see this behavior, please let me know the conditions under which it happened. I'd also appreciate it if someone could help me reach a conclusion about the underlying cause of this bug and whether there would be a more concrete fix. Anyway, I think I'll move on to fixing the failed tests next.
Thanks for the investigation. This library modification thing has been haunting us for a while already. I recently fixed most occurrences of it, but noticed it with keywords as well and could not pinpoint it. Can you open a new issue with your findings?
Beyond the new information coming from the always-consistent ArXiv-issued DOI entries, sometimes the ArXiv fetcher returns a DOI manually assigned by the publishing user, usually leading to other repositories (like ScienceDirect) and containing even more fields. As such, this commit tweaks the ArXiv fetcher to also include these new fields in the final result. As the actual structure can be, at first glance, unpredictable (since it's coming from different services), the only thing that actually overwrites the overall BibTeX entry (after merging with the ArXiv-assigned DOI entry) is the DOI, as it represents a link to another repository. Future updates could revise this decision, including setting up a way to catalog these different BibTeX structures and choose which provider has the "best formatted field" (from ArXiv, ArXiv's DOI, or an external DOI).
@ThiloteE 2 things to consider in the rationale:
This supports not using the field
What I mean is, the web search seems imperfect:
This commit was set to fix the ArXivFetcherTest so that tests could run and pass. This involved several tweaks to the test file, including adding and updating manual BibEntries with the new fields, mocking more behavior for DoiFetcher, among other things. Now, all ENABLED tests from this file should pass. This process led to some modifications being needed on the ArXivFetcher, like: - Ignoring ArXiv-issued DOI info when querying ArXiv's API for entries, as it doesn't return any entry (only user-issued ones are able to, from what I could piece together) - Replacing the "JOURNALTITLE" field with simply "JOURNAL", as it was more standard across tests - Adding the "PUBLISHER" field as a priority for user-issued DOIs Furthermore, I realized a certain problem with using the "KEYWORDS" field from ArXiv-issued DOIs: currently, two of the category taxonomies (https://arxiv.org/category_taxonomy) that are included in their expanded (full) form have commas as part of their names: "cs.CE (Computational Engineering, Finance, and Science)" and "cs.DC (Distributed, Parallel, and Cluster Computing)". As we use commas as standard separators for keywords, this might be a problem... A solution could be enclosing them in curly brackets, but this should be discussed beforehand.
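The separator clash described above is easy to demonstrate (illustrative snippet, not JabRef code): naively splitting on commas breaks an expanded category name into fragments.

```java
// Comma-splitting a keyword string whose expanded ArXiv category name
// itself contains commas yields broken fragments instead of two keywords.
public class KeywordSplitProblem {
    public static void main(String[] args) {
        String keywords = "cs.CE (Computational Engineering, Finance, and Science), cs.AI";
        String[] parts = keywords.split(",");
        // 4 fragments instead of the intended 2 keywords
        System.out.println(parts.length);
    }
}
```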
When refactoring the name of the ArXiv fetcher, comments that included "ArXiv" were turned into "ArXivFetcher", even though some tests relied on the "ArXiv" string as the name of the fetcher (and still is, by the "getName()" method). This commit reverts this change.
Instead of sequentially calling the DOIFetcher (possibly twice) for every entry returned after querying the ArXiv API, make the calls happen in parallel, as they are completely unrelated to one another. This seems to significantly reduce processing time for larger paged results.
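The parallelization idea can be sketched with `CompletableFuture` roughly as follows. This is a simplified standalone version; `complementWithDoi` stands in for the real (slow) network calls to the DOI service, and all names are hypothetical:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Instead of complementing each entry one after another, start one
// CompletableFuture per entry so all lookups run concurrently, then
// join to collect the results in the original order.
public class ParallelComplement {
    static String complementWithDoi(String entry) {
        // In the real fetcher this would call out to doi.org.
        return entry + "+doi";
    }

    public static List<String> complementAll(List<String> entries) {
        List<CompletableFuture<String>> futures = entries.stream()
                .map(e -> CompletableFuture.supplyAsync(() -> complementWithDoi(e)))
                .collect(Collectors.toList());
        // All requests are in flight by now; join() waits for each result.
        return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(complementAll(List.of("a", "b", "c")));
    }
}
```

With N entries and a per-lookup latency of t, the sequential version costs roughly N*t while the parallel one approaches t (bounded by the thread pool and any API rate limits).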
…ing time In the previous commit, I only parallelized the batch processing of entries that are part of a search query. Now, individual searches for specific IDs are also sped up by parallel requests to the other sources that serve as complements to the original ArXiv entry (from ArXiv- and user-issued DOIs). In order for that to happen, most of the previous code had to be adapted for parallel processing, especially with the use of CompletableFutures. If these changes stay, more automated testing is expected, as well as an API retry system in case of API throttling.
So, I've re-implemented most of the main combined fetching process to allow parallel API calls, which, from my experience, seems to reduce waiting time to about half of the previous implementation. Initially, I was only going to make parallel requests for search queries (as that was really trivial to do), but I wanted to try more in-depth parallelization in Java and thought the extra responsiveness in the UI would be neat (and, from what it seems, it's a little better than before). Please point it out if this approach would be problematic for some reason I might not know. If I were to continue down this path, I may look for a way to do retries when API throttling happens and then do more automated tests (especially regarding processing errors and how the code handles them). But I do have an issue that I don't really know what to do about and want opinions on: the separation of the new keywords field. For some reason, ArXiv has only 2 category taxonomies with expanded names containing commas: Knowing this, what kind of approach would you suggest? Replacing commas with something else? Just removing them? Something else? @mlep @ThiloteE @Siedlerchr PS: How do I make the
@Siedlerchr from my position, this PR is concluded. However, I don't think an external code review was made. If necessary, I'll make corrections to it.
@thiagocferr Okay cool! Before merging, a review by two maintainers is required. I will mark the PR as ready for review and try to go through it in more detail in the next days.
…ad of manually doing it
Co-authored-by: Christoph <siedlerkiller@gmail.com>
The file field having an https content is also supported by JabRef. One can (IMHO) directly click to download the file. At least, I saw something in the code being able to handle URLs there.
In general: looks good. I just have some small code style comments.
…f github.com:thiagocferr/jabref into 9092-arXiv-fetcher-import-more-information-using-DOI
Wait... is something going on with the CI server? It says Edit: Oh, I thought that was the cause of the tests failing, but it was an error somewhere else...
You can ignore the Gradle version; it's just a warning, not an error.
Very small comments - otherwise: Good to go!
Oh, wait, there's still a bug on
When calling the ArXiv fetcher, the 'CompositeIdFetcherTest' was returning a wrong result (more specifically, a wrong 'keywords' field, with duplicate keywords). This was caused by a missing return value on the 'importFormatPreferences.getKeywordSeparator()' mock, which is used by ArXivFetcher for removing duplicate keywords (which pretty much always happens during ArXivFetcher's processing).
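The failure mode can be reproduced without a mocking library by stubbing the preference by hand. `ImportFormatPreferences` below is a trimmed-down stand-in for JabRef's interface, not the real one: an unstubbed mock returns a default value (null), so the separator-dependent deduplication silently degrades.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;

// Sketch: when the preference stub returns null (the default for an
// unstubbed mock), the keyword string is passed through untouched and
// duplicates survive; with a proper stub, deduplication works.
public class StubbedPreferences {
    interface ImportFormatPreferences { Character getKeywordSeparator(); }

    static String dedupe(String keywords, ImportFormatPreferences prefs) {
        Character sep = prefs.getKeywordSeparator();
        if (sep == null) {
            return keywords; // unstubbed mock: duplicates slip through
        }
        return String.join(sep + " ",
                new LinkedHashSet<>(Arrays.asList(keywords.split(sep + "\\s*"))));
    }

    public static void main(String[] args) {
        ImportFormatPreferences unstubbed = () -> null;
        ImportFormatPreferences stubbed = () -> ',';
        System.out.println(dedupe("cs.IR, cs.IR", unstubbed)); // duplicates remain
        System.out.println(dedupe("cs.IR, cs.IR", stubbed));   // deduplicated
    }
}
```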
Ok, now it should be good to go. The errors on
Thank you very much for your PR!
* upstream/main: (55 commits) Update guidelines-for-setting-up-a-local-workspace.md Add link to good first issues (and fix link) Refine guidelines (and fix .md file linking) Extend ArXiv fetcher results by using data from related DOIs (JabRef#9170) New Crowdin updates (JabRef#9366) Fix token New translations JabRef_en.properties (Chinese Simplified) (JabRef#9364) Fix date field change causes an uncaught exception (JabRef#9365) Add date format and respective test case (JabRef#9310) Bump jackson-datatype-jsr310 from 2.13.4 to 2.14.0 (JabRef#9357) Bump jackson-dataformat-yaml from 2.13.4 to 2.14.0 (JabRef#9356) Bump tika-core from 2.5.0 to 2.6.0 (JabRef#9358) Refine guideline "Set up a local workspace" (JabRef#9355) Adds missing "requires" statement (JabRef#9354) Bind "Find unlinked files" default directory to the current library (JabRef#9290) Rename "Remote" to "Tele" (JabRef#9270) Fixed display of file field adjusting for the dark theme (JabRef#9343) Update sorting of entries in maintable by special fields immediately (JabRef#9338) New translations JabRef_en.properties (German) (JabRef#9336) Fix Abbreviation for Escaped Ampersand (JabRef#9288) ... # Conflicts: # CHANGELOG.md # src/main/java/org/jabref/preferences/AppearancePreferences.java # src/main/java/org/jabref/preferences/JabRefPreferences.java # src/test/java/org/jabref/gui/theme/ThemeManagerTest.java
Closes #9092

- Keep the ArXiv fetcher, but intercept all BibTeX entries and combine them with the corresponding BibTeX entry obtained directly from its ArXiv-issued DOI (see this article)
- Create feature flag for toggling between new and old behavior
- Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
- Screenshots added in PR description (for UI changes)