
Extend ArXiv fetcher results by using data from related DOIs #9170

Conversation

thiagocferr
Contributor

@thiagocferr thiagocferr commented Sep 22, 2022

Closes #9092

  • Add a new fetcher that mimics the ArXiv fetcher but intercepts all BibTeX entries and combines each with the corresponding BibTeX entry obtained directly from its ArXiv-issued DOI (see this article)
  • Create a feature flag for toggling between the new and old behavior
  • Implement selective choice of which fields go into the merged result
  • Add more information from user-assigned DOIs
  • Decrease wait time by making additional API calls concurrently
  • Implement API rate limits for relevant fetchers

  • Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
  • Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

… from ArXiv-issued DOI

The new fetcher, 'ArXivWithDoi', implements the same interfaces as the 'ArXiv' fetcher and, for the most part, relies on calling the latter to satisfy those interfaces. The actual innovation is that every BibEntry returned by 'ArXiv' is replaced by a merged version of itself and the BibEntry from the DoiFetcher (with ArXiv's fields taking priority).

Additionally, two other changes were made:

- Added a 'merge()' method to BibEntry, so that two BibEntries can be merged (with priority given to the current 'BibEntry' object); a sketch of the intended semantics follows this list
- FOR NOW, all references to the 'ArXiv' class in other parts of the code base (```WebFetchers```, ```ImportHandler``` and ```CompositeIdFetcher```) were replaced with the new 'ArXivWithDoi' class, so that all UI paths lead to this new version. Future commits may make its use optional via feature flags
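To make the intended merge semantics concrete, here is a minimal, self-contained sketch (plain Java maps standing in for BibEntry fields; this is not JabRef's actual `merge()` code): fields already set on the current entry win, and fields present only in the other entry are copied over.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the merge behaviour described above: the current
// entry keeps its values, and only missing fields are taken from the other entry.
public class MergeSketch {

    static Map<String, String> merge(Map<String, String> current, Map<String, String> other) {
        Map<String, String> result = new LinkedHashMap<>(current);
        other.forEach(result::putIfAbsent); // current entry has priority
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> fromArxiv = new LinkedHashMap<>();
        fromArxiv.put("title", "The Architecture of Mr. DLib's Scientific Recommender-System API");
        fromArxiv.put("keywords", "cs.IR, cs.AI, cs.DL, cs.LG");

        Map<String, String> fromDoi = new LinkedHashMap<>();
        fromDoi.put("keywords", "Information Retrieval (cs.IR), Artificial Intelligence (cs.AI)");
        fromDoi.put("doi", "10.48550/ARXIV.1811.10364");
        fromDoi.put("publisher", "arXiv");

        // doi and publisher are added; keywords keep the ArXiv version
        System.out.println(merge(fromArxiv, fromDoi));
    }
}
```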

A small number of manual tests were made. More to come. Next commits should also integrate automated tests.
@thiagocferr
Contributor Author

Below, you can see the difference between importing with the old ArXiv fetcher, importing with only the ArXiv-issued DOI, and importing with the new ArXiv fetcher, respectively, for https://arxiv.org/abs/1811.10364. Note that when two fields clash, the chosen one is always the one from the old ArXiv fetcher.

If you have some opinions about this choice, feel free to discuss them.

@Comment{Import with the original 'ArXiv' fetcher}
@Article{Beel2018,
  author        = {Joeran Beel and Andrew Collins and Akiko Aizawa},
  title         = {The Architecture of Mr. DLib's Scientific Recommender-System API},
  year          = {2018},
  month         = nov,
  abstract      = {Recommender systems in academia are not widely available. This may be in part due to the difficulty and cost of developing and maintaining recommender systems. Many operators of academic products such as digital libraries and reference managers avoid this effort, although a recommender system could provide significant benefits to their users. In this paper, we introduce Mr. DLib's "Recommendations as-a-Service" (RaaS) API that allows operators of academic products to easily integrate a scientific recommender system into their products. Mr. DLib generates recommendations for research articles but in the future, recommendations may include call for papers, grants, etc. Operators of academic products can request recommendations from Mr. DLib and display these recommendations to their users. Mr. DLib can be integrated in just a few hours or days; creating an equivalent recommender system from scratch would require several months for an academic operator. Mr. DLib has been used by GESIS Sowiport and by the reference manager JabRef. Mr. DLib is open source and its goal is to facilitate the application of, and research on, scientific recommender systems. In this paper, we present the motivation for Mr. DLib, the architecture and details about the effectiveness. Mr. DLib has delivered 94m recommendations over a span of two years with an average click-through rate of 0.12%.},
  archiveprefix = {arXiv},
  eprint        = {1811.10364},
  file          = {:http\://arxiv.org/pdf/1811.10364v1:PDF},
  keywords      = {cs.IR, cs.AI, cs.DL, cs.LG},
  primaryclass  = {cs.IR},
}

@Comment{Import with only the 'DOI' fetcher}
@Misc{Beel2018a,
  author    = {Beel, Joeran and Collins, Andrew and Aizawa, Akiko},
  title     = {The Architecture of Mr. DLib's Scientific Recommender-System API},
  year      = {2018},
  copyright = {arXiv.org perpetual, non-exclusive license},
  doi       = {10.48550/ARXIV.1811.10364},
  keywords  = {Information Retrieval (cs.IR), Artificial Intelligence (cs.AI), Digital Libraries (cs.DL), Machine Learning (cs.LG), FOS: Computer and information sciences},
  publisher = {arXiv},
}

@Comment{Import with the new 'ArXivWithDoi' fetcher}
@Article{Beel2018b,
  author        = {Joeran Beel and Andrew Collins and Akiko Aizawa},
  title         = {The Architecture of Mr. DLib's Scientific Recommender-System API},
  year          = {2018},
  month         = nov,
  abstract      = {Recommender systems in academia are not widely available. This may be in part due to the difficulty and cost of developing and maintaining recommender systems. Many operators of academic products such as digital libraries and reference managers avoid this effort, although a recommender system could provide significant benefits to their users. In this paper, we introduce Mr. DLib's "Recommendations as-a-Service" (RaaS) API that allows operators of academic products to easily integrate a scientific recommender system into their products. Mr. DLib generates recommendations for research articles but in the future, recommendations may include call for papers, grants, etc. Operators of academic products can request recommendations from Mr. DLib and display these recommendations to their users. Mr. DLib can be integrated in just a few hours or days; creating an equivalent recommender system from scratch would require several months for an academic operator. Mr. DLib has been used by GESIS Sowiport and by the reference manager JabRef. Mr. DLib is open source and its goal is to facilitate the application of, and research on, scientific recommender systems. In this paper, we present the motivation for Mr. DLib, the architecture and details about the effectiveness. Mr. DLib has delivered 94m recommendations over a span of two years with an average click-through rate of 0.12%.},
  archiveprefix = {arXiv},
  copyright     = {arXiv.org perpetual, non-exclusive license},
  doi           = {10.48550/ARXIV.1811.10364},
  eprint        = {1811.10364},
  file          = {:http\://arxiv.org/pdf/1811.10364v1:PDF},
  keywords      = {cs.IR, cs.AI, cs.DL, cs.LG},
  primaryclass  = {cs.IR},
  publisher     = {arXiv},
}

@mlep
Contributor

mlep commented Sep 23, 2022

@thiagocferr
Thank you for tackling this issue.

"Import with the new 'ArXivWithDoi' fetcher" contains more information about the reference. That is nice!

I am not sure about the field file = {:http\://arxiv.org/pdf/1811.10364v1:PDF}. Shouldn't it be converted to url = {http://arxiv.org/pdf/1811.10364v1}? @ThiloteE: any suggestion?

About which field to keep:

  • I think we should keep the keywords provided by the DOI fetcher instead of the ones provided by the original arXiv fetcher.
  • For authors with compound last names (such as in https://arxiv.org/abs/2209.11222), the format provided by the DOI fetcher is better. Please, keep this one.

…Import and Export)

One might want to toggle between a simpler response from ArXiv and a more complete one (with the additional information from the auto-assigned DOI), maybe for a more concise BibTeX entry, maybe for a lower number of requests to web APIs. Either way, a boolean feature flag has been added for toggling between the new and old behavior (the **latter** is selected by default).

It can be found in "Options > Preferences > Import and Export > General > Use ArXiv-issued DOI for complementing ArXiv entry"

For this commit, the branching between the new and old behavior is done inside the new 'ArXivWithDoi' fetcher (which effectively wraps the 'ArXiv' fetcher), relying on information provided by external 'Import preferences'. This decision avoids repeating the branching in the multiple places that use the fetcher directly; a rough sketch follows below.
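For illustration only, a rough sketch of this toggle (the class, method, and preference names here are hypothetical, not the actual JabRef wiring):

```java
import java.util.Optional;

// Hypothetical sketch of the toggle described above: the wrapping fetcher
// consults a boolean preference and only performs the DOI merge when enabled.
public class ToggleSketch {

    private final boolean complementWithDoi; // "Use ArXiv-issued DOI for complementing ArXiv entry"

    ToggleSketch(boolean complementWithDoi) {
        this.complementWithDoi = complementWithDoi;
    }

    Optional<String> fetch(String arxivId) {
        Optional<String> plainArxivEntry = fetchFromArxiv(arxivId);
        if (!complementWithDoi || plainArxivEntry.isEmpty()) {
            return plainArxivEntry; // old behaviour, selected by default
        }
        return plainArxivEntry.map(entry -> entry + " + fields from DOI " + arxivIssuedDoi(arxivId));
    }

    // Stubs standing in for the real ArXiv and DOI lookups.
    private Optional<String> fetchFromArxiv(String id) { return Optional.of("entry(" + id + ")"); }
    private String arxivIssuedDoi(String id) { return "10.48550/arXiv." + id; }

    public static void main(String[] args) {
        System.out.println(new ToggleSketch(false).fetch("1811.10364")); // plain entry
        System.out.println(new ToggleSketch(true).fetch("1811.10364"));  // enriched entry
    }
}
```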
@thiagocferr
Contributor Author

thiagocferr commented Sep 23, 2022

@mlep

About the file vs url fields

In the JabRef UI, having one or the other allows for quicker access to the file from either

  1. The entry's General tab;
  2. The Entry Table's columns Linked files (for field file) / Linked identifiers (field url)

The thing is, by replacing file with url, the direct reference to the file is no longer in the Linked files column (represented by a 'file' symbol) and moves to the Linked identifiers column (a 'globe' symbol). However, that field contains more than one element, such as access links to the ArXiv page (not the PDF), so getting to the PDF becomes a two-step process: clicking the field opens a small menu, and clicking the url item there opens the PDF. This is shown in the image below:

[screenshot]

By using the file field, on the other hand, you can more easily access the PDF from the Entry Table by simply pressing the 'PDF' icon, while leaving the other field as an access link to ArXiv.org. This can be seen below as well:

[screenshot]

Although I see this as the better workflow, I think having both file and url may have broader usefulness, both programmatically (as in having the field in the BibTeX file) and for reducing the number of Entry Table columns. What do you think?

About fields to keep

  • Yes, the DOI fetcher seems to get more "human-friendly" information about keywords. If this is the intended direction, should it also remove the abbreviation that might come from the original ArXiv entry (i.e. "(cs.IR)" from "Information Retrieval (cs.IR)")?

  • It does seem to seem that the last name, first name is easier on Bibtex, so I'll be leaving the one from the DOI fetcher.

@ThiloteE
Member

ThiloteE commented Sep 23, 2022

I just had a look at this pr.

  1. It took me ages to search for https://arxiv.org/abs/1811.10364 with ArXiv websearch. - Is this normal?

  2. I think adding the following preference would probably be overkill:
    [screenshot]
    Following the principle of "fetching as much metadata as possible", it should be sufficient to merge the old ArXiv fetcher and the new implementation, as you attempt to do and simply make this the default :-)

  3. My thoughts about the file field:

    • If you have downloaded the file to your local harddrive and the link points to the path on your harddrive, then the file field should be used.
    • If the url directly refers to a pdf on a webpage, i think the file field is fair enough.
      Example: https://arxiv.org/pdf/1811.10364v1.pdf
    • If the url leads to a normal webpage (not directly to the pdf), then I would put it into the url field :-)
      Example: https://arxiv.org/abs/1811.10364v1
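As a toy illustration of that rule of thumb (not JabRef's actual logic, and assuming direct PDF links end in `.pdf`):

```java
// Hypothetical helper mirroring the rule above: direct PDF links go to the
// file field, plain web pages go to the url field.
public class LinkRouting {

    static String targetField(String link) {
        return link.toLowerCase().endsWith(".pdf") ? "file" : "url";
    }

    public static void main(String[] args) {
        System.out.println(targetField("https://arxiv.org/pdf/1811.10364v1.pdf")); // file
        System.out.println(targetField("https://arxiv.org/abs/1811.10364v1"));     // url
    }
}
```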

@ThiloteE
Member

ThiloteE commented Sep 23, 2022

The thing is, by replacing file with url, the direct reference to the file is no longer in the Linked files column (represented by a 'file' symbol) and moves to the Linked identifiers column (a 'globe' symbol). However, that field contains more than one element, such as access links to the ArXiv page (not the PDF), so getting to the PDF becomes a two-step process: clicking the field opens a small menu, and clicking the url item there opens the PDF. This is shown in the image below:

This argument only holds if there are not multiple files linked to the entry, which can happen quite easily. The reverse can also happen: multiple files are linked to an entry, but only one url. So I think this argument does not really help in solving this dilemma.

@mlep
Contributor

mlep commented Sep 23, 2022

  • It does seem to seem that the last name, first name is easier on Bibtex, so I'll be leaving the one from the DOI fetcher.

@ThiloteE Do you mean "keeping the one from the DOI fetcher"?

Thank you for the rationale about the field file. Difficult to decide... So I suggest we leave the file field as it is, as long as users do not raise the issue.

@ThiloteE
Member

ThiloteE commented Sep 23, 2022

@btut @mlep i think you meant to tag thiagocferr

@btut
Contributor

btut commented Sep 23, 2022

@ThiloteE huh?

@ThiloteE
Member

oooh lol. Sry I wanted to tag @mlep 🤣

@mlep
Contributor

mlep commented Sep 23, 2022

OK... Let's get this straight...

  • It does seem to seem that the last name, first name is easier on Bibtex, so I'll be leaving the one from the DOI fetcher.

@thiagocferr Do you mean "keeping the one from the DOI fetcher"?

@ThiloteE Thank you for the rationale about the field file. Difficult to decide... So I suggest we leave the file field as it is, as long as users do not raise the issue.

@thiagocferr
Contributor Author

It does seem to seem that the last name, first name is easier on Bibtex, so I'll be leaving the one from the DOI fetcher.

@thiagocferr Do you mean "keeping the one from the DOI fetcher"?

Yes, my bad.

@ThiloteE

  1. It took me ages to search for https://arxiv.org/abs/1811.10364 with ArXiv websearch. - Is this normal?

What exactly do you mean by that? If you mean searching using the link itself, I guess it doesn't work, as this error is logged when you use the link itself on the ArXiv Web search:

2022-09-23 17:07:17 [pool-2-thread-1] org.jabref.logic.importer.fetcher.transformers.AbstractQueryTransformer.transform()
ERROR: Unsupported case when transforming the query:
 <regexp field='https' term=''/>
2022-09-23 17:07:17 [pool-2-thread-1] org.jabref.logic.importer.fetcher.transformers.AbstractQueryTransformer.transform()
ERROR: Unsupported case when transforming the query:
 <regexp field='default' term='abs'/>

Using the ArXiv ID, it goes very fast for me...

  1. I think adding the following preference would probably be overkill:

Yes, I was wondering the same, but decided to just do it as both a learning exercise and a tool for helping me debug. I may just remove this when taking this PR for review.

Following the principle of "fetching as much metadata as possible", it should be sufficient to merge the old ArXiv fetcher and the new implementation, as you attempt to do and simply make this the default :-)

Yes, I do think that having two files marked as the 'ArXiv fetcher' is kind of bad, and if we adhere to getting the maximum information possible, it would be bad if any other file could directly instantiate the fetcher with less information.

However, I do think a separation between the class that does the heavy lifting (ArXiv.java) and the one that just does post-processing (ArXivWithDoi.java) (especially because it directly uses another fetcher) is required as good coding practice. So I'm about to make another commit doing exactly that: keep only one class, ArXivFetcher, a copy of ArXivWithDoi with the previous ArXiv class as an internal, private class. A structural sketch is shown below.
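A structural sketch of that restructuring (internals hypothetical; only the class layout mirrors the description above): a single public ArXivFetcher that post-processes results produced by a private inner class doing the raw ArXiv API work.

```java
import java.util.Optional;

// Hypothetical outline of the restructuring described above: the raw ArXiv API
// access lives in a private inner class, and the public fetcher only adds the
// DOI-based post-processing on top of it.
public class ArXivFetcherSketch {

    private final ArXivApi arXivApi = new ArXivApi();

    Optional<String> performSearchById(String identifier) {
        return arXivApi.performSearchById(identifier)
                       .map(this::complementWithArxivIssuedDoi);
    }

    private String complementWithArxivIssuedDoi(String entry) {
        // merge in fields fetched from https://doi.org/10.48550/arXiv.<id>
        return entry + " [+ DOI fields]";
    }

    // Stand-in for the previous ArXiv class, now hidden from the rest of the code base.
    private static class ArXivApi {
        Optional<String> performSearchById(String identifier) {
            return Optional.of("raw ArXiv entry for " + identifier);
        }
    }

    public static void main(String[] args) {
        System.out.println(new ArXivFetcherSketch().performSearchById("1811.10364"));
    }
}
```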

3. My thoughts about the `file` field:
   
   * If you have downloaded the file to your local harddrive and the link points to the path on your harddrive, then the file field should be used.
   * If the url directly refers to a pdf on a webpage, i think the file field is fair enough.
     Example: `https://arxiv.org/pdf/1811.10364v1.pdf`
   * If the url leads to a normal webpage (not directly to the pdf), then I would put it into the url field :-)
     Example: `https://arxiv.org/abs/1811.10364v1`

This take is reasonable, indeed...

The thing is, by replacing file with url, the direct reference to the file is no longer in the Linked files column (represented by a 'file' symbol) and moves to the Linked identifiers column (a 'globe' symbol). However, that field contains more than one element, such as access links to the ArXiv page (not the PDF), so getting to the PDF becomes a two-step process: clicking the field opens a small menu, and clicking the url item there opens the PDF. This is shown in the image below:

This argument only holds if there are not multiple files linked to the entry, which can happen quite easily. The reverse can also happen: multiple files are linked to an entry, but only one url. So I think this argument does not really help in solving this dilemma.

In a general setting, yes, this argument does not hold. However, the information imported from ArXiv seems to follow a standard here. Not that I've done extensive testing, but the entries all seem to include at least the Eprint field, which occupies the Linked identifiers column (and occasionally a DOI field; the manually introduced ones, that is). If we were to convert file to url, this would create at least two Linked identifiers (and that little submenu), but no direct access via Linked files. This is why I think eliminating file would be detrimental in this situation.

Now, about the use of url, I think it could be a good addition. It could be (optionally) used in BibTeX references (together with an access date) and is, overall, extra information worth recording. However, I think this change is out of the scope of this PR, as it could be done with only the information retrieved from the ArXiv API (the file field is filled by default), and this PR should only focus on merging in the new info from the DOI API.

Because of all of this, I think we should really just leave it the way it is: file with no url, for now.

Having two files marked as the 'ArXiv fetcher' is kind of bad, and if we adhere to getting the maximum information possible, it would be bad if any other file could directly instantiate the fetcher with less information.

At the same time, a separation between the class that does the heavy lifting (ArXiv.java) and the one that just does post-processing (ArXivWithDoi.java) (especially because it directly uses another fetcher) is required as good coding practice. This way, with this commit, only one class is kept: ArXivFetcher, a copy of ArXivWithDoi with the previous ArXiv class as an internal, private class.
@Siedlerchr
Member

Codewise it looks good so far. I would leave the storing of the URL in linked files; on import, JabRef can then automatically download it. Please have a look at the failing tests. I think for the architecture tests you just have to rename as well, or adapt to the new class names.

…I and contained an "External program change" bug

Because the last commit had been generating some unpredictable errors (some of which I assume were related to the many changes made for the feature flag implementation), I "reset" all changes (i.e. "git checkout main .") and just left the new unified ArXiv fetcher and the changes to BibEntry, theoretically leaving this commit in a state similar to the previous one, but without the feature flag button.

I also implemented a way to be selective about which fields from the DOI entry should overwrite the ones from ArXiv (for now, 'KEYWORDS' and 'AUTHOR') and changed the 'mergeWith()' function to apply changes to the current object instead of returning a copy as before.

Now, the thing that mostly took my time since the last commit was this weird bug: when saving a database with entries imported by the new ArXiv fetcher, a prompt "The library has been modified by another program." would always appear, asking to accept some changes, which always included a modification to the newly added entry. This made no sense, as there was neither any involvement of an external program, nor a modification since manually saving the database.

I seem to have found a possible, very weird cause: this would always happen when setting the 'KEYWORD' field of the resulting BibEntry to the raw string from the DOI BibTeX (as discussed before, it contains more detailed information, so it was included in the "prioritized fields" from DOI). The thing is, this string contained a duplicated "keyword", the FOS (which I suppose stands for "Field Of Subject" or similar) of the entry. You can see this behavior by making a GET request to https://doi.org/[ArXiv-assigned DOI] with the header "Accept=application/x-bibtex". When removing this duplication, the bug suddenly disappeared (it showed up once, but not since).

Maybe future commits will include a more definitive fix for this bug, but the current fix cannot really affect the end result (as unique keywords are what one would expect), so I'll leave it at that for now.
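As an aside, a minimal sketch of the deduplication idea referred to above (illustrative only; the actual fix in the fetcher may differ): split the raw keyword string on the separator, drop duplicates, and join it back together.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.stream.Collectors;

public class KeywordDedup {

    // Split the raw keyword string on the separator, drop duplicates while
    // preserving order, and join the remaining keywords again.
    static String deduplicate(String rawKeywords, String separator) {
        return Arrays.stream(rawKeywords.split(separator))
                     .map(String::trim)
                     .filter(s -> !s.isEmpty())
                     .collect(Collectors.toCollection(LinkedHashSet::new))
                     .stream()
                     .collect(Collectors.joining(separator + " "));
    }

    public static void main(String[] args) {
        // Hypothetical example of a DOI response repeating the FOS keyword, the
        // kind of duplication that triggered the spurious change prompt above.
        String raw = "Information Retrieval (cs.IR), FOS: Computer and information sciences, FOS: Computer and information sciences";
        System.out.println(deduplicate(raw, ","));
    }
}
```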
@thiagocferr
Contributor Author

So, I happened to get stuck on a weird bug since this commit (in reality, since this branch's first commit, according to my tests), and I made a change that seems to fix it, although I don't really understand how the two could possibly be related.

From my latest commit:

Now, the thing that mostly took my time since the last commit was this weird bug: when saving a database with entries imported by the new ArXiv fetcher, a prompt "The library has been modified by another program." would always appear, asking to accept some changes, which always included a modification to the newly added entry. This made no sense, as there was neither any involvement of an external program, nor a modification since manually saving the database.

I seem to have found a possible, very weird cause: this would always happen when setting the 'KEYWORD' field of the resulting BibEntry to the raw string from the DOI BibTeX (as discussed before, it contains more detailed information, so it was included in the "prioritized fields" from DOI). The thing is, this string contained a duplicated "keyword", the FOS (which I suppose stands for "Field Of Subject" or similar) of the entry. You can see this behavior by making a GET request to https://doi.org/[ArXiv-assigned DOI] with the header "Accept=application/x-bibtex". When removing this duplication, the bug suddenly disappeared (it showed up once, but not since).

Maybe future commits will include a more definitive fix for this bug, but the current fix cannot really affect the end result (as unique keywords are what one would expect), so I'll leave it at that for now.

You can see the related fix here.

From previous debugging sessions, I looked for reasons why there were differences between the in-memory and on-disk databases (and there were quite a few when inspecting both entries), but the discrepancies made no sense to me, so I reached the previous conclusion by trial and error (i.e. seeing the results of gradually dropping the overwrite of each field for one specific ArXiv example).

If you still see this behavior, please let me know the conditions under which it happened. I'd also appreciate it if someone could help me find the underlying cause of this bug and whether there would be a more concrete fix. Anyway, I think I'll be moving on to fixing the failed tests next.

@Siedlerchr
Member

Siedlerchr commented Sep 27, 2022

this would always happen when setting the 'KEYWORD' field of the resulting BibEntry to the raw string from the DOI BibTeX (as discussed before, it contains more detailed information, so it was included in the "prioritized fields" from DOI). The thing is, this string contained a duplicated "keyword", the FOS (which I suppose stands for "Field Of Subject" or similar) of the entry.

Thanks for the investigation. This library-modification thing has been haunting us for a while already. I recently fixed most occurrences of it, but I noticed it with keywords as well and could not pinpoint it. Can you open a new issue with your findings?

Beyond the new information coming from the always-consistent ArXiv-issued DOI entries, the ArXiv fetcher sometimes returns a DOI manually assigned by the publishing user, usually leading to other repositories (like ScienceDirect) and containing even more fields. As such, this commit tweaks the ArXiv fetcher to also include these new fields in the final result.

As the actual structure can be, at first glance, unpredictable (since it comes from different services), the only thing that actually overwrites the overall BibTeX entry (after merging with the ArXiv-assigned DOI entry) is the DOI, as it represents a link to another repository. Future updates could revise this decision, including setting up a way to catalog these different BibTeX structures and choose which provider has the "best formatted field" (from ArXiv, ArXiv's DOI, or an external DOI).
@mlep
Contributor

mlep commented Sep 28, 2022

@ThiloteE Thank you for the rationale about the field file. Difficult to decide... So I suggest we leave the file field as it is, as long as users do not raise the issue.

@ThiloteE 2 things to consider in the rationale:

  1. In the entry editor, for the field File, when you click on the "+" icon, a window opens and lets you select a file reachable by your OS. You cannot directly set a URL (to do this, you have to edit an existing item and paste the URL in place of the file path, which works, but is a weird way to do it). Hence, currently, the field File is designed for actual files (i.e. a reachable path on your system), not a link to the web.

  2. the biblatex documentation specifies that:

file field (verbatim)
A local link to a pdf or other version of the work. Not used by the standard bibliography styles.

This argues against using the field file for URLs. Hence, the file field provided by the 'ArXiv' fetcher should be converted to a url field.

@ThiloteE
Member

  1. It took me ages to search for https://arxiv.org/abs/1811.10364 with ArXiv websearch. - Is this normal?

What exactly do you mean by that? If you mean searching using the link itself, I guess it doesn't work, as this error is logged when you use the link itself on the ArXiv Web search:

2022-09-23 17:07:17 [pool-2-thread-1] org.jabref.logic.importer.fetcher.transformers.AbstractQueryTransformer.transform()
ERROR: Unsupported case when transforming the query:
 <regexp field='https' term=''/>
2022-09-23 17:07:17 [pool-2-thread-1] org.jabref.logic.importer.fetcher.transformers.AbstractQueryTransformer.transform()
ERROR: Unsupported case when transforming the query:
 <regexp field='default' term='abs'/>

Using the ArXiv ID, it goes very fast for me...

What I mean is, the web search seems imperfect:

  • Searching for "The Architecture of Mr. DLib's Scientific Recommender-System API" takes long (I canceled) ❌
  • Searching for "https://arxiv.org/abs/1811.10364" takes long (I canceled) ❌
  • Searching for "1811.10364" is fast ✔️
  • Searching for "arXiv:1811.10364" takes long (I canceled) ❌
  • Searching for "https://doi.org/10.48550/arXiv.1811.10364" takes long (I canceled) ❌
  • Searching for "Joeran Beel, Andrew Collins, Akiko Aizawa" takes a little, but will get found eventually 📶

@Siedlerchr
Member

@mlep JabRef will (if activated) automatically download the pdf file on import and replace the entry in the file field as far as I know

@ThiloteE The web search != the new entry by id. Web search is for searching by author or title. For ids -> New entry from id

This commit was meant to fix ArXivFetcherTest so that the tests could run and pass. This involved several tweaks to the test file, including adding and updating manual BibEntries with the new fields and mocking more behavior for DoiFetcher, among other things. Now, all ENABLED tests in this file should pass.

This process led to some modifications being needed on the ArXivFetcher, such as:

- Ignoring ArXiv-issued DOI info when querying ArXiv's API for entries, as it doesn't return any entry for them (only user-issued DOIs do, from what I could piece together)
- Replacing the "JOURNALTITLE" field with simply "JOURNAL", as it was more standard across tests
- Adding the "PUBLISHER" field as a priority for user-issued DOIs

Furthermore, I realized a certain problem with using the "KEYWORDS" field from ArXiv-issued DOIs: currently, two of the category taxonomies (https://arxiv.org/category_taxonomy) that are included in their expanded (full) form have commas as part of their names: "cs.CE (Computational Engineering, Finance, and Science)" and "cs.DC (Distributed, Parallel, and Cluster Computing)". As we use commas as the standard separator for keywords, this might be a problem... A solution could be enclosing such keywords in curly braces, but this should be discussed beforehand. A sketch of that option follows below.
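To make one of the options concrete, here is a hedged sketch of brace-protecting keywords that contain the separator (whether this is the right treatment is exactly what needs to be discussed):

```java
public class KeywordEscaping {

    // Wrap a keyword in braces if it contains the separator, so that a later
    // split on ',' does not break "Distributed, Parallel, and Cluster Computing" apart.
    static String protect(String keyword, char separator) {
        return keyword.indexOf(separator) >= 0 ? "{" + keyword + "}" : keyword;
    }

    public static void main(String[] args) {
        System.out.println(protect("cs.DC (Distributed, Parallel, and Cluster Computing)", ','));
        System.out.println(protect("cs.IR (Information Retrieval)", ','));
    }
}
```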
When refactoring the name of the ArXiv fetcher, comments that included "ArXiv" were turned into "ArXivFetcher", even though some tests relied on the "ArXiv" string as the name of the fetcher (and it still is, per the "getName()" method). This commit reverts that change.
Instead of sequentially calling the DOIFetcher (possibly twice) for every entry returned after querying the ArXiv API, make those calls happen in parallel, as they are completely unrelated to one another. This seems to significantly reduce processing time for larger paged results.
…ing time

In the previous commit, I only parallelized the batch processing of entries that are part of a search query. Now, individual searches for specific IDs are also sped up by parallel requests to the other sources that complement the original ArXiv entry (the ArXiv-issued and user-issued DOIs).

In order for that to happen, most of the previous code had to be adapted for parallel processing, especially through the use of CompletableFutures. If these changes stay, more automated testing is expected, as well as an API retry system in case of API throttling. A rough illustration of the approach follows below.
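As a rough, self-contained illustration of that idea (stub methods standing in for the real web requests; not the actual fetcher code), the two DOI lookups can be started as CompletableFutures and only joined when the merge needs them:

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

public class ParallelComplementSketch {

    // Stand-ins for the real web requests to the two DOI sources.
    static Optional<String> fetchFromArxivDoi(String id) { return Optional.of("fields from 10.48550/arXiv." + id); }
    static Optional<String> fetchFromUserDoi(String id)  { return Optional.of("fields from publisher DOI"); }

    static String complement(String arxivEntry, String id) {
        // Both DOI lookups start immediately and run concurrently...
        CompletableFuture<Optional<String>> arxivIssued = CompletableFuture.supplyAsync(() -> fetchFromArxivDoi(id));
        CompletableFuture<Optional<String>> userIssued  = CompletableFuture.supplyAsync(() -> fetchFromUserDoi(id));

        // ...and are only joined once the merge actually needs their results.
        String merged = arxivEntry;
        merged += arxivIssued.join().map(f -> " + " + f).orElse("");
        merged += userIssued.join().map(f -> " + " + f).orElse("");
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(complement("raw ArXiv entry", "1811.10364"));
    }
}
```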
@thiagocferr
Contributor Author

thiagocferr commented Oct 3, 2022

So, I've re-implemented most of the main combined fetching process to allow parallel API calls, which, from my experience, seems to reduce waiting time to about half of the previous implementation. Initially, I was only going to make parallel requests for search queries (as that was really trivial to do), but I wanted to try more in-depth parallelization in Java and thought the improved responsiveness of the UI would be neat (and, from what it seems, it is a little better than before).

Please point out if this approach would be problematic for some reason I might not know about. If I continue down this path, I may look for a way to do retries when API throttling happens and then add more automated tests (especially regarding processing errors and how the code handles them). One possible shape of such a retry is sketched below.
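If retries do get added later, one option would be a simple backoff loop along these lines (purely illustrative; no particular library or JabRef API is assumed):

```java
import java.util.concurrent.Callable;

public class RetrySketch {

    // Retry a call a few times with a growing delay, e.g. after an HTTP 429 from the API.
    static <T> T withRetries(Callable<T> call, int maxAttempts) throws Exception {
        long delayMillis = 500;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up after the last attempt
                }
                Thread.sleep(delayMillis);
                delayMillis *= 2; // exponential backoff before the next attempt
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(withRetries(() -> "fetched", 3));
    }
}
```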


But I do have an issue that I don't really know how to handle and would like opinions on: the separation of the new keywords field.

For some reason, ArXiv has only two category taxonomies with expanded names containing commas: cs.CE (Computational Engineering, Finance, and Science) and cs.DC (Distributed, Parallel, and Cluster Computing). These extended names are what go into the new keywords field, so if we assume the default keyword separator is a comma, JabRef will wrongly interpret what the specific keywords are.

Knowing this, what kind of approach would you suggest? Replacing the commas with something else? Just removing them? Something else? @mlep @ThiloteE @Siedlerchr

PS: How do I make the Deployment / Create installer and portable version for macOS (pull_request) test pass anyway?

@thiagocferr
Contributor Author

@Siedlerchr From my position, this PR is complete. However, I don't think an external code review has been done. If necessary, I'll make corrections to it.

@Siedlerchr Siedlerchr added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Nov 1, 2022
@Siedlerchr
Member

@thiagocferr Okay, cool! Before merging, a review by two maintainers is required. I will mark the PR as ready for review and try to go through it in more detail in the next days.

Co-authored-by: Christoph <siedlerkiller@gmail.com>
@koppor
Member

koppor commented Nov 7, 2022

About the file vs url fields

The file field having https content is also supported by JabRef. One can (IMHO) directly click on it to download the file. At least, I saw something in the code being able to handle URLs there.

Member

@koppor koppor left a comment

In general: looks good. I just have some small code style comments.

Review comments (now outdated and resolved) on:
.idea/runConfigurations/JabRef_Main.xml
src/main/java/org/jabref/model/entry/BibEntry.java (3 comments)
…f github.com:thiagocferr/jabref into 9092-arXiv-fetcher-import-more-information-using-DOI
@thiagocferr
Contributor Author

thiagocferr commented Nov 8, 2022

Wait... is something going on with the CI server? It says "Deprecated Gradle features were used in this build, making it incompatible with Gradle 8.0.", but it uses Gradle 7.5.1...

Edit: Oh, I thought that was the cause of the failing tests, but it was an error somewhere else...

@Siedlerchr
Member

You can ignore the Gradle version; it's just a warning, not an error.

Member

@koppor koppor left a comment

Very small comments - otherwise: Good to go!

Review comments (now outdated and resolved) on:
src/main/java/org/jabref/model/entry/BibEntry.java (2 comments)
@thiagocferr
Contributor Author

Oh, wait, there's still a bug on CompositeIdFetcherTest. It should be a quick fix.

When calling the ArXiv fetcher, 'CompositeIdFetcherTest' was returning a
wrong result (more specifically, a wrong 'keywords' field, with
duplicate keywords). This was caused by a missing return value on the
'importFormatPreferences.getKeywordSeparator()' mock, which is used by
ArXivFetcher for removing duplicate keywords (which pretty much always
happens during ArXivFetcher's processing).
@thiagocferr thiagocferr changed the title New Fetcher that mimics ArXiv fetcher, but imports more information from DOI Extend ArXiv fetcher results by using data from related DOIs Nov 9, 2022
@thiagocferr
Contributor Author

Ok, now it should be good to go. The errors in the fetcher tests don't seem to be related to this PR.

@Siedlerchr
Member

Thank you very much for your PR!

@Siedlerchr Siedlerchr merged commit 690ce4b into JabRef:main Nov 10, 2022
Siedlerchr added a commit to shafinkamal/jabref that referenced this pull request Nov 11, 2022
* upstream/main: (55 commits)
  Update guidelines-for-setting-up-a-local-workspace.md
  Add link to good first issues (and fix link)
  Refine guidelines (and fix .md file linking)
  Extend ArXiv fetcher results by using data from related DOIs (JabRef#9170)
  New Crowdin updates (JabRef#9366)
  Fix token
  New translations JabRef_en.properties (Chinese Simplified) (JabRef#9364)
  Fix date field change causes an uncaught exception (JabRef#9365)
  Add date format and respective test case (JabRef#9310)
  Bump jackson-datatype-jsr310 from 2.13.4 to 2.14.0 (JabRef#9357)
  Bump jackson-dataformat-yaml from 2.13.4 to 2.14.0 (JabRef#9356)
  Bump tika-core from 2.5.0 to 2.6.0 (JabRef#9358)
  Refine guideline "Set up a local workspace" (JabRef#9355)
  Adds missing "requires" statement (JabRef#9354)
  Bind "Find unlinked files" default directory to the current library (JabRef#9290)
  Rename "Remote" to "Tele" (JabRef#9270)
  Fixed display of file field adjusting for the dark theme (JabRef#9343)
  Update sorting of entries in maintable by special fields immediately (JabRef#9338)
  New translations JabRef_en.properties (German) (JabRef#9336)
  Fix Abbreviation for Escaped Ampersand (JabRef#9288)
  ...

# Conflicts:
#	CHANGELOG.md
#	src/main/java/org/jabref/preferences/AppearancePreferences.java
#	src/main/java/org/jabref/preferences/JabRefPreferences.java
#	src/test/java/org/jabref/gui/theme/ThemeManagerTest.java
Siedlerchr added a commit that referenced this pull request Nov 12, 2022
* upstream/main: (40 commits)
  Update guidelines-for-setting-up-a-local-workspace.md
  Add link to good first issues (and fix link)
  Refine guidelines (and fix .md file linking)
  Extend ArXiv fetcher results by using data from related DOIs (#9170)
  New Crowdin updates (#9366)
  Fix token
  New translations JabRef_en.properties (Chinese Simplified) (#9364)
  Fix date field change causes an uncaught exception (#9365)
  Add date format and respective test case (#9310)
  Bump jackson-datatype-jsr310 from 2.13.4 to 2.14.0 (#9357)
  Bump jackson-dataformat-yaml from 2.13.4 to 2.14.0 (#9356)
  Bump tika-core from 2.5.0 to 2.6.0 (#9358)
  Refine guideline "Set up a local workspace" (#9355)
  Adds missing "requires" statement (#9354)
  Bind "Find unlinked files" default directory to the current library (#9290)
  Rename "Remote" to "Tele" (#9270)
  Fixed display of file field adjusting for the dark theme (#9343)
  Update sorting of entries in maintable by special fields immediately (#9338)
  New translations JabRef_en.properties (German) (#9336)
  Fix Abbreviation for Escaped Ampersand (#9288)
  ...

# Conflicts:
#	CHANGELOG.md
Labels
fetcher status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers
Development

Successfully merging this pull request may close these issues.

arXiv fetcher: import more information using DOI
6 participants