-
-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adding DOAB to web search #8598
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was surprised to learn that it returns JSON as well. My browser showed XML...
@@ -103,7 +103,7 @@ public ArgumentProcessor(String[] args, Mode startupMode, PreferencesService pre | |||
ParserResult result = new ParserResult(entries); | |||
result.setToOpenTab(); | |||
return Optional.of(result); | |||
} catch (ParseException e) { | |||
} catch (ParseException | IOException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do not add the IOExceptio here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do not add the IOExceptio here
Sure I will remove the exceptions that I added, and move the method you mentioned.
I have a question, what do you think more attributes should be added to the resulting bibtex, the metadata of the API returns much more information, I looked at other fetchers and added the common attributes for a start, you can tell me if you would like other attributes to be added, for me I think a description should be added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should import all fields that can somehow be mapped to the BibTeX/biblatex
e.g. abstract (that is very useful) language, location, and maybe add the terms under other as keywords?
}; | ||
} | ||
|
||
private JSONArray toJsonArray(InputStream stream) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move that method also to the JsonReader Util class and handle the exception there. Then you don't need to throw and add an IO Exception everywhere
No, you are correct it returns xml I convert the xml to json and then to bibtex, it is easier for me to work with json, I should have explained that better in the description |
BibEntry entry = new BibEntry(); | ||
for (int i = 0; i < metadataArray.length(); i++) { | ||
JSONObject dataObject = metadataArray.getJSONObject(i); | ||
if (dataObject.getString("key").equals("dc.type")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also you can simply this using a switch case
Extract the dataObject.get("key) to a variable and then you can do a nice switch on the variable
see https://www.baeldung.com/java-switch#switch-expressions and also
https://www.baeldung.com/java-switch-pattern-matching
@Siedlerchr Hi, |
Hi, sorry for the late resonse: For the authors: If each author is in it's own field you can directly use the new Author(...) for each; Otherwise you can use Authorlist.parse()... Have a look at the CrossRef Fetcher. jabref/src/main/java/org/jabref/logic/importer/fetcher/CrossRef.java Lines 149 to 162 in 6e99a33
|
@Siedlerchr Hi, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general looks already good. Some minor improvements
}; | ||
} | ||
|
||
private BibEntry JsonToBibEntry(JSONArray metadataArray) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lowercase method name
BibEntry entry = JsonToBibEntry(metadataArray); | ||
return Collections.singletonList(entry); | ||
} | ||
// multiple results |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment can be removed, shold be obvious
|
||
public class DOABFetcher implements SearchBasedParserFetcher { | ||
private static final String SEARCH_URL = "https://directory.doabooks.org/rest/search?"; | ||
// private static final String PEER_REVIEW_URL = " https://directory.doabooks.org/rest/peerReviews?"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how you want to integrate it, should it be a different fitcher, DOABPeerReviews for example or maybe in the same fitcher but the user has two options
I am actually not sure if I understand what peer reviews mean in the world of books.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would ignore it for the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the comment then
return entry; | ||
} | ||
|
||
private Author toAuthor(String author) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if you have multiple authors?
https://directory.doabooks.org/handle/20.500.12854/79348?show=full
And I think you can use AuthorList.parse () as the names are already in bibtex format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 using :/AuthorList.parse
. There are a lot of weird edge-cases when it comes to authors
(see comment https://github.com/JabRef/jabref/pull/8598/files#r839925364 for a bit more discussion)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if you have multiple authors? https://directory.doabooks.org/handle/20.500.12854/79348?show=full And I think you can use AuthorList.parse () as the names are already in bibtex format
I created a list from the Author class and all authors are added to it then converted to an AuthorList I will add a test case for multiple authors, it works the same as multiple editors there a test case provide for that, but I agree with you I think there is a better solution for that, can you describe what AuthorList.parse () actually does and how to use it, I am not sure I completely understand it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nutshell interpretation of AuthorList
is that it is intended to represent one or more authors written in bibtex format.
AuthorList.parse
is intended to take a String
consisting of one or more authors in bibtex format, and store them internally in a List
that you can access directly through getAuthors()
if you need to.
For this case, I'd probably go with
authorsList.addAll(
AuthorList.parse(
dataObject.getString("value"))
.getAuthors())
even if we should only be getting one author from doab, and it isn't in bibtex format. Probably preprocess by checking for (Ed.) as in #8598 (comment)
Is that explanation/example useful?
You seem to already have nailed down how to deal with output from AuthorList. Do you have any questions regarding it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that was useful, and definitely a cleaner solution, I will implement this, and also start thinking about solutions to handle edgy cases like the one you mentioned here #8598 (comment) especially the author name that starts with Fatima which is formatted differently from the other authors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also noticed something for some results the field for the number of pages can come in a different format like this result here https://directory.doabooks.org/rest/search?query=%22UAV%E2%80%90Based%20Remote%20Sensing%20Volume%202%22&expand=metadata
this can be problematic because the type of the return is not an integer anymore it is a string.
public void TestPerformSearch() throws FetcherException { | ||
List<BibEntry> entries; | ||
entries = fetcher.performSearch("i open fire"); | ||
assertFalse(entries.isEmpty()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assertFalse and assertTrue make it hard to see why a test fails. In thise casse it's better you compare the entry collections.
e.g. assertEquals(Collections.singletonlist(expectedEntry),entries)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, sure
List<Author> authorsList = new ArrayList<>(); | ||
List<Author> editorsList = new ArrayList<>(); | ||
StringJoiner keywordJoiner = new StringJoiner(","); | ||
for (int i = 0; i < metadataArray.length(); i++) { | ||
JSONObject dataObject = metadataArray.getJSONObject(i); | ||
switch (dataObject.getString("key")) { | ||
case "dc.contributor.author" -> authorsList.add(toAuthor(dataObject.getString("value"))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the interest of trying to deepen the comment in https://github.com/JabRef/jabref/pull/8598/files#r839909593
In what circumstance(s) is the metadataArray
larger than 1 item?
If it can not be guaranteed that metadataArray
is exactly 1 item (in which case we have only one author and one editor field) I'd probably suggest
authorsList.addAll(AuthorList.parse(dataObject.getString("value")).getAuthors())
even if it is a bit ugly :/ (which should also cover the weird edge case of there being an author field but no parsable author)
It is probably possible to use StringJoiner with :/and
, but it feels a bit hacky
I don't know, I guess the best-case scenario is if 😛metadataArray
is only one item, but I guess we can't always get what we wish for
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at it a bit closer, it seems that I've spoken too soon.
@Mohamadi98 do you know if there is any information on what format they are using with more details?
E.g.,
<metadata>
<key>dc.contributor.author</key>
<language>*</language>
<value>Felipe Gonzalez Toro (Ed.)</value>
</metadata>
<metadata>
<key>dc.contributor.author</key>
<language>*</language>
<value>Kamruzzaman, Md. (Liton)</value>
</metadata>
<metadata>
<key>dc.contributor.author</key>
<language>*</language>
<value>Fátima Velez de Castro</value>
</metadata>
is a bit weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also follow-up in #8576 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the data itself returned from the API is in xml format, and the actual information that I used for the Bibtex entries are inside an array called metadata which you can add or remove from the results, see this https://directory.doabooks.org/rest/search?query=%22the+deliverance+of+open+access+books%22, so I agree with you it is a little bit weird
I convert the format to json, It's easier for me to work with json
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Mohamadi98 Most entries seem to be already in correct format, however some seem to indicate Ediors in the authors field
We suggest you apply a minimal heuristic:
- Check if the dc.contirbutor.author contains (Ed.)
- if yes -> remove that (Ed.) part
- Parse the value now using Authorlist.parse() it takes care of different names formats
- Repeat for other editors
- Put in the editors field
The other normal case:
- Check if the dc.contirbutor.author contains (Ed.)
- If no -> Parse it using AuthorLIst.parse()
- Repeat for other authors
- Put in the authors field
It's a bit of a mess with a list of list of authors... and the naming doesn't really help (yep, it's a mess^^)
But if you're stuck, just ask @k3KAW8Pnf7mkmdSMPHz27
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The AuthorList.parse can also handle names not in (LastName, FirstName) format you will see in one of the tests in the latest commit that there are names that returns from the API in (FirstName LastName) and are parsed correctly, so this nice
private String toAuthorList(List<Author> authorsList) { | ||
return AuthorList.of(authorsList).getAsFirstLastNamesWithAnd(); | ||
private String namePreprocessing(String name) { | ||
name = String.valueOf(new StringBuilder(name).delete(name.length() - 5, name.length())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not a simple replace? e.g. name.replace("(Eds)","") ?
I think currently this works fine for most entries of the API, |
@Mohamadi98 Cool! Thanks for the work so far! I guess we can't handle all edge cases. For example, if you cannot parse the author correctly you should nontheless add the rest of the fields that are, so the user has the option to manually edit/correct the entries Otherwise, looks good already. It would also be nice to implement the FullTextFetcher Interface, so that the PDF can be automatically downloaded. You can also do this as an extra PR. |
Yes, actually all names are parsed correctly in comparison to the result from the API maybe not correct for the real life name but in that case as you said the user can edit it or correct it, I will just add a try catch for the pages number case, there are a few results where the the number of pages is in the following format "(roman number), number" which causes an error because in this format the type is String and not an integer like the normal format, so that try catch should handle that fine. |
Sure no problem, but to make sure I understand, the FullTextFetcher supposed to find the page of a Bib entry for example a book probably using their DOI or URI , is that correct or something else ? |
@Mohamadi98 Yes, a Full-text fetcher takes a bib entry, and then returns the PDF document for it e.g. by checking for the DOI or any other field (e.g. title, this depends on the concrete fetcher). Whenever possible, unique identifier like DOI or ISBN etc. should be used to find an entry as other fields can be ambiguos. When you have an entry in JabRef and say "Find full text online", JabRef calls multiple fetchers, until one returns the PDF file. The fetchers also have a Trust level, e.g. if the entry or the PDF comes from the publisher website, it is considered as more trustworthy as when it comes from a website like ResearchGate. |
DevCall decision: @Siedlerchr tries this code locally and if it works, we merge. The FullTextFetcher can then go into a follow-up pull request. |
An entry to the |
Searching for "politics". Found:
But when importing via DOI this emerges:
To do:
|
Also, see following entries:
The
|
@Mohamadi98 It would be cool if you could address the remaining comments so we can merge this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some small comments to the test organization.
I would also like to see .withField
instead of .setField
. See #8693 for an example of the outcome.
@Test | ||
public void TestPerformSearch() throws FetcherException { | ||
List<BibEntry> entries; | ||
entries = fetcher.performSearch("i open fire"); | ||
assertEquals(Collections.singletonList(David_Opal), entries); | ||
} | ||
|
||
@Test | ||
public void TestPerformSearch2() throws FetcherException { | ||
List<BibEntry> entries; | ||
entries = fetcher.performSearch("the deliverance of open access books"); | ||
assertEquals(Collections.singletonList(Ronald_Snijder), entries); | ||
} | ||
|
||
@Test | ||
public void TestPerformSearch3() throws FetcherException { | ||
List<BibEntry> entries; | ||
entries = fetcher.performSearch("Four Kingdom Motifs before and beyond the Book of Daniel"); | ||
assertEquals(Collections.singletonList(Andrew_Perrin), entries); | ||
} | ||
|
||
@Test | ||
public void TestPerformSearch4() throws FetcherException { | ||
List<BibEntry> entries; | ||
entries = fetcher.performSearch("UAV Sensors for Environmental Monitoring"); | ||
assertEquals(Collections.singletonList(Felipe_Gonzalez), entries); | ||
} | ||
|
||
@Test | ||
public void TestPerformSearch5() throws FetcherException { | ||
List<BibEntry> entries; | ||
entries = fetcher.performSearch("The symbiosis between information system project complexity and information system project success"); | ||
assertEquals(Collections.singletonList(Carl_Marnewick), entries); | ||
} | ||
|
||
@Test | ||
public void TestPerformSearch6() throws FetcherException { | ||
List<BibEntry> entries; | ||
entries = fetcher.performSearch("UAV‐Based Remote Sensing Volume 2"); | ||
assertEquals(Collections.singletonList(Antonios_Tsourdos), entries); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be converted to a parameterized test --> https://www.baeldung.com/parameterized-tests-junit-5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be converted to a parameterized test --> https://www.baeldung.com/parameterized-tests-junit-5
Hi, do you mind taking a look at the test file for the last commit, I still want to use assertEquals or something similar to it but I couldn't use it with parameterized test, any advice on that ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because now the value of the expected variable in the test has to change with every new test case from the string array in ValueSource under ParameterizedTest
} | ||
|
||
@Test | ||
public void TestGetName() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Method names should start with a lower case:
public void TestGetName() { | |
public void testGetName() { |
okay, I think this could be done, can you just explain to me what affect does changing the type have, I don't think I understand that part about bibtex fully |
actually there is a field for volume in the API results, but it can be a little strange like for the example you mentioned there is no, but there is for other results, I think adding that field is a more generalized solution than parsing the volume number from the name string, it cause problems for maybe cases where the actual word "volume" is part of the name. I can't find a field for the ISBN in the API results, maybe I could communicate with Ronald from the issue here #8576 to ask about ISBN. For the abstract part, this is the format that comes directly from the API. |
Unfortunately, I can't find a field for subtitle in the API results, maybe it could be fetched from the webpage, but I don't know of a way to access webpages from jabref and fetch data from the page. |
sure, but can you explain the difference between "setfield" and "withfield". |
Fixing the entry type is comparatively important (More important than e.g. fetching the missing subtitle field). See here for a rough explanation: https://www.openoffice.org/bibliographic/bibtex-defs.html In short:
As a consequence, citation-styles will render the entries differently based on entry type. If the entry type is The entry type is offered in the API probably via the following:
You already fetched it, since it shows in the biblography as
All that needs to be done here is to do the conversion.
For this entry click on the bottom left show full item record, you will see info about ISBN:
Same procedure for this entry yields the subtitle:
and I think fetching "imprint" was also missing, so here we go:
Finally, with regard to:
These two issues are not important. I just noticed, but they are very low priority. They were more a question than a proposal. Don't do them. They do not prevent merging this pull request at all. Overall, the entries I fetched look already pretty good! Good job! |
|
5ef9e06
to
4042bcf
Compare
I added somethings, can you check the last commit and let me know if you have request or feedback also, I added oapen.relation.isbn so if that field is in the API result it will be fitched and assigned to the bibentry, however the problem is that despite you found it on the webpage it doesn't seem to appear in all API results I have seen including the example you mentioned, here is the API response for the query of the book you mentioned https://directory.doabooks.org/rest/search?query=%22Globalising%20Democracy%22&expand=metadata |
Note that the "secret sauce" of Parameterized tests is: return Stream.of(
Arguments.of( We applied that pattern at |
- Add CHANGELOG.md entry - Fix checkstyle Co-authored-by: Christoph <siedlerkiller@gmail.com> Co-authored-by: Carl Christian Snethlage <50491877+calixtus@users.noreply.github.com>
…there are two results) Co-authored-by: Christoph <siedlerkiller@gmail.com> Co-authored-by: Carl Christian Snethlage <50491877+calixtus@users.noreply.github.com>
203b152
to
d677bf1
Compare
@Mohamadi98 Thank you for your efforts in implementing this. Hope, you learned something here 🎉. Maybe, you can investiage on our final verison of the paramterized tests. @ThiloteE In case there still is something missing, please open a new issue. |
Thank you Mohamadi! :) |
By the way, if you write "Fixes #8576", It will automatically close the issue, when the pull request is merged :-) |
Definitely will look at it, It was a pleasure, thanks |
It was fun, Thank you ! |
CHANGELOG.md
described in a way that is understandable for the average user (if applicable)