Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fetcher for IACR eprints #3473

Merged
merged 25 commits into from
Dec 13, 2017
Merged

Conversation

derTimme
Copy link
Contributor

@derTimme derTimme commented Nov 30, 2017

I'm working on a new fetcher for IACR eprints, as I need a lot of these for my current work.

IACR does not provide a real API, so the information needs to be parsed out of their (really simple) HTML.

@devs: Are there any special requirements for fetchers? (I couldn't find any, but the whole HTML parsing thing might be a problem concerning maintainability - on the other hand, I use IACR for quite a while now and the web interface hasn't changed (visibly at least)).

The idea is to have the following features:

  • Fetch based on ID (like 2017/1556)
  • Fetched entry is of type misc and contains:
    • Title
    • Authors
    • Url (contains the version)
    • Note with \url{https://eprint.iacr.org/<identifier>} (recommended by IACR)
    • Howpublished like Cryptology ePrint Archive, Report <identifier> (recommended by IACR)
    • Version (most recent)
    • Date (of the most recent version)
    • Abstract
  • The encoding fits the library encoding (see b76a3f7, e. g. 2017/1095)

The following could also make sense:

  • Get fulltext (if that means to download the pdf - I haven't used this feature with other fetchers yet, so I don't really know what exactly it does)
  • Get the "also published in" information (see e.g. this paper) and DOI, if present (e.g. this paper)
  • Get keywords
  • Search in IACR preprints (e.g. for title) - but I suspect this would require us to download the complete archive or somehow "fill out" their search form (by crafting the corresponding POST request) programatically

  • Change in CHANGELOG.md described
  • Tests created for changes
  • Manually tested changed features in running JabRef
  • Check documentation status (Issue created for outdated help page at help.jabref.org?)
  • If you changed the localization: Did you run gradle localizationUpdate?

Note: This is WIP, the fetcher kind of worked in a few ad-hoc tests,
but it certainly isn't ready for production yet!
return formattedDates.get(0);
}

private static String getValueBetween(String from, String to, String haystack) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could see if there is already a method in StringUtils

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, there was indeed one. Keeping the functionality in a separate method though to have the error handling (nothing found) in one place.

if (dateMatcher.find()) {
Date date = DATE_FORMAT_WEBSITE.parse(dateMatcher.group(1));
formattedDates.add(DATE_FORMAT_BIBTEX.format(date));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please always catch the most specific exceptions.

@Siedlerchr
Copy link
Member

Hi,
I think the html parsing is not really a problem. We already use that in other cases as well. You could also take a look at jsoup for parsing html contents/finding dat between elements (should be already included in our deps)

@derTimme
Copy link
Contributor Author

derTimme commented Nov 30, 2017

The problem with using libraries to retrieve the information from html is that their html seems to be somewhat buggy (using <p /> for example).

By moving all the "what is the content of field xy?" logic to methods,
the setAdditionalFields method gives a good overview over which fields
are set.
The problem is that the matching of strings (find abstract, ... in the HTML)
is done against strings in the source code - therefore they are encoded with
the source code encoding.
The downloaded HTML however should be encoded with whatever the user selected
for her library. But then, the matching fails.
So this needs further investigation.
@derTimme
Copy link
Contributor Author

derTimme commented Dec 1, 2017

From my point of view, the fetcher with its core features is finished.
The additional features mentioned in the initial comment are something I might tackle later on.

@derTimme derTimme changed the title [WIP] Fetcher for IACR eprints Fetcher for IACR eprints Dec 1, 2017
Copy link
Member

@tobiasdiez tobiasdiez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good to me! Thanks for your contribution. I give my ok for merge, but it would be nice if you could migrate the added tests to junit 5 before.

import org.jabref.model.entry.FieldName;
import org.jabref.testutils.category.FetcherTest;

import org.junit.Before;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just started to use JUnit 5 and it would be nice if you could use the new api (some of the other fetcher tests are already migrated).

import static org.junit.Assert.*;
import static org.mockito.Mockito.mock;

@Category(FetcherTest.class)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When converting to JUnit 5, please replace the category by the FetcherTest annotation.

@tobiasdiez tobiasdiez added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Dec 1, 2017
@@ -37,6 +38,7 @@
private static final Predicate<String> IDENTIFIER_PREDICATE = Pattern.compile("\\d{4}/\\d{3,5}").asPredicate();
private static final String CITATION_URL_PREFIX = "https://eprint.iacr.org/eprint-bin/cite.pl?entry=";
private static final String DESCRIPTION_URL_PREFIX = "https://eprint.iacr.org/";
private static final Charset WEBSITE_CHARSET = Charset.forName("iso-8859-1");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very very minor, you can directly use the predefined enum Constant:
https://docs.oracle.com/javase/8/docs/api/java/nio/charset/StandardCharsets.html#ISO_8859_1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out - was looking for something like that, but apparently didn't look long enough...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For other cases where Java provides default enum variables, they all start with StandardXXX, for example for file opening there exists: StandardOpenOption, not really obvious if you search for it ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jep, I expected them in something like Charsets or directly as constants in the Charset class...

private static final Log LOGGER = LogFactory.getLog(IacrEprintFetcher.class);
private static final Pattern DATE_FROM_WEBSITE_PATTERN = Pattern.compile("[a-z ]+(\\d{1,2} [A-Za-z][a-z]{2} \\d{4})");
private static final DateFormat DATE_FORMAT_WEBSITE = new SimpleDateFormat("dd MMM yyyy");
private static final DateFormat DATE_FORMAT_BIBTEX = new SimpleDateFormat("yyyy-MM-dd");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SimpleDateFormat is outdated, it has been replaced by several other constructs in java8:
http://www.baeldung.com/java-8-date-time-intro
Or see example 18 here for an idea: http://javarevisited.blogspot.de/2015/03/20-examples-of-date-and-time-api-from-Java8.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't know that - I'm working on changing it.

String bibtexCitationHtml = getHtml(CITATION_URL_PREFIX + validIdentifier);
String actualEntry = getValueBetween("<PRE>", "</PRE>", bibtexCitationHtml);

Optional<BibEntry> entry;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, but you probably need to initialize it with Optional.empty() or you could get still an NPE if no entry is found

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the javadoc on BibtexParser.singleFromString is correct, it should always return an entry or an Optional.empty().
But I might as well initialize it...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can also just return the entry directly in the try construct. This is in my opinion the most readable solution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

abram2017 = new BibEntry();
abram2017.setType(BiblatexEntryTypes.MISC);
abram2017.setField("bibtexkey", "cryptoeprint:2017:1118");
abram2017.setField(FieldName.ABSTRACT, "The decentralized cryptocurrency Bitcoin has experienced great success but also encountered many challenges. One of the challenges has been the long confirmation time. Another challenge is the lack of incentives at certain steps of the protocol, raising concerns for transaction withholding, selfish mining, etc. To address these challenges, we propose Solida, a decentralized blockchain protocol based on reconfigurable Byzantine consensus augmented by proof-of-work. Solida improves on Bitcoin in confirmation time, and provides safety and liveness assuming the adversary control less than (roughly) one-third of the total mining power.\n");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please no abstracts in tests, as they are usually subject to copyright of the publisher

Copy link
Contributor Author

@derTimme derTimme Dec 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look into the license details - not having the abstract there is a bit of a problem as fetching the abstract is part of the functionality; therefore it should be tested.
On first glance, the whole article is published under CC-BY or CC-BY-NC; therefore this shouldn't be a problem as the authors are mentioned right next to the abstract text.
But if you still don't want the abstracts, I can modify the tests accordingly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a published article that can be received via IACR? In this case you hold the copyright and there is no problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet ;) But then I'll just remove the abstracts for now and maybe just check that they are not empty.

String bibtexCitationHtml = getHtml(CITATION_URL_PREFIX + validIdentifier);
String actualEntry = getValueBetween("<PRE>", "</PRE>", bibtexCitationHtml);

Optional<BibEntry> entry;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can also just return the entry directly in the try construct. This is in my opinion the most readable solution.

@FetcherTest
public class IacrEprintFetcherTest {

private static IacrEprintFetcher fetcher;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please not static and use BeforeEach instead BeforeAll (better initialize the fetcher and entries fresh for each test)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it's my first time using JUnit 5 and I just looked at some examples online which used BeforeAll...
Should have been suspicious when adding the statics...

abram2017 = new BibEntry();
abram2017.setType(BiblatexEntryTypes.MISC);
abram2017.setField("bibtexkey", "cryptoeprint:2017:1118");
abram2017.setField(FieldName.ABSTRACT, "The decentralized cryptocurrency Bitcoin has experienced great success but also encountered many challenges. One of the challenges has been the long confirmation time. Another challenge is the lack of incentives at certain steps of the protocol, raising concerns for transaction withholding, selfish mining, etc. To address these challenges, we propose Solida, a decentralized blockchain protocol based on reconfigurable Byzantine consensus augmented by proof-of-work. Solida improves on Bitcoin in confirmation time, and provides safety and liveness assuming the adversary control less than (roughly) one-third of the total mining power.\n");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a published article that can be received via IACR? In this case you hold the copyright and there is no problem.

@derTimme derTimme changed the title Fetcher for IACR eprints [WIP] Fetcher for IACR eprints Dec 2, 2017
@derTimme
Copy link
Contributor Author

derTimme commented Dec 2, 2017

I just ran some "manual" tests for very old eprints and discovered that they indeed changed something in 2000 - up to 2000, the eprints don't have versions and the date format is a bit different.
I made this "WIP" again and am working on also supporting these old entries.

@tobiasdiez tobiasdiez removed the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Dec 2, 2017
The entries before year 2000 use a slightly different format which e.g. doesn't
include a version, also the date format is different.

With this commit, we also throw an error if the user tries to fetch an entry
for a withdrawn paper. This is meant as a warning to the user, she might still
add the entry manually to her database.
This will be especially useful once a "search by title" or something similar
gets implemented.
@derTimme
Copy link
Contributor Author

derTimme commented Dec 7, 2017

I implemented support for pre-2000 entries; also added some tests for these as the pre-2000 entries don't seem to really have a "strong" standard format.
Luckily there aren't too much of these (about 65), so the tests just cover all of them.

I will now do some more "manual" testing, but I think everything will be finished in the next few hours.

Copy link
Member

@tobiasdiez tobiasdiez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks really good. I have just one remark concerning the translated strings and another suggestion:
There exists a convenient interface https://github.com/JabRef/jabref/blob/master/src/main/java/org/jabref/logic/importer/IdBasedParserFetcher.java for fetcher that follow the usual scheme: determine url, fetch response, parse. Since your fetcher follows this strategy it might be better to use IdBasedParserFetcher as a base. There are quite a few implementations already which may serve as a guide.

@@ -1819,6 +1819,11 @@ Copy_BibTeX_key_and_title=
File_rename_failed_for_%0_entries.=
Merged_BibTeX_source_code=
Invalid_DOI\:_'%0'.=Ugyldig_DOI\:_'%0'.
Invalid_IACR_identifier\:_'%0'.=
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please try to use a more generic version in these strings. As of now, they are not reusable in other fetchers or situations. E.g. just use invalid identifier or replace ICAR by a parameter slot.

Also: Remove some of them completly, replacing them with slightly
different existing ones.
@derTimme
Copy link
Contributor Author

derTimme commented Dec 7, 2017

I took a look at the IdBasedParserFetcher and if you insist, I could use it - but this fetcher works slightly different in that it requests two different URLs and builds the resulting entry with information from both those URLs.

@derTimme derTimme changed the title [WIP] Fetcher for IACR eprints Fetcher for IACR eprints Dec 7, 2017
@tobiasdiez tobiasdiez added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Dec 7, 2017
@tobiasdiez
Copy link
Member

ok, then leave it like that. It was just a suggestion that came to my mind while browsing the code.

codacy complained about reassigning a method parameter
and about the visibility of a test method.
Copy link
Member

@lenhard lenhard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks really good and I have nothing to criticize. It is ready to merge from my point of view.

However, I tried to test the feature in a running JabRef and somehow the IACR fetcher is not showing up in the drop-down menu in the web search side bar. It should be there, right? Am I missing something?

@derTimme
Copy link
Contributor Author

@lenhard I see your point - I haven't thought about that as I usually only use the "New entry" -> "ID based" way to fetch entries.
After looking into it, I don't think it belongs in the web search side bar as it only searches based on the ID of a paper; the DOI fetcher isn't in the side bar either. This makes sense from my point of view, as the web search side bar doesn't give any clues that it expects an ID. But if you think it makes sense to add it there, I can do that.

@lenhard
Copy link
Member

lenhard commented Dec 11, 2017

@derTimme Thanks for the explanation! What you write makes sense. There's no need to add the fetcher to the web search side bar as well.

Unfortunately, there are some merge conflicts in the language files now. Could you please resolve those and then this is ready to go into master.

@derTimme
Copy link
Contributor Author

I'm really confused right now... I tried to merge in master and apparently it worked - in my local repo, the localization files look fine and I can do

git checkout master
git merge iacr_eprint_fetcher

without any conflicts...

@Siedlerchr
Copy link
Member

@derTimme You have to first configure the JabRef/jabref repo as remote repository. Your master branch is still on the version when it was first forked. It does not get synced automatically.

Gitbub Help on adding a cloned repo with a remote one

Add JabRef as remote repo:

git remote add upstream https://github.com/JabRef/jabref.git
git checkout <your branch>  
git merge upstream/master 

The last one merges the changes from the upstream repo (in this case the JabRef/jabref main repo)

@Siedlerchr
Copy link
Member

Okay, as all conflicts are resolved and the reviews are okay, I merge it now into master!
Thank you very much for your contribution! We hope you continue your contribution ;)

@Siedlerchr Siedlerchr merged commit c2d0070 into JabRef:master Dec 13, 2017
@tobiasdiez
Copy link
Member

tobiasdiez commented Dec 13, 2017

I'll merge this now before you have to cope again with changes on the master branch. I'm sorry for the inconvenience caused by our recent change to the language files. Thanks for your contribution!

Edit: ok @Siedlerchr was quicker :-)

@derTimme
Copy link
Contributor Author

Thank you :)

Siedlerchr added a commit that referenced this pull request Dec 13, 2017
* upstream/master: (108 commits)
  Fetcher for IACR eprints (#3473)
  Update internal state of DatabaseChangeMonitor when external changes … (#3503)
  Fixes #3505: Another try to fix the NPE in the search bar (#3512)
  Replace &apos; with &#39; so that our HTML preview can handle it correctly
  Added a "Clear text" button in right click menu within the text boxes. (#3475)
  Add reset to English language after a test
  New translations JabRef_en.properties (German)
  Remove ampersand in non-menu localizations
  New translations JabRef_en.properties (German)
  New translations Menu_en.properties (German)
  New translations Menu_en.properties (German)
  New translations JabRef_en.properties (Vietnamese)
  New translations JabRef_en.properties (Italian)
  New translations Menu_en.properties (Italian)
  New translations JabRef_en.properties (Indonesian)
  New translations Menu_en.properties (Indonesian)
  New translations JabRef_en.properties (Greek)
  New translations Menu_en.properties (Greek)
  New translations Menu_en.properties (Japanese)
  New translations JabRef_en.properties (German)
  ...

# Conflicts:
#	build.gradle
@koppor koppor mentioned this pull request May 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants