Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PDF Downloader downloads HTML source code instead if no access #7452

Closed
1 task
Krzmbrzl opened this issue Feb 17, 2021 · 5 comments · Fixed by #7474
Closed
1 task

PDF Downloader downloads HTML source code instead if no access #7452

Krzmbrzl opened this issue Feb 17, 2021 · 5 comments · Fixed by #7474
Labels
fetcher good first issue An issue intended for project-newcomers. Varies in difficulty. import type: enhancement

Comments

@Krzmbrzl
Copy link

JabRef 5.2--2020-12-24--6a2a512
Linux 5.4.0-59-generic amd64
Java 15.0.1

Steps to reproduce the behavior:

  1. Import an entry via Browser extension for which one does not have acces rights for the PDF (e.g. https://onlinelibrary.wiley.com/doi/abs/10.1002/0470862106.ia615)
  2. JabRef automatically downloads the PDF (or rather it tries to do so)
  3. Instead of the PDF it downloads the HTML webpage that would pop up if I was to try to download the paper manually. Thus my "PDF" now only contains HTML source code

I have encountered this a few times by now and in all cases the downloaded "PDF" was a plain text file containg the HTML source code.
My suggestion for a mitigation would be to check the downloaded file and if it starts with plain text <!DOCTYPE html>, then assume the download has failed and remove the "PDF" again (restoring the file link with the download URL in JabRef).

@tobiasdiez tobiasdiez added type: enhancement good first issue An issue intended for project-newcomers. Varies in difficulty. labels Feb 17, 2021
@grundb
Copy link
Contributor

grundb commented Feb 26, 2021

Me and a few friends, @kittytinythai, @keivanm, @binsu-kth and @Kayi1500, would like to resolve this issue for a course project in DD2480 Software Engineering Fundamentals at KTH Royal Institute of Technology. Do you have any suggestions or things we need to know before we get started?

@Siedlerchr
Copy link
Member

Welcome to JabRef! As a starting point make sure to follow our conribution

Codewise the functionality of file downloading is handled here

@BJaroszkowski
Copy link
Contributor

I have looked into this and I can confirm that there is an issue with how downloading files via URL is handled. Basically, FileDownloadTask does not throw an exception if the destination extension does not match the type of the file to be downloaded and not even when the URL is valid but does not exist. In the latter case URLDownload returns an empty input stream which is then copied to a destination file leaving us with an empty file in the file system.

The question is how to deal with that. The proposed solution would lead to the cases when after mistyping the URL the download process would fail silently without informing the user about what happened. A quick and dirty fix that I have implemented is to do a check within the FileDownloadTask that looks something like:

        URLDownload download = new URLDownload(source);
        if (!download.canBeReached() || (destination.toString().endsWith(".pdf") && !download.isPdf()) ) {
            throw new IOException("The provided URL is inaccessible or does not exist");
        }

This of course would only work for downloading files with .pdf extension. I think a better idea would be to do a similar check within prepareDownloadTask method of LinkedFileViewModel where we actually detect file extension. I can work on that but I would rather have one of the devs weigh in first.

@Siedlerchr
Copy link
Member

I think a better idea would be to do a similar check within prepareDownloadTask method of LinkedFileViewModel where
we actually detect file extension. I can work on that but I would rather have one of the devs weigh in first.

Thanks for looking into it, UrlDownload has a method for detecting the mime type of the file. Maybe you can use that in addition for checking with the empty File stream. Having empty files is not useful. It could be checked if a) is reachable and maybe give a warning if the mime type is HTML
But keep in mind that some users might want to save a snapshot of a website (e.g. for online sources from websites).

binsu-kth added a commit to DD2480-2021-group-22/jabref that referenced this issue Feb 28, 2021
@grundb
Copy link
Contributor

grundb commented Feb 28, 2021

We have now produced a draft PR for this (#7474), and we would appreciate any feedback. It seems much of the logic for saving the file as an html file was already implemented, but the problem was that the getExternalFileTypeByMimeType method ignored the optional parameter part of the mime type string:

/**
* Look up the external file type registered with this name, if any.
*
* @param name The file type name.
* @return The ExternalFileType registered, or null if none.
*/
public Optional<ExternalFileType> getExternalFileTypeByName(String name) {
Optional<ExternalFileType> externalFileType = externalFileTypes.stream().filter(type -> type.getName().equals(name)).findFirst();
if (externalFileType.isPresent()) {
return externalFileType;
}
// Return an instance that signifies an unknown file type:
return Optional.of(new UnknownExternalFileType(name));
}

This solution does not deal with the case of empty files as discussed by @BJaroszkowski, but this could of course be included (or addressed in a separate issue). We plan to add appropriate unit tests before a merge.

grundb pushed a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021
Adds ignore parameter to mime type and notify the user if the file downloaded is HTML.
grundb pushed a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021
Use Localization when writing messages in the status bar.
grundb pushed a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021
grundb pushed a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021
replace StringUtils::substringBefore with String::substring
grundb added a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021
Add test ensuring the UI warns the user if they download a linked HTML file (i.e. a web page).
grundb pushed a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021
The test checks the resulting file type when downloading a HTML file.
grundb pushed a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021
Add check to only process the mimeType if an ';' exists inside the string
grundb pushed a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021
Tests that mime type with parameter value is parsed correctly to exclude the parameter.
grundb pushed a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021
grundb pushed a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021
grundb pushed a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021
Sets a new system-wide cookie manager if there is none, and sets the cookie policy
to ACCEPT_NONE after each test.
grundb pushed a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021
grundb added a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021
Clean up code styling according to JabRef style guidelines.
grundb added a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021
binsu-kth added a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 6, 2021
binsu-kth added a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 6, 2021
binsu-kth added a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 6, 2021
Siedlerchr added a commit that referenced this issue Mar 14, 2021
…roup-22/jabref into DD2480-2021-group-22-fix-for-issue-7452

* 'fix-for-issue-7452' of https://github.com/DD2480-2021-group-22/jabref:
  Refactor LinkedFileViewModelTest removing redundant code (#7452)
  Refactor LinkedFileViewModelTest removing redundant code (#7452)
  Refactor LinkedFileViewModelTest adding mock for JabRefPreferences used in testing (#7452)
  Remove duplicate changelog entry (#7452)
  Clean up code style (#7452) Clean up code styling according to JabRef style guidelines.
  Add test for when a linked file points to a PDF url (#7452)
  Reset cookie policy in test (#7452)
  Clarify changes (#7452)
  Add changes to changelog (#7452)
  Add unit test for mime type parsing (#7452)
  Fix mime type parsing bug (#7452) Add check to only process the mimeType if an ';' exists inside the string
  Add unit test for HTML file (#7452)
  Add UI test (#7452) Add test ensuring the UI warns the user if they download a linked HTML file (i.e. a web page).
  Replace apache StringUtils (#7452) replace StringUtils::substringBefore with String::substring
  Add debug message (#7452)
  Update status bar message (#7452)
  Ignore mime type params  (#7452)
Siedlerchr added a commit that referenced this issue Mar 14, 2021
* Ignore mime type params  (#7452)

Adds ignore parameter to mime type and notify the user if the file downloaded is HTML.

* Update status bar message (#7452)

Use Localization when writing messages in the status bar.

* Add debug message (#7452)

* Replace apache StringUtils (#7452)
replace StringUtils::substringBefore with String::substring

* Add UI test (#7452)
Add test ensuring the UI warns the user if they download a linked HTML file (i.e. a web page).

* Add unit test for HTML file (#7452)

The test checks the resulting file type when downloading a HTML file.

* Fix mime type parsing bug (#7452)
Add check to only process the mimeType if an ';' exists inside the string

* Add unit test for mime type parsing (#7452)

Tests that mime type with parameter value is parsed correctly to exclude the parameter.

* Add changes to changelog (#7452)

* Clarify changes (#7452)

* Reset cookie policy in test (#7452)

Sets a new system-wide cookie manager if there is none, and sets the cookie policy
to ACCEPT_NONE after each test.

* Add test for when a linked file points to a PDF url (#7452)

* Clean up code style (#7452)
Clean up code styling according to JabRef style guidelines.

* Remove duplicate changelog entry (#7452)

* Refactor LinkedFileViewModelTest adding mock for JabRefPreferences used in testing (#7452)

* Refactor LinkedFileViewModelTest removing redundant code (#7452)

* Refactor LinkedFileViewModelTest removing redundant code (#7452)

* Fix tests

* fix checkstyle

Co-authored-by: Binxin <binxin@kth.se>
Co-authored-by: kittyt <kittyt@kth.se>
Co-authored-by: Keivan Matinzadeh <matinzadeh.keivan@gmail.com>
Co-authored-by: Johan Grundberg <johan.grundberg98@gmail.com>
Co-authored-by: Johan Grundberg <grundb@kth.se>
Co-authored-by: kaniyi <kaniyi@kth.se>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fetcher good first issue An issue intended for project-newcomers. Varies in difficulty. import type: enhancement
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants