PDF Downloader downloads HTML source code instead if no access #7452

Krzmbrzl · 2021-02-17T11:42:32Z

JabRef 5.2--2020-12-24--6a2a512
Linux 5.4.0-59-generic amd64
Java 15.0.1

Mandatory: I have tested the latest development version from http://builds.jabref.org/master/ and the problem persists

Steps to reproduce the behavior:

Import an entry via the browser extension for which one does not have access rights to the PDF (e.g. https://onlinelibrary.wiley.com/doi/abs/10.1002/0470862106.ia615)
JabRef automatically downloads the PDF (or rather it tries to do so)
Instead of the PDF it downloads the HTML webpage that would pop up if I was to try to download the paper manually. Thus my "PDF" now only contains HTML source code

I have encountered this a few times by now and in all cases the downloaded "PDF" was a plain text file containing the HTML source code.
My suggestion for a mitigation would be to check the downloaded file and if it starts with plain text <!DOCTYPE html>, then assume the download has failed and remove the "PDF" again (restoring the file link with the download URL in JabRef).
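A minimal sketch of the suggested check, assuming the file has already been written to disk. `DownloadedFileSniffer` and its methods are hypothetical names for illustration and are not part of JabRef. Besides looking for `<!DOCTYPE html>`, it also tests for the `%PDF-` header that PDF files normally start with, which is an addition to the original suggestion.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

/** Hypothetical helper (not JabRef API): inspects the first bytes of a downloaded file. */
final class DownloadedFileSniffer {

    // Number of leading bytes to inspect; an arbitrary choice for this sketch.
    private static final int HEADER_LENGTH = 64;

    private DownloadedFileSniffer() {
    }

    /** Reads up to HEADER_LENGTH bytes; ISO-8859-1 maps every byte to exactly one char. */
    static String readHeader(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = in.readNBytes(HEADER_LENGTH);
            return new String(buffer, StandardCharsets.ISO_8859_1);
        }
    }

    /** True if the header starts with the usual PDF signature. */
    static boolean looksLikePdf(String header) {
        return header.startsWith("%PDF-");
    }

    /** True if the header starts with an HTML doctype or an html tag, ignoring case and leading whitespace. */
    static boolean looksLikeHtml(String header) {
        String normalized = header.stripLeading().toLowerCase(Locale.ROOT);
        return normalized.startsWith("<!doctype html") || normalized.startsWith("<html");
    }
}
```

A caller could run `looksLikeHtml(readHeader(file))` on a file saved with a `.pdf` extension and, if it returns true, remove the file and restore the URL link as proposed above.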


grundb · 2021-02-26T17:41:03Z

Me and a few friends, @kittytinythai, @keivanm, @binsu-kth and @Kayi1500, would like to resolve this issue for a course project in DD2480 Software Engineering Fundamentals at KTH Royal Institute of Technology. Do you have any suggestions or things we need to know before we get started?

Siedlerchr · 2021-02-26T17:53:58Z

Welcome to JabRef! As a starting point make sure to follow our contribution

Codewise the functionality of file downloading is handled here

jabref/src/main/java/org/jabref/gui/fieldeditors/LinkedFileViewModel.java

Line 416 in 60cc355

public void download() {

BJaroszkowski · 2021-02-26T20:39:55Z

I have looked into this and I can confirm that there is an issue with how downloading files via URL is handled. Basically, FileDownloadTask does not throw an exception if the destination extension does not match the type of the file to be downloaded and not even when the URL is valid but does not exist. In the latter case URLDownload returns an empty input stream which is then copied to a destination file leaving us with an empty file in the file system.

The question is how to deal with that. With the proposed solution, a mistyped URL would make the download fail silently, without informing the user about what happened. A quick and dirty fix that I have implemented is a check within FileDownloadTask that looks something like:

        URLDownload download = new URLDownload(source);
        if (!download.canBeReached() || (destination.toString().endsWith(".pdf") && !download.isPdf()) ) {
            throw new IOException("The provided URL is inaccessible or does not exist");
        }

This of course would only work for downloading files with .pdf extension. I think a better idea would be to do a similar check within prepareDownloadTask method of LinkedFileViewModel where we actually detect file extension. I can work on that but I would rather have one of the devs weigh in first.
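The empty-file case described above could be handled independently of the file extension. The following is a sketch only: `NonEmptyDownload` and `copyRejectingEmpty` are hypothetical names, and the way JabRef's FileDownloadTask actually copies the stream may differ. The idea is to copy into a temporary file first and move it to the destination only if at least one byte arrived.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper (not JabRef API): refuses to leave an empty file at the destination. */
final class NonEmptyDownload {

    private NonEmptyDownload() {
    }

    /**
     * Copies the stream to a temporary file next to the destination and moves it into
     * place only if it is non-empty. Returns the number of bytes written.
     */
    static long copyRejectingEmpty(InputStream source, Path destination) throws IOException {
        Path parent = destination.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(parent, "download", ".part");
        try {
            // createTempFile already created the file, so it has to be replaced.
            long bytes = Files.copy(source, temp, StandardCopyOption.REPLACE_EXISTING);
            if (bytes == 0) {
                throw new IOException("Download produced an empty file: " + destination);
            }
            Files.move(temp, destination, StandardCopyOption.REPLACE_EXISTING);
            return bytes;
        } finally {
            // No-op after a successful move; cleans up the temporary file on failure.
            Files.deleteIfExists(temp);
        }
    }
}
```

Because the exception carries a message, the caller can report the failure to the user instead of failing silently.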

Siedlerchr · 2021-02-27T17:46:26Z

I think a better idea would be to do a similar check within prepareDownloadTask method of LinkedFileViewModel where
we actually detect file extension. I can work on that but I would rather have one of the devs weigh in first.

Thanks for looking into it. URLDownload has a method for detecting the MIME type of the file. Maybe you can use that in addition to checking for the empty file stream, since having empty files is not useful. It could be checked whether the URL is reachable, and maybe a warning could be given if the MIME type is HTML.
But keep in mind that some users might want to save a snapshot of a website (e.g. for online sources from websites).
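One way to reconcile both points is to keep the HTML file but warn when HTML was not what the user asked for. The sketch below is illustrative only: `DownloadTypeCheck` and `htmlWarning` are hypothetical names, and the MIME type is assumed to have been reduced to its `type/subtype` part before the call.

```java
import java.util.Optional;

/** Hypothetical helper (not JabRef API): decides whether an HTML download deserves a warning. */
final class DownloadTypeCheck {

    private DownloadTypeCheck() {
    }

    /**
     * Returns a warning message if the server sent HTML although another file type was
     * expected, and an empty Optional otherwise (including intentional web page snapshots).
     */
    static Optional<String> htmlWarning(String baseMimeType, String expectedExtension) {
        boolean gotHtml = "text/html".equalsIgnoreCase(baseMimeType);
        boolean wantedHtml = "html".equalsIgnoreCase(expectedExtension)
                || "htm".equalsIgnoreCase(expectedExtension);
        if (gotHtml && !wantedHtml) {
            return Optional.of("Expected a ." + expectedExtension
                    + " file, but the server returned a web page (text/html).");
        }
        return Optional.empty();
    }
}
```

The download itself is not blocked here; the caller decides whether to show the message, for example in the status bar.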


grundb · 2021-02-28T15:57:49Z

We have now produced a draft PR for this (#7474), and we would appreciate any feedback. It seems much of the logic for saving the file as an html file was already implemented, but the problem was that the getExternalFileTypeByMimeType method ignored the optional parameter part of the mime type string:

jabref/src/main/java/org/jabref/gui/externalfiletype/ExternalFileTypes.java

Lines 51 to 64 in 1f775d7

        /**
         * Look up the external file type registered with this name, if any.
         *
         * @param name The file type name.
         * @return The ExternalFileType registered, or null if none.
         */
        public Optional<ExternalFileType> getExternalFileTypeByName(String name) {
            Optional<ExternalFileType> externalFileType = externalFileTypes.stream().filter(type -> type.getName().equals(name)).findFirst();
            if (externalFileType.isPresent()) {
                return externalFileType;
            }
            // Return an instance that signifies an unknown file type:
            return Optional.of(new UnknownExternalFileType(name));
        }

This solution does not deal with the case of empty files as discussed by @BJaroszkowski, but this could of course be included (or addressed in a separate issue). We plan to add appropriate unit tests before a merge.
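To illustrate the parameter problem: a server may send `text/html; charset=UTF-8`, which does not equal `text/html` when compared as a plain string. The following sketch shows one way to drop the parameter part with `String::substring`. `MimeTypes` and `stripParameters` are hypothetical names and this is not the code from the PR; the trimming and lower-casing are extra normalization added for this example.

```java
import java.util.Locale;

/** Hypothetical helper (not JabRef API): reduces a MIME type string to type/subtype. */
final class MimeTypes {

    private MimeTypes() {
    }

    /** Removes everything from the first ';' onwards, if there is one, then normalizes. */
    static String stripParameters(String mimeType) {
        int separator = mimeType.indexOf(';');
        // Only cut the string if a parameter separator is actually present.
        String base = separator >= 0 ? mimeType.substring(0, separator) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }
}
```

With this, `stripParameters("text/html; charset=UTF-8")` yields `text/html`, which can then be matched against the registered file types.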


* Ignore mime type params (#7452)
  Adds ignore parameter to mime type and notify the user if the file downloaded is HTML.
* Update status bar message (#7452)
  Use Localization when writing messages in the status bar.
* Add debug message (#7452)
* Replace apache StringUtils (#7452)
  replace StringUtils::substringBefore with String::substring
* Add UI test (#7452)
  Add test ensuring the UI warns the user if they download a linked HTML file (i.e. a web page).
* Add unit test for HTML file (#7452)
  The test checks the resulting file type when downloading a HTML file.
* Fix mime type parsing bug (#7452)
  Add check to only process the mimeType if an ';' exists inside the string
* Add unit test for mime type parsing (#7452)
  Tests that mime type with parameter value is parsed correctly to exclude the parameter.
* Add changes to changelog (#7452)
* Clarify changes (#7452)
* Reset cookie policy in test (#7452)
  Sets a new system-wide cookie manager if there is none, and sets the cookie policy to ACCEPT_NONE after each test.
* Add test for when a linked file points to a PDF url (#7452)
* Clean up code style (#7452)
  Clean up code styling according to JabRef style guidelines.
* Remove duplicate changelog entry (#7452)
* Refactor LinkedFileViewModelTest adding mock for JabRefPreferences used in testing (#7452)
* Refactor LinkedFileViewModelTest removing redundant code (#7452)
* Refactor LinkedFileViewModelTest removing redundant code (#7452)
* Fix tests
* fix checkstyle

Co-authored-by: Binxin <binxin@kth.se>
Co-authored-by: kittyt <kittyt@kth.se>
Co-authored-by: Keivan Matinzadeh <matinzadeh.keivan@gmail.com>
Co-authored-by: Johan Grundberg <johan.grundberg98@gmail.com>
Co-authored-by: Johan Grundberg <grundb@kth.se>
Co-authored-by: kaniyi <kaniyi@kth.se>

tobiasdiez added import fetcher labels Feb 17, 2021

tobiasdiez added type: enhancement, good first issue labels Feb 17, 2021

binsu-kth added a commit to DD2480-2021-group-22/jabref that referenced this issue Feb 28, 2021

fix JabRef#7452: Adds ignore parameter to minetyp and notify the user…

622abb9

… if the file downloaded is HTML

binsu-kth mentioned this issue Feb 28, 2021

PDF Downloader downloads HTML source code instead if no access #7474

Merged

5 tasks

grundb mentioned this issue Mar 1, 2021

Provide file download link when download fails DD2480-2021-group-22/jabref#6

Closed

kittytinythai mentioned this issue Mar 2, 2021

Test: Download HTML when linked file points to HTML DD2480-2021-group-22/jabref#17

Closed

grundb pushed a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021

Ignore mime type params (JabRef#7452)

c6e95c7

Adds ignore parameter to mime type and notify the user if the file downloaded is HTML.

grundb pushed a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021

Update status bar message (JabRef#7452)

a99708f

Use Localization when writing messages in the status bar.

grundb pushed a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021

Add debug message (JabRef#7452)

6f4ca6b

grundb pushed a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021

Replace apache StringUtils (JabRef#7452)

a4c3008

replace StringUtils::substringBefore with String::substring

grundb added a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021

Add UI test (JabRef#7452)

24669c8

Add test ensuring the UI warns the user if they download a linked HTML file (i.e. a web page).

grundb pushed a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021

Add unit test for HTML file (JabRef#7452)

8867b5b

The test checks the resulting file type when downloading a HTML file.

grundb pushed a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021

Fix mime type parsing bug (JabRef#7452)

647c1b1

Add check to only process the mimeType if an ';' exists inside the string

grundb pushed a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021

Add unit test for mime type parsing (JabRef#7452)

a55b2bc

Tests that mime type with parameter value is parsed correctly to exclude the parameter.

grundb pushed a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021

Add changes to changelog (JabRef#7452)

2f536a3

grundb pushed a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021

Clarify changes (JabRef#7452)

85e8f8d

grundb pushed a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021

Reset cookie policy in test (JabRef#7452)

0f5a819

Sets a new system-wide cookie manager if there is none, and sets the cookie policy to ACCEPT_NONE after each test.

grundb pushed a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021

Add test for when a linked file points to a PDF url (JabRef#7452)

e2f4c6f

grundb added a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021

Clean up code style (JabRef#7452)

5ec83b6

Clean up code styling according to JabRef style guidelines.

grundb added a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 4, 2021

Remove duplicate changelog entry (JabRef#7452)

40fa1dd

binsu-kth added a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 6, 2021

Refactor LinkedFileViewModelTest adding mock for JabRefPreferences us…

32b9433

…ed in testing (JabRef#7452)

binsu-kth added a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 6, 2021

Refactor LinkedFileViewModelTest removing redundant code (JabRef#7452)

263178a

binsu-kth added a commit to DD2480-2021-group-22/jabref that referenced this issue Mar 6, 2021

Refactor LinkedFileViewModelTest removing redundant code (JabRef#7452)

b873b33

Siedlerchr closed this as completed in #7474 Mar 14, 2021

koppor moved this to Done in Features & Enhancements Nov 7, 2022

koppor added this to Features & Enhancements Nov 7, 2022

Siedlerchr mentioned this issue Aug 28, 2023

"Download linked file" option creates an html file instead of downloading the pdf on Windows 10. #10149

Closed

2 tasks
