Problem verifying the hash of a downloaded zip file from ICGEM #185

Closed
MarkWieczorek opened this issue Jun 17, 2020 · 6 comments

MarkWieczorek commented Jun 17, 2020

I am having a problem verifying the hash of a zip file downloaded from the ICGEM website. I have downloaded zip files with pooch from other repositories, so in principle I should be doing everything correctly.

First, on the ICGEM website, you can download a file in gfc format (which works fine for me with pooch) or a zipped version. If I copy the link from the website for the zipped version of EGM2008, I get

http://icgem.gfz-potsdam.de/getmodel/zip/c50128797a9cb62e936337c890e4425f03f0461d7329b09a8cc8561504465340

Using this link in a browser downloads and saves the file: EGM2008.zip (from which I computed the sha256 hash).

Using pooch, the file is downloaded and saved under the filename

d99404d2e294332575026111bd03dbf3-c50128797a9cb62e936337c890e4425f03f0461d7329b09a8cc8561504465340

Pooch however complains that the hash of the file doesn't match

ValueError: SHA256 hash of downloaded file (d99404d2e294332575026111bd03dbf3-c50128797a9cb62e936337c890e4425f03f0461d7329b09a8cc8561504465340) does not match the known hash: expected sha256:9393a9100a61bab4353d8f8d429cbc3b344153690adfbf5ac678eec92ab9fdef but got 92d03699ad51510b4faf815a9c3c59db8211c9a8d18c576717a90a4ece493153. Deleted download for safety. The downloaded file may have been corrupted or the known hash may be outdated.

I do not want to unzip the file (it will be unzipped on the fly when needed).

Any ideas?

Here is the code

from pooch import retrieve, HTTPDownloader

fname = retrieve(
    url="http://icgem.gfz-potsdam.de/getmodel/zip/c50128797a9cb62e936337c890e4425f03f0461d7329b09a8cc8561504465340",
    known_hash="sha256:9393a9100a61bab4353d8f8d429cbc3b344153690adfbf5ac678eec92ab9fdef",
    downloader=HTTPDownloader(progressbar=True),
)

leouieda commented Jun 18, 2020

Hi @MarkWieczorek, I suspect that the URL you've been given actually redirects to the file URL. Pooch has no way of guessing that it does that, so it is downloading the redirect page instead of the file. In retrieve you can pass None as the hash to skip the check. If you inspect the file, it will probably be an HTML page.

To get around that, you need to first make a request to get the actual file URL and then pass that to the downloader. There is an example here https://www.fatiando.org/pooch/latest/usage.html#custom-downloaders (you can skip the authentication part)
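
One way to test this suspicion, assuming the file was fetched with known_hash=None so that it is kept on disk, is to look at what was actually saved. The sketch below uses only the standard library; describe_download is a hypothetical helper name, not part of Pooch.

```python
import zipfile
from pathlib import Path


def describe_download(path):
    """Roughly classify a downloaded file as 'zip', 'html', or 'unknown'."""
    path = Path(path)
    # is_zipfile checks for the zip end-of-central-directory record.
    if zipfile.is_zipfile(path):
        return "zip"
    # An HTML redirect or error page usually starts with a doctype or tag.
    head = path.read_bytes()[:512].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"
```

If this reports "zip", the redirect explanation is unlikely and the mismatch has to come from the archive bytes themselves.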


MarkWieczorek commented Jun 18, 2020

If I unzip the file

unzip d99404d2e294332575026111bd03dbf3-c50128797a9cb62e936337c890e4425f03f0461d7329b09a8cc8561504465340

I get the correct file EGM2008.gfc. So the file is being downloaded correctly.

What is odd, though, is that the hash of the zipped file pooch downloads is

SHA256(d99404d2e294332575026111bd03dbf3-c50128797a9cb62e936337c890e4425f03f0461d7329b09a8cc8561504465340)= 6fa8ce2b598b7406955a1c10c87c7ffeca8c1e77c07590c1987f3418a9669be9

But if I download the zip file from a browser, the hash is different:

SHA256(EGM2008.zip)= 9393a9100a61bab4353d8f8d429cbc3b344153690adfbf5ac678eec92ab9fdef

I don't know how the hash could be different. I don't think that the filename affects the hash.
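
As a sanity check on that last point, here is a minimal sketch showing that a SHA256 digest depends only on the bytes in the file, not on its name. The file names and the helper same_hash_under_two_names are made up for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_hash_under_two_names(payload):
    """Write identical bytes under two file names and compare digests."""
    with tempfile.TemporaryDirectory() as tmp:
        first = Path(tmp) / "EGM2008.zip"
        second = Path(tmp) / "some-long-hashed-name"
        first.write_bytes(payload)
        second.write_bytes(payload)
        return sha256_of_file(first) == sha256_of_file(second)
```

So a differing digest means the two downloads really contain different bytes.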

Lastly: Do you know how my browser knows to save the file with the name EGM2008.zip, but pooch doesn't? The second issue for me is that the file isn't saved with the .zip extension, and this breaks things downstream for me.

MarkWieczorek commented

Ok. I just made some progress:

After downloading the "same" file from the ICGEM website, I realized that the hash of this file was different each time. I suspect that what is happening is that there is some kind of time stamp in the file. Perhaps they are zipping these files whenever they are requested. I'm going to contact them about this, as the hash problem of the zip archive is not something that we can solve.
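
The effect is easy to reproduce locally. The sketch below builds two in-memory zip archives from the same payload, changing only the member's modification time, and compares them. The payload is a placeholder, not real EGM2008 data, and the dates are arbitrary.

```python
import hashlib
import io
import zipfile


def make_zip(payload, date_time):
    """Build an in-memory zip with one member stamped with date_time."""
    buffer = io.BytesIO()
    info = zipfile.ZipInfo("EGM2008.gfc", date_time=date_time)
    info.compress_type = zipfile.ZIP_DEFLATED
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(info, payload)
    return buffer.getvalue()


def read_member(zip_bytes):
    """Return the uncompressed bytes of the single member."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.read("EGM2008.gfc")


payload = b"placeholder model coefficients\n"  # not real EGM2008 data
first = make_zip(payload, (2020, 6, 17, 10, 0, 0))
second = make_zip(payload, (2020, 6, 18, 10, 0, 0))

# The archive bytes (and so their hashes) differ, the contents do not.
archives_differ = hashlib.sha256(first).digest() != hashlib.sha256(second).digest()
contents_match = read_member(first) == read_member(second) == payload
```

The timestamp is stored in the zip headers, so an archive created at request time will generally hash differently on every download even when the data inside is identical.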

Nevertheless, I have two suggestions:

  1. If pooch could at least decode the correct filename (as mentioned above), as opposed to returning the file d99404d2e294332575026111bd03dbf3-c50128797a9cb62e936337c890e4425f03f0461d7329b09a8cc8561504465340, that would help.
  2. Another potential solution/feature would be to verify the hash of the unzipped file, instead of the zipped archive.


leouieda commented Jun 18, 2020

> Perhaps they are zipping these files whenever they are requested

Ah, that would definitely break things! For now, one workaround would be a custom downloader that unzips the file in memory before saving (which might not be ideal), or one that saves it to a temporary file and then moves the unzipped version into place. Then you can store the hash of the unzipped file.
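
A rough sketch of the in-memory variant, using only the standard library for the HTTP request. It follows the (url, output_file, pooch) calling convention that Pooch documents for custom downloaders; the names unzipping_downloader and extract_only_member are made up here, and the code assumes the archive holds exactly one file, fits in memory, and that output_file is a path rather than an open file object.

```python
import io
import urllib.request
import zipfile


def extract_only_member(zip_bytes):
    """Return the bytes of the single file inside a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = archive.namelist()
        if len(names) != 1:
            raise ValueError(f"expected exactly one member, found {len(names)}")
        return archive.read(names[0])


def unzipping_downloader(url, output_file, pooch):
    """Download a zip archive and save only its unzipped contents.

    The pooch argument is unused; it is accepted only to match the
    calling convention for custom downloaders.
    """
    with urllib.request.urlopen(url) as response:
        zip_bytes = response.read()
    with open(output_file, "wb") as stream:
        stream.write(extract_only_member(zip_bytes))
```

Passing downloader=unzipping_downloader to retrieve, together with fname="EGM2008.gfc" and the hash of the unzipped file as known_hash, should then make the hash check apply to the stable unzipped contents instead of the archive.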

> If pooch could at least decode the correct filename (as mentioned above),

My guess is that the file name is located somewhere in the HTTP GET request. Pooch takes the file name from the URL because I'm ignorant and didn't consider this use case 🙂 If using pooch.Pooch, I think it should work if you set the correct file name in the registry:

EGM2008.zip HASH http://icgem.gfz-potsdam.de/getmodel/zip/c50128797a9cb62e936337c890e4425f03f0461d7329b09a8cc8561504465340

In retrieve you can set the fname argument.

Getting the file name from the request would be difficult because right now Pooch functions are agnostic about the download method. Doing this would require knowing that the download happens over HTTP. I'll keep this in mind, but it might require a lot of refactoring of the code. I'm open to any suggestions, though.
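
For reference, browsers typically take the download name from the Content-Disposition header of the HTTP response. Whether ICGEM sends that header is an assumption here. A minimal sketch of parsing such a header value with the standard library (filename_from_content_disposition is a hypothetical helper, not a Pooch function):

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Extract the filename parameter from a Content-Disposition value.

    Returns None when the header is missing or carries no filename.
    """
    if not header_value:
        return None
    message = Message()
    message["Content-Disposition"] = header_value
    return message.get_filename()
```

A name recovered this way could be passed to retrieve through its fname argument, which keeps the HTTP-specific logic outside Pooch itself.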

> Another potential solution/feature would be to verify the hash of the unzipped file, instead of the zipped archive.

This is also tricky because we don't want to touch the file if the hash doesn't match (meaning that it could be corrupted). But it can be done as stated above. I would be hesitant to make this easy, though.


MarkWieczorek commented Jun 19, 2020

Here is the response I got from ICGEM.

  • The zip files are now created on the fly, but in the future they may cache these.
  • In the future they may use gz compression, where the timestamp wouldn't change.
  • They fixed the URLs so that they include the model name with the .zip extension.
  • They mentioned that the first part of the filename is a hash of the unzipped file. (I don't think that this is of use for us though).

So, the only thing that pooch could do at this point is

  • Unzip the file, and then compare the hash of the unzipped file with a known hash.
  • Add a warning to the documentation that zip files contain timestamps, which might hinder comparing hashes.
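
The first option can also be done outside Pooch, after a download with known_hash=None. A minimal sketch, assuming the known hash is the SHA256 of the unzipped file; the helper names are made up for illustration. The member is hashed as a stream, so nothing is extracted to disk before the comparison.

```python
import hashlib
import zipfile


def sha256_of_zip_member(zip_path, member=None, chunk_size=65536):
    """Hash one member of a zip archive without extracting it to disk.

    If member is None, the archive must contain exactly one file.
    """
    with zipfile.ZipFile(zip_path) as archive:
        if member is None:
            names = archive.namelist()
            if len(names) != 1:
                raise ValueError("archive does not hold exactly one member")
            member = names[0]
        digest = hashlib.sha256()
        with archive.open(member) as stream:
            for chunk in iter(lambda: stream.read(chunk_size), b""):
                digest.update(chunk)
    return digest.hexdigest()


def member_matches(zip_path, known_sha256, member=None):
    """Compare the member's digest with a known hex digest."""
    return sha256_of_zip_member(zip_path, member) == known_sha256.lower()
```

Because the digest covers only the uncompressed member, it stays the same no matter when the archive was created.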

I understand why this is not ideal. Given that I am now just going to download the unzipped files and wait until they implement gz, I would be ok with it if you want to close this issue.

leouieda commented

@MarkWieczorek yep, these little tricks should be documented somewhere. We don't have a good place for them in the docs at this point. See #188

SHTools + ICGEM is going to be awesome!
