Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support DOIs with slashes in the filename #337

Closed
wants to merge 6 commits into from
Closed

Conversation

santisoler
Copy link
Member

@santisoler santisoler commented Jan 30, 2023

Allow DOI downloaders to work with files uploaded to Zenodo that have a forward slash "/" in their name. These special cases are generated by the Zenodo-GitHub integration service, where the first portion of the filename matches the GitHub handle of the repository owned and the last portion matches the repository name. When creating the url to these repositories, the forward slashes that are part of the filename should be led by a backslash. For example: url = "doi:10.123/zenodo.123123/my-file\/has-a-slash.zip".

Relevant issues/PRs:

Fixes #336

Todo

In case this patch works, we need to:

  • Try to add a file with a slash to our test repos in:
    • Zenodo: 10.5281/zenodo.7632643
      • It was created through the Zenodo-GitHub integration service. By default the zip file includes the user handle, a forward slash and the repo name.
  • Update tests to see if we can download them
  • Improve docs to instruct on how to use this feature
    • Extend the docstring of DOIDownloader
    • Extend the docstring of retrieve
    • Add an example in the user guide

Allow DOI downloaders to work with files uploaded to services like
Zenodo, Figshare or Dataverse that have a forward slash in their name.
@santisoler
Copy link
Member Author

santisoler commented Jan 30, 2023

@djmannion

With the patch in this PR you should be able to download your zip file that have a slash in its name through:

import pooch

raw_data_path = pooch.retrieve(
    url=r"doi:10.5281/zenodo.7347607/Wild-Minds\/GreatApeDictionary-v1.1.zip",
    known_hash="md5:8c40d1bce25619548f4daa16d63f36a3",
)

I decided to use a leading backslash before the offending slash in the filename to separate it from the other slashes that are part of the url. In order to use a backslash in a Python string without meaning to use it as a escape character we need to use r-strings with r"".

It works locally for me, does it for you?

@djmannion
Copy link

Yes, that works for me now - thanks!

@santisoler
Copy link
Member Author

santisoler commented Jan 31, 2023

@leouieda would you like to take a look at this approach for solving this issue? I think the backslash is a good compromise, but would love to hear your input.

It worth noting that I haven't changed the filename of the cached file. Currently it will contain only the portion of the original filename that follows the last slash (e.g. if the filename is my/file.zip, the cached file will be <HASH>-file.zip).
I think this is ok as it is, considering that when downloading the file from Zenodo webapp it uses this portion to name the file. But maybe we would like to add a test for this behaviour.

Use our own Zenodo repository (10.5281/zenodo.7632643) to test if
parse_url works as expected with the file name has a slash.
Add a new test that checks if DOIDownloaders works as expected when the
filename contains a slash. Uses our own Zenodo repository.
@santisoler
Copy link
Member Author

Well, this is trickier to solve than I expected. This patch works well with pooch.retrieve since the user is manually creating the string with the backslash. But any other method will be prone to fail.

For example, if we create a registry.txt file:

santisoler\/pooch-test-data-v1.zip md5:6cdda261f5646a4089966fd0bf505233

And we want to load it after creating a Pooch object:

import pooch

pup = pooch.create(
    path=pooch.os_cache("testing-zenodo"),
    base_url="doi:10.5281/zenodo.7632643/",
    registry=None,
)
registry_file = "registry.txt"
pup.load_registry(registry_file)

Then the following fetch call will fail, since the file is loaded in a way that the backslash is interpreted as an escape character, complaining that the file doesn't exist in the registry:

pup.fetch(r"santisoler\/pooch-test-data-v1.zip")

And this other call will also fail because the url will not be parse correctly:

pup.fetch(r"santisoler/pooch-test-data-v1.zip")

Moreover, populating the registry through the repository API (see #319) will also fail at some point.

Is there any other way we can solve this considering that these filenames are only present in Zenodo?

@santisoler
Copy link
Member Author

I think this patch has a flawed design: having to introduce the backslash makes it more difficult to use by the end user, but also creates a lot of corner cases that are prone to future errors.

I came out with a better solution in #340, where the parse_url function has a special case for Zenodo dois. With this new design there's no need to introduce the backslash, making it to work flawlessly with both retrieve and a Pooch instance with a loaded registry file. Moreover, I suspect it will also work with the new feature to populate the registry in #319.

Therefore, I'm closing this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

error downloading file from sub-directory in zenodo repository
2 participants