Support DOIs with slashes in the filename #337

santisoler · 2023-01-30T20:28:19Z

Allow DOI downloaders to work with files uploaded to Zenodo that have a forward slash "/" in their name. These special cases are generated by the Zenodo-GitHub integration service, where the first portion of the filename matches the GitHub handle of the repository owned and the last portion matches the repository name. When creating the url to these repositories, the forward slashes that are part of the filename should be led by a backslash. For example: url = "doi:10.123/zenodo.123123/my-file\/has-a-slash.zip".

Relevant issues/PRs:

Fixes #336

Todo

In case this patch works, we need to:

Try to add a file with a slash to our test repos in:
- Zenodo: 10.5281/zenodo.7632643
  - It was created through the Zenodo-GitHub integration service. By default the zip file includes the user handle, a forward slash and the repo name.
Update tests to see if we can download them
Improve docs to instruct on how to use this feature
- Extend the docstring of DOIDownloader
- Extend the docstring of retrieve
- Add an example in the user guide

Allow DOI downloaders to work with files uploaded to services like Zenodo, Figshare or Dataverse that have a forward slash in their name.

santisoler · 2023-01-30T20:43:53Z

@djmannion

With the patch in this PR you should be able to download your zip file that have a slash in its name through:

import pooch

raw_data_path = pooch.retrieve(
    url=r"doi:10.5281/zenodo.7347607/Wild-Minds\/GreatApeDictionary-v1.1.zip",
    known_hash="md5:8c40d1bce25619548f4daa16d63f36a3",
)

I decided to use a leading backslash before the offending slash in the filename to separate it from the other slashes that are part of the url. In order to use a backslash in a Python string without meaning to use it as a escape character we need to use r-strings with r"".

It works locally for me, does it for you?

djmannion · 2023-01-31T00:21:06Z

Yes, that works for me now - thanks!

santisoler · 2023-01-31T17:24:55Z

~~@leouieda would you like to take a look at this approach for solving this issue?~~ I think the backslash is a good compromise, but would love to hear your input.

It worth noting that I haven't changed the filename of the cached file. Currently it will contain only the portion of the original filename that follows the last slash (e.g. if the filename is my/file.zip, the cached file will be <HASH>-file.zip).
I think this is ok as it is, considering that when downloading the file from Zenodo webapp it uses this portion to name the file. But maybe we would like to add a test for this behaviour.

Use our own Zenodo repository (10.5281/zenodo.7632643) to test if parse_url works as expected with the file name has a slash.

Add a new test that checks if DOIDownloaders works as expected when the filename contains a slash. Uses our own Zenodo repository.

…-with-slashes

santisoler · 2023-02-11T19:10:50Z

Well, this is trickier to solve than I expected. This patch works well with pooch.retrieve since the user is manually creating the string with the backslash. But any other method will be prone to fail.

For example, if we create a registry.txt file:

santisoler\/pooch-test-data-v1.zip md5:6cdda261f5646a4089966fd0bf505233

And we want to load it after creating a Pooch object:

import pooch

pup = pooch.create(
    path=pooch.os_cache("testing-zenodo"),
    base_url="doi:10.5281/zenodo.7632643/",
    registry=None,
)
registry_file = "registry.txt"
pup.load_registry(registry_file)

Then the following fetch call will fail, since the file is loaded in a way that the backslash is interpreted as an escape character, complaining that the file doesn't exist in the registry:

pup.fetch(r"santisoler\/pooch-test-data-v1.zip")

And this other call will also fail because the url will not be parse correctly:

pup.fetch(r"santisoler/pooch-test-data-v1.zip")

Moreover, populating the registry through the repository API (see #319) will also fail at some point.

Is there any other way we can solve this considering that these filenames are only present in Zenodo?

santisoler · 2023-02-11T20:02:51Z

I think this patch has a flawed design: having to introduce the backslash makes it more difficult to use by the end user, but also creates a lot of corner cases that are prone to future errors.

I came out with a better solution in #340, where the parse_url function has a special case for Zenodo dois. With this new design there's no need to introduce the backslash, making it to work flawlessly with both retrieve and a Pooch instance with a loaded registry file. Moreover, I suspect it will also work with the new feature to populate the registry in #319.

Therefore, I'm closing this PR.

Support DOIs with slashes in the filename

8192760

Allow DOI downloaders to work with files uploaded to services like Zenodo, Figshare or Dataverse that have a forward slash in their name.

Fix typo in test

9bbf8a7

santisoler requested a review from leouieda January 31, 2023 17:25

santisoler added 4 commits February 11, 2023 09:51

Merge branch 'main' into doi-with-slashes

7dbc55b

Update test for parsing url using own Zenodo repo

2e34f60

Use our own Zenodo repository (10.5281/zenodo.7632643) to test if parse_url works as expected with the file name has a slash.

Test DOIDownloader when the filename has a slash

6c3ecf3

Add a new test that checks if DOIDownloaders works as expected when the filename contains a slash. Uses our own Zenodo repository.

Merge branch 'doi-with-slashes' of github.com:fatiando/pooch into doi…

b026fce

…-with-slashes

santisoler mentioned this pull request Feb 11, 2023

Support Zenodo urls with slashes in the filename #340

Merged

santisoler closed this Feb 11, 2023

santisoler mentioned this pull request Feb 14, 2023

Add test for load_registry_from_doi when the doi has a slash in the filename #341

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support DOIs with slashes in the filename #337

Support DOIs with slashes in the filename #337

santisoler commented Jan 30, 2023 •

edited

Loading

santisoler commented Jan 30, 2023 •

edited

Loading

djmannion commented Jan 31, 2023

santisoler commented Jan 31, 2023 •

edited

Loading

santisoler commented Feb 11, 2023

santisoler commented Feb 11, 2023

Support DOIs with slashes in the filename #337

Support DOIs with slashes in the filename #337

Conversation

santisoler commented Jan 30, 2023 • edited Loading

santisoler commented Jan 30, 2023 • edited Loading

djmannion commented Jan 31, 2023

santisoler commented Jan 31, 2023 • edited Loading

santisoler commented Feb 11, 2023

santisoler commented Feb 11, 2023

santisoler commented Jan 30, 2023 •

edited

Loading

santisoler commented Jan 30, 2023 •

edited

Loading

santisoler commented Jan 31, 2023 •

edited

Loading