Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enhance download_and_extract #8216

Open
wants to merge 6 commits into
base: dev
Choose a base branch
from

Conversation

Jerome-Hsieh
Copy link

@Jerome-Hsieh Jerome-Hsieh commented Nov 16, 2024

Fixes #5463

Description

According to issue, the error messages are not very intuitive.
I think maybe we can check if the file name matches the downloaded file’s base name before starting the download.
If it doesn’t match, it will notify user.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

@ericspod
Copy link
Member

Hi @Jerome-Hsieh If you're having issues with putting this PR together we can discuss how to resolve them, but please do not close and then open a new PR. Thanks for the contribution!

with tempfile.TemporaryDirectory() as tmp_dir:
filename = filepath or Path(tmp_dir, _basename(url)).resolve()
if filepath:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't get the idea for this change here. Seems equivalent to me.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change I want to make sure if user don't set the filepath, the process can still work by using the default path

if filepath:
FilepathExtenstion = ''.join(Path(".", _basename(filepath)).resolve().suffixes)
if urlFilenameExtension != FilepathExtenstion:
raise NotImplementedError(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, the issue here is that if filepath can directly get the name of the downloaded file, the download_and_extract function would work as expected, rather than only raising an error at this point.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, but my perspective is that an error with the filepath occurs when the program tries to write the downloaded file to an invalid filepath.
E.g.,filepath='./test' is invalid, filepath='./test.tar.gz' is valid
So I would like to validate the filepath at the very beginning to ensure it is valid.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know your point here.
Do you think when user specify filepath='./test', it would be better to give a warning and automatically catch the filename?

cc @ericspod @Nic-Ma

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would warn instead of forcing extensions, it takes control away from people to force things like this that aren't strictly necessary.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I have a question.
Is it necessary for the user to specify the file extension when setting the filepath for the downloaded file?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user gives a path without extension then you'll get a file with no extension, the user would have to account for that and provide an extension. You could add an extension if none is given (and warn that this happened), but if the user has given a different extension than what's expected, this should be a warning and not an error. If you add an extension you should also be careful not to overwrite an existing file. I realise this is getting rather complicated now, sorry!

monai/apps/utils.py Outdated Show resolved Hide resolved
monai/apps/utils.py Outdated Show resolved Hide resolved
Signed-off-by: jerome_Hsieh <jerome910810@gmail.com>
Signed-off-by: jerome_Hsieh <jerome910810@gmail.com>
Signed-off-by: jerome_Hsieh <jerome910810@gmail.com>
Signed-off-by: jerome_Hsieh <jerome910810@gmail.com>
@Jerome-Hsieh
Copy link
Author

@ericspod @KumoLiu
Here's newest change. like we discussed before. I check the filepath every time at the very beginning.
If no extension is provided, the system will automatically add the appropriate file extension with a warning.
A warning will also appear if provided extension doesn't match the downloaded file's extension .

@KumoLiu
Copy link
Contributor

KumoLiu commented Dec 5, 2024

@ericspod @KumoLiu Here's newest change. like we discussed before. I check the filepath every time at the very beginning. If no extension is provided, the system will automatically add the appropriate file extension with a warning. A warning will also appear if provided extension doesn't match the downloaded file's extension .

Hi @Jerome-Hsieh, thank you for the quick update. I tested your latest changes, but neither "." nor "./test" as the file path seems to work. Is this the expected behavior? In my opinion, both should work. The logic could be adjusted so that if the file path is a directory and not empty, it automatically captures the name and downloads the file into the specified directory. What do you think?

url = "https://github.com/Project-MONAI/MONAI-extra-test-data/releases/download/0.8.1/MedNIST.tar.gz"
monai.apps.utils.download_and_extract(url, filepath="./test")

@Jerome-Hsieh
Copy link
Author

Hi @KumoLiu my test results:

>>> url = "https://github.com/Project-MONAI/MONAI-extra-test-data/releases/download/0.8.1/MedNIST.tar.gz"
>>> monai.apps.utils.download_and_extract(url, filepath=".")
2024-12-06 18:06:56,333 - INFO - Expected md5 is None, skip md5 check for file ..
2024-12-06 18:06:56,334 - INFO - File exists: ., skipped downloading.
2024-12-06 18:06:56,334 - INFO - Non-empty folder exists in ., skipped extracting.

and

>>> monai.apps.utils.download_and_extract(url, filepath="./test")
2024-12-06 18:08:41,246 - WARNING - filepath=./test, which missing file extension. Auto-appending extension to: ./test.tar.gz
test.tar.gz: 59.0MB [00:10, 6.07MB/s]
2024-12-06 18:08:51,443 - INFO - Downloaded: test.tar.gz
2024-12-06 18:08:51,443 - INFO - Expected md5 is None, skip md5 check for file test.tar.gz.
2024-12-06 18:08:51,443 - INFO - Writing into directory: ..

If the results like above, that's expected behavior.

Your opinion sounds like
url=example.com/file.tar.gz, filepath=./test
the program will work :./test/file.tar.gz.
If I don't misunderstanding your thoughts, this method although can automatically captures the name, not only capture the strange filename make user confused if the URL don't have appropriate naming, but also can't allow user set the filename they want.

@mingxin-zheng
Copy link
Contributor

Regarding file downloading from URL, sometimes we can infer the name from the content deposition. Here is one example snippet: https://github.com/Project-MONAI/VLM/blob/7be688aa457a1806f908eb758f2f3ee816fea017/m3/demo/experts/utils.py#L135 @KumoLiu

@Jerome-Hsieh
Copy link
Author

Thank @mingxin-zheng provide information to find file name.
But I want to say that at first my initial approach to this issue that filepath should be a compressed file rather than a directory.
I wanted to focus on making filepath valid. Method provided by @KumoLiu it's great, but which somewhat deviates from my original intent.
Sorry for not providing a clear explanation.

Signed-off-by: jerome_Hsieh <jerome910810@gmail.com>
@Jerome-Hsieh
Copy link
Author

Hi @KumoLiu the newest commit I change the logic that if the file path is a directory and not empty, it will automatically capture the name from content deposition, if it doses't have any file name, will use url basename and download the file into the specified directory then gives a warning.
Besides, I also add if set the wrong file extension, it will raise the error and stop downloading.
Please let me know if you have any suggestions. Thx!

@KumoLiu
Copy link
Contributor

KumoLiu commented Dec 16, 2024

Hi @KumoLiu the newest commit I change the logic that if the file path is a directory and not empty, it will automatically capture the name from content deposition, if it doses't have any file name, will use url basename and download the file into the specified directory then gives a warning. Besides, I also add if set the wrong file extension, it will raise the error and stop downloading. Please let me know if you have any suggestions. Thx!

Hi @Jerome-Hsieh, thank you for the update. After our discussion in the development meeting, I realized that there is an output_dir argument in download_and_extract. I apologize for any confusion my previous comments may have caused. It would be great to proceed by considering only the extension as before. Sorry again for the misunderstanding.

@@ -27,6 +28,8 @@
from urllib.parse import urlparse
from urllib.request import urlopen, urlretrieve

import requests
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the ci error here: https://github.com/Project-MONAI/MONAI/actions/runs/12338243392/job/34432967743?pr=8216
Please consider using optional_import here and

requests_get, has_requests = optional_import("requests", name="get")

And skip in the related tests.

@SkipIfNoModule("requests")

Please also include related test! Thanks again.

@Jerome-Hsieh
Copy link
Author

Jerome-Hsieh commented Dec 16, 2024

Hi @KumoLiu the newest commit I change the logic that if the file path is a directory and not empty, it will automatically capture the name from content deposition, if it doses't have any file name, will use url basename and download the file into the specified directory then gives a warning. Besides, I also add if set the wrong file extension, it will raise the error and stop downloading. Please let me know if you have any suggestions. Thx!

Hi @Jerome-Hsieh, thank you for the update. After our discussion in the development meeting, I realized that there is an output_dir argument in download_and_extract. I apologize for any confusion my previous comments may have caused. It would be great to proceed by considering only the extension as before. Sorry again for the misunderstanding.

Hi @KumoLiu I see, and what do you think about previous version a9a0171 I commit ?
Could you please identify any areas that require improvement or could be further enhance ? Thx!

@KumoLiu
Copy link
Contributor

KumoLiu commented Dec 16, 2024

Hi @KumoLiu I see, and what do you think about previous version a9a0171 I commit ? Could you please identify any areas that require improvement or could be further enhance ? Thx!

Hi @Jerome-Hsieh, a9a0171 looks good. The only update needed is how to extract the file extension from the URL.

For example, given a URL like "https://drive.google.com/u/1/uc?id=1KntZge40tAHgyXmHYVqZZ5d2p_4Qr2l5&export=download", you can use get_filename_from_url to determine the exact extension based on the filename. What do you think? Please also include the tests, thanks!

@KumoLiu KumoLiu closed this Dec 16, 2024
@KumoLiu KumoLiu reopened this Dec 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

parameter filepath of download_and_extract
4 participants