enhance download_and_extract #8216

Jerome-Hsieh · 2024-11-16T20:49:42Z

Description

According to issue, the error messages are not very intuitive.
I think maybe we can check if the file name matches the downloaded file’s base name before starting the download.
If it doesn’t match, it will notify user.

Types of changes

Non-breaking change (fix or new feature that would not break existing functionality).
Breaking change (fix or new feature that would cause existing functionality to change).
New tests added to cover the changes.
Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
In-line docstrings updated.
Documentation updated, tested make html command in the docs/ folder.

ericspod · 2024-11-16T21:06:58Z

Hi @Jerome-Hsieh If you're having issues with putting this PR together we can discuss how to resolve them, but please do not close and then open a new PR. Thanks for the contribution!

KumoLiu · 2024-11-18T04:18:01Z

monai/apps/utils.py

    with tempfile.TemporaryDirectory() as tmp_dir:
-        filename = filepath or Path(tmp_dir, _basename(url)).resolve()
+        if filepath:


Didn't get the idea for this change here. Seems equivalent to me.

The change I want to make sure if user don't set the filepath, the process can still work by using the default path

KumoLiu · 2024-11-18T04:25:18Z

monai/apps/utils.py

+    if filepath:
+        FilepathExtenstion = ''.join(Path(".", _basename(filepath)).resolve().suffixes)
+        if urlFilenameExtension != FilepathExtenstion:
+            raise NotImplementedError(


In my opinion, the issue here is that if filepath can directly get the name of the downloaded file, the download_and_extract function would work as expected, rather than only raising an error at this point.

I see, but my perspective is that an error with the filepath occurs when the program tries to write the downloaded file to an invalid filepath.
E.g.,filepath='./test' is invalid, filepath='./test.tar.gz' is valid
So I would like to validate the filepath at the very beginning to ensure it is valid.

I know your point here.
Do you think when user specify filepath='./test', it would be better to give a warning and automatically catch the filename?

cc @ericspod @Nic-Ma

I would warn instead of forcing extensions, it takes control away from people to force things like this that aren't strictly necessary.

But I have a question.
Is it necessary for the user to specify the file extension when setting the filepath for the downloaded file?

If the user gives a path without extension then you'll get a file with no extension, the user would have to account for that and provide an extension. You could add an extension if none is given (and warn that this happened), but if the user has given a different extension than what's expected, this should be a warning and not an error. If you add an extension you should also be careful not to overwrite an existing file. I realise this is getting rather complicated now, sorry!

monai/apps/utils.py

Signed-off-by: jerome_Hsieh <jerome910810@gmail.com>

Jerome-Hsieh · 2024-12-04T16:35:51Z

@ericspod @KumoLiu
Here's newest change. like we discussed before. I check the filepath every time at the very beginning.
If no extension is provided, the system will automatically add the appropriate file extension with a warning.
A warning will also appear if provided extension doesn't match the downloaded file's extension .

KumoLiu · 2024-12-05T05:41:06Z

@ericspod @KumoLiu Here's newest change. like we discussed before. I check the filepath every time at the very beginning. If no extension is provided, the system will automatically add the appropriate file extension with a warning. A warning will also appear if provided extension doesn't match the downloaded file's extension .

Hi @Jerome-Hsieh, thank you for the quick update. I tested your latest changes, but neither "." nor "./test" as the file path seems to work. Is this the expected behavior? In my opinion, both should work. The logic could be adjusted so that if the file path is a directory and not empty, it automatically captures the name and downloads the file into the specified directory. What do you think?

url = "https://github.com/Project-MONAI/MONAI-extra-test-data/releases/download/0.8.1/MedNIST.tar.gz"
monai.apps.utils.download_and_extract(url, filepath="./test")

Jerome-Hsieh · 2024-12-06T10:53:25Z

Hi @KumoLiu my test results：

>>> url = "https://github.com/Project-MONAI/MONAI-extra-test-data/releases/download/0.8.1/MedNIST.tar.gz"
>>> monai.apps.utils.download_and_extract(url, filepath=".")
2024-12-06 18:06:56,333 - INFO - Expected md5 is None, skip md5 check for file ..
2024-12-06 18:06:56,334 - INFO - File exists: ., skipped downloading.
2024-12-06 18:06:56,334 - INFO - Non-empty folder exists in ., skipped extracting.

and

>>> monai.apps.utils.download_and_extract(url, filepath="./test")
2024-12-06 18:08:41,246 - WARNING - filepath=./test, which missing file extension. Auto-appending extension to: ./test.tar.gz
test.tar.gz: 59.0MB [00:10, 6.07MB/s]
2024-12-06 18:08:51,443 - INFO - Downloaded: test.tar.gz
2024-12-06 18:08:51,443 - INFO - Expected md5 is None, skip md5 check for file test.tar.gz.
2024-12-06 18:08:51,443 - INFO - Writing into directory: ..

If the results like above, that's expected behavior.

Your opinion sounds like
url=example.com/file.tar.gz, filepath=./test
the program will work ：./test/file.tar.gz.
If I don't misunderstanding your thoughts, this method although can automatically captures the name, not only capture the strange filename make user confused if the URL don't have appropriate naming, but also can't allow user set the filename they want.

mingxin-zheng · 2024-12-06T15:47:03Z

Regarding file downloading from URL, sometimes we can infer the name from the content deposition. Here is one example snippet: https://github.com/Project-MONAI/VLM/blob/7be688aa457a1806f908eb758f2f3ee816fea017/m3/demo/experts/utils.py#L135 @KumoLiu

Jerome-Hsieh · 2024-12-07T08:30:29Z

Thank @mingxin-zheng provide information to find file name.
But I want to say that at first my initial approach to this issue that filepath should be a compressed file rather than a directory.
I wanted to focus on making filepath valid. Method provided by @KumoLiu it's great, but which somewhat deviates from my original intent.
Sorry for not providing a clear explanation.

Signed-off-by: jerome_Hsieh <jerome910810@gmail.com>

Jerome-Hsieh · 2024-12-15T11:05:19Z

Hi @KumoLiu the newest commit I change the logic that if the file path is a directory and not empty, it will automatically capture the name from content deposition, if it doses't have any file name, will use url basename and download the file into the specified directory then gives a warning.
Besides, I also add if set the wrong file extension, it will raise the error and stop downloading.
Please let me know if you have any suggestions. Thx!

for more information, see https://pre-commit.ci

KumoLiu · 2024-12-16T08:44:29Z

Hi @KumoLiu the newest commit I change the logic that if the file path is a directory and not empty, it will automatically capture the name from content deposition, if it doses't have any file name, will use url basename and download the file into the specified directory then gives a warning. Besides, I also add if set the wrong file extension, it will raise the error and stop downloading. Please let me know if you have any suggestions. Thx!

Hi @Jerome-Hsieh, thank you for the update. After our discussion in the development meeting, I realized that there is an output_dir argument in download_and_extract. I apologize for any confusion my previous comments may have caused. It would be great to proceed by considering only the extension as before. Sorry again for the misunderstanding.

KumoLiu · 2024-12-16T08:47:01Z

monai/apps/utils.py

@@ -27,6 +28,8 @@
 from urllib.parse import urlparse
 from urllib.request import urlopen, urlretrieve

+import requests


Regarding the ci error here: https://github.com/Project-MONAI/MONAI/actions/runs/12338243392/job/34432967743?pr=8216
Please consider using optional_import here and

MONAI/monai/bundle/scripts.py

Line 61 in 21920a3

requests_get, has_requests = optional_import("requests", name="get")

And skip in the related tests.

MONAI/tests/test_bundle_get_data.py

Line 46 in 21920a3

@SkipIfNoModule("requests")

Please also include related test! Thanks again.

Jerome-Hsieh · 2024-12-16T09:21:12Z

Hi @KumoLiu the newest commit I change the logic that if the file path is a directory and not empty, it will automatically capture the name from content deposition, if it doses't have any file name, will use url basename and download the file into the specified directory then gives a warning. Besides, I also add if set the wrong file extension, it will raise the error and stop downloading. Please let me know if you have any suggestions. Thx!

Hi @Jerome-Hsieh, thank you for the update. After our discussion in the development meeting, I realized that there is an output_dir argument in download_and_extract. I apologize for any confusion my previous comments may have caused. It would be great to proceed by considering only the extension as before. Sorry again for the misunderstanding.

Hi @KumoLiu I see, and what do you think about previous version a9a0171 I commit ?
Could you please identify any areas that require improvement or could be further enhance ? Thx!

KumoLiu · 2024-12-16T09:30:31Z

Hi @KumoLiu I see, and what do you think about previous version a9a0171 I commit ? Could you please identify any areas that require improvement or could be further enhance ? Thx!

Hi @Jerome-Hsieh, a9a0171 looks good. The only update needed is how to extract the file extension from the URL.

For example, given a URL like "https://drive.google.com/u/1/uc?id=1KntZge40tAHgyXmHYVqZZ5d2p_4Qr2l5&export=download", you can use get_filename_from_url to determine the exact extension based on the filename. What do you think? Please also include the tests, thanks!

KumoLiu reviewed Nov 18, 2024

View reviewed changes

ericspod reviewed Nov 27, 2024

View reviewed changes

monai/apps/utils.py Outdated Show resolved Hide resolved

ericspod reviewed Nov 27, 2024

View reviewed changes

monai/apps/utils.py Outdated Show resolved Hide resolved

Jerome-Hsieh added 3 commits December 4, 2024 23:32

enhance download_and_extract

e35a663

Signed-off-by: jerome_Hsieh <jerome910810@gmail.com>

fix commit

9d3c395

Signed-off-by: jerome_Hsieh <jerome910810@gmail.com>

fix some problem

0441871

Signed-off-by: jerome_Hsieh <jerome910810@gmail.com>

Jerome-Hsieh force-pushed the issue5463 branch from 0511db5 to 0441871 Compare December 4, 2024 15:32

Enhance download_and_extract

a9a0171

Signed-off-by: jerome_Hsieh <jerome910810@gmail.com>

Enhance download_and_extract

e70e59c

Signed-off-by: jerome_Hsieh <jerome910810@gmail.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

7a26dcd

for more information, see https://pre-commit.ci

KumoLiu reviewed Dec 16, 2024

View reviewed changes

KumoLiu closed this Dec 16, 2024

KumoLiu reopened this Dec 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

enhance download_and_extract #8216

enhance download_and_extract #8216

Jerome-Hsieh commented Nov 16, 2024 •

edited

Loading

ericspod commented Nov 16, 2024

KumoLiu Nov 18, 2024

Jerome-Hsieh Nov 26, 2024

KumoLiu Nov 18, 2024

Jerome-Hsieh Nov 26, 2024

KumoLiu Nov 27, 2024

ericspod Nov 27, 2024

Jerome-Hsieh Nov 29, 2024

ericspod Dec 1, 2024

Jerome-Hsieh commented Dec 4, 2024

KumoLiu commented Dec 5, 2024

Jerome-Hsieh commented Dec 6, 2024

mingxin-zheng commented Dec 6, 2024

Jerome-Hsieh commented Dec 7, 2024

Jerome-Hsieh commented Dec 15, 2024

KumoLiu commented Dec 16, 2024

KumoLiu Dec 16, 2024

Jerome-Hsieh commented Dec 16, 2024 •

edited

Loading

KumoLiu commented Dec 16, 2024

enhance download_and_extract #8216

Are you sure you want to change the base?

enhance download_and_extract #8216

Conversation

Jerome-Hsieh commented Nov 16, 2024 • edited Loading

Description

Types of changes

ericspod commented Nov 16, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Jerome-Hsieh commented Dec 4, 2024

KumoLiu commented Dec 5, 2024

Jerome-Hsieh commented Dec 6, 2024

mingxin-zheng commented Dec 6, 2024

Jerome-Hsieh commented Dec 7, 2024

Jerome-Hsieh commented Dec 15, 2024

KumoLiu commented Dec 16, 2024

Choose a reason for hiding this comment

Jerome-Hsieh commented Dec 16, 2024 • edited Loading

KumoLiu commented Dec 16, 2024

Jerome-Hsieh commented Nov 16, 2024 •

edited

Loading

Jerome-Hsieh commented Dec 16, 2024 •

edited

Loading