Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OPT+RF of zarr downloads: do not wait for full files listing + compute %done from total zarr size #1443

Merged
merged 4 commits into from
May 30, 2024

Conversation

yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented May 14, 2024

Individual commits have more description.

Closes: #1407 . @kabilar could you please check if works for you ok'ish?

edit 1: I think initial OPT commit is not entirely complete -- somewhere in the sandwich of all the generators there is still waiting for an iterable to exhaust itself.

…ating downloads

Delay assignment of file_qty until it is known, but otherwise proceed to
initiating downloads
@yarikoptic yarikoptic added the patch Increment the patch version when merged label May 14, 2024
Since maxsize is dynamically computed as we go through the files.
The idea, I guess, was that it would grow rapidly before actual
downloads commense but it is not the case, so we endup with done%
being always close to 100% since we get those reports on final
downloads completed close to when individual files are downloaded.

So this should close #1407 .

But for total zarr file to be used, we needed to account also for
skipped files.  I added reporting of sizes for skipped files as well.
It seems there is no negative side effect on regular files download.
So now for the %done of zarr we might be getting to 100% of original
size having downloaded nothing.  But IMHO it is ok since user does
not care as much of how many "subparts" are downloaded, but rather
to have adequate progress report back.

There also could be side effects if  -e skip  and we skip
download of some updated files which would be smaller than the local
ones so altogether we would get over 100% total at the end.
Copy link

codecov bot commented May 14, 2024

Codecov Report

Attention: Patch coverage is 96.15385% with 1 line in your changes missing coverage. Please review.

Project coverage is 88.63%. Comparing base (722f1b6) to head (08a4050).
Report is 54 commits behind head on master.

Files with missing lines Patch % Lines
dandi/download.py 96.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1443      +/-   ##
==========================================
+ Coverage   88.61%   88.63%   +0.01%     
==========================================
  Files          77       77              
  Lines       10563    10572       +9     
==========================================
+ Hits         9360     9370      +10     
+ Misses       1203     1202       -1     
Flag Coverage Δ
unittests 88.63% <96.15%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

dandi/download.py Outdated Show resolved Hide resolved
dandi/download.py Outdated Show resolved Hide resolved
dandi/download.py Outdated Show resolved Hide resolved
final_out: dict | None = None
with interleave(
[pairing(p, gen) for p, gen in download_gens.items()],
downloads_gen(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interleave() evaluates/consumes the iterable in its entirety before returning, so you haven't accomplished anything by making downloads_gen() into a generator.

In order to add downloads to interleave() as they're yielded by asset.iterfiles(), you'll need to:

  • Replace the call to interleave() with a directly-constructed Interleaver instance
  • In separate thread, call the Interleaver's submit() method whenever a new file is yielded by asset.iterfiles(), and then call its finalize() method at end of iteration.

See interleave's docs for more information.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would you be so kind to approach this or otherwise address this issue the way you like it while also ideally reducing that delay / not demanding full listing to arrive to start the download?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or may be we could merge this (as resolves the issue) and then improve on top of it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting the error handling right for the solution I suggested would be very difficult, so we should probably just merge the current version for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, let's proceed! Do you want me to create a dedicated issue or you keep this on your plate?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create a dedicated issue.

Co-authored-by: John T. Wodder II <jwodder@users.noreply.github.com>
@yarikoptic yarikoptic merged commit 59f5da0 into master May 30, 2024
28 checks passed
@yarikoptic yarikoptic deleted the bf-rf-zarr-download branch May 30, 2024 13:15
@kabilar
Copy link
Member

kabilar commented May 30, 2024

Thank you @yarikoptic and @jwodder.

Copy link

🚀 PR was released in 0.62.2 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cmd-download patch Increment the patch version when merged released zarr
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DONE% value when downloading a Zarr asset is not correct
4 participants