-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New file exists #394
New file exists #394
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will work as is, just a couple suggestions
Co-authored-by: jkgoodrich <33063077+jkgoodrich@users.noreply.github.com>
pbar.update(1) | ||
return x | ||
|
||
return unapplied_function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit of context on this rigamarole:
When using asyncio, it's important to not have too many live coroutines. Every time you call an async function, you create a coroutine. For example, calling create_unapplied_function('foo')
does not create a coroutine because we never invoke an async function. On the other hand, create_unapplied_function('foo')()
does create a coroutine because it invokes the returned async function. A coroutine completes when we await
it (or use one of the other asyncio waiting primitives, like asyncio.wait
).
For this reason, bounded_gather
accepts async functions as arguments instead of directly accepting coroutines. For example:
# this creates a coroutine for each argument to bad_bounded_gather
await bad_bounded_gather(async_do_something('argument'), ...)
# this allows bounded gather to bound the number of live coroutines (by waiting for
# previous coroutines to complete before invoking more)
async def async_delayed_do_something():
return await async_do_something('argument')
await bounded_gather(async_delayed_do_something, ...)
In the first case, we create all the coroutines before even calling bad_bounded_gather
. In the second case, bounded_gather
can decide when to create coroutines (by invoking one of its arguments).
Usually, Hail team uses this pattern:
await bounded_gather(*[functools.partial(async_file_exists, fs, fname)
for fname in fnames],
parallelism=parallelism)
But we also want to update the progress bar after we check for file existence. I should maybe have called create_unapplied_function
check_existence_and_update_pbar_thunk
. I use the term "thunk" to refer to these zero-argument functions that delay the creation of coroutines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for this detailed description Dan!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seconded, thank you for this!! I added a small note to the docstring, but let me know if you think I should add a longer summary @jkgoodrich
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Added function to check if large number of files exist (current function in repo is too slow to check more than a few files). Dan King wrote this function for when I needed to check a large number of files in a batch job