Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New file exists #394

Merged
merged 10 commits into from
Jul 27, 2021
Merged

New file exists #394

merged 10 commits into from
Jul 27, 2021

Conversation

ch-kr
Copy link
Contributor

@ch-kr ch-kr commented Jul 12, 2021

Added function to check if large number of files exist (current function in repo is too slow to check more than a few files). Dan King wrote this function for when I needed to check a large number of files in a batch job

@ch-kr ch-kr requested review from mike-w-wilson, a team and jkgoodrich and removed request for mike-w-wilson and a team July 12, 2021 19:08
Copy link
Contributor

@jkgoodrich jkgoodrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will work as is, just a couple suggestions

gnomad/utils/file_utils.py Outdated Show resolved Hide resolved
gnomad/utils/file_utils.py Outdated Show resolved Hide resolved
gnomad/utils/file_utils.py Outdated Show resolved Hide resolved
gnomad/utils/file_utils.py Outdated Show resolved Hide resolved
gnomad/utils/file_utils.py Outdated Show resolved Hide resolved
ch-kr and others added 2 commits July 26, 2021 14:39
Co-authored-by: jkgoodrich <33063077+jkgoodrich@users.noreply.github.com>
@ch-kr ch-kr requested review from jkgoodrich and danking July 26, 2021 19:11
gnomad/utils/file_utils.py Outdated Show resolved Hide resolved
pbar.update(1)
return x

return unapplied_function
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit of context on this rigamarole:

When using asyncio, it's important to not have too many live coroutines. Every time you call an async function, you create a coroutine. For example, calling create_unapplied_function('foo') does not create a coroutine because we never invoke an async function. On the other hand, create_unapplied_function('foo')() does create a coroutine because it invokes the returned async function. A coroutine completes when we await it (or use one of the other asyncio waiting primitives, like asyncio.wait).

For this reason, bounded_gather accepts async functions as arguments instead of directly accepting coroutines. For example:

# this creates a coroutine for each argument to bad_bounded_gather
await bad_bounded_gather(async_do_something('argument'), ...)

# this allows bounded gather to bound the number of live coroutines (by waiting for
# previous coroutines to complete before invoking more)
async def async_delayed_do_something():
    return await async_do_something('argument')
await bounded_gather(async_delayed_do_something, ...)

In the first case, we create all the coroutines before even calling bad_bounded_gather. In the second case, bounded_gather can decide when to create coroutines (by invoking one of its arguments).


Usually, Hail team uses this pattern:

await bounded_gather(*[functools.partial(async_file_exists, fs, fname)
                       for fname in fnames],
                     parallelism=parallelism)

But we also want to update the progress bar after we check for file existence. I should maybe have called create_unapplied_function check_existence_and_update_pbar_thunk. I use the term "thunk" to refer to these zero-argument functions that delay the creation of coroutines.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for this detailed description Dan!!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seconded, thank you for this!! I added a small note to the docstring, but let me know if you think I should add a longer summary @jkgoodrich

@ch-kr ch-kr requested a review from danking July 27, 2021 15:21
Copy link
Contributor

@jkgoodrich jkgoodrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@danking danking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ch-kr ch-kr merged commit 2147e58 into master Jul 27, 2021
@ch-kr ch-kr deleted the file_exists branch July 27, 2021 15:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants