
Add retry to shuffle broadcast #8900

Merged: 5 commits from the shuffle_broadcast_retry branch into dask:main on Oct 25, 2024

Conversation

fjetter (Member) commented on Oct 18, 2024

One of our P2P stress tests, distributed.tests.test_stress.test_close_connections, is failing pretty regularly.

[screenshot of the test failure]

This failure is somewhat expected because the broadcast is connecting to all workers and the connection attempts may time out if the workers are too busy.

This is not an ideal fix (instead, removing broadcast would be terrific).

I opted not to implement a retry in broadcast itself, since that would have had more serious implications, and the broadcast API provides everything users need to implement this themselves.

I'm still missing a test but confirmed this with some manual patches. @hendrikmakait, if you have an idea for how to put together an easy test, that would be helpful.
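To make the approach concrete, here is a minimal sketch of what such a retry loop around the broadcast could look like, assuming broadcast's on_error="return" mode so that per-worker exceptions come back as values instead of raising. The helper name, the hard-coded three-strikes limit, and the 0.1 s backoff echo the discussion further down in this PR, but the code is illustrative rather than the PR's actual implementation:

import asyncio


async def broadcast_with_retry(scheduler, msg, workers, backoff=0.1, max_no_progress=3):
    """Illustrative retry wrapper around Scheduler.broadcast.

    Workers whose result is an OSError (e.g. a timed-out connection attempt
    to a busy worker) are retried; any other exception is treated as a real
    error and re-raised immediately.
    """
    out = {}
    workers = list(workers)
    no_progress = 0
    while workers:
        before = len(workers)
        results = await scheduler.broadcast(
            msg=msg, workers=workers, on_error="return"
        )
        retry = []
        for w, r in results.items():
            if isinstance(r, OSError):
                retry.append(w)  # transient failure; try this worker again
            elif isinstance(r, Exception):
                raise r  # real error: propagate the original exception
            else:
                out[w] = r
        workers = retry
        if len(workers) == before:
            no_progress += 1
            if no_progress >= max_no_progress:
                raise RuntimeError(
                    f"Broadcast repeatedly failed to reach workers: {workers}"
                )
        else:
            no_progress = 0
        if workers:
            await asyncio.sleep(backoff)
    return out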

github-actions bot (Contributor) commented on Oct 18, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    25 files ±0     25 suites ±0    10h 29m 27s ⏱️ +7m 4s
 4 128 tests  +7    4 013 ✅  +8      110 💤 ±0    4 ❌ -2   1 🔥 +1
47 669 runs  +71   45 567 ✅ +72    2 087 💤 ±0   14 ❌ -2   1 🔥 +1

For more details on these failures and errors, see this check.

Results for commit fc6095d. ± Comparison against base commit 1205a70.

♻️ This comment has been updated with latest results.

@hendrikmakait (Member) commented:

Closes #8659

@fjetter marked this pull request as ready for review on October 21, 2024 09:16
fjetter (Member, Author) commented on Oct 21, 2024

This got a little uglier than I wanted, but now we're also distinguishing real errors from OSErrors.

@fjetter force-pushed the shuffle_broadcast_retry branch from b81591b to 04e74c9 on October 21, 2024 09:17
        ({0: (5, OSError), 1: (1, OSError)}, P2PConsistencyError),
    ],
)
@pytest.mark.slow
fjetter (Member, Author) commented:

All included, this is about 2.5 s, so not incredibly slow; the backoff is just 0.1 s after all. If the issue persists we can add configuration, an exponential backoff, etc., but I'd rather not increase complexity further (I would rather get rid of the broadcast instead).
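As a purely illustrative aside on the exponential backoff mentioned above: if the fixed 0.1 s backoff ever needs to grow, a capped exponential backoff is a small addition. None of the names or defaults below come from this PR:

import asyncio


async def sleep_backoff(attempt, base=0.1, factor=2.0, cap=5.0):
    # Sleep base * factor**attempt seconds, but never longer than cap.
    await asyncio.sleep(min(cap, base * factor ** attempt))

The retry loop would then pass its attempt (or no-progress) counter to this helper instead of sleeping a fixed interval.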

if isinstance(r, OSError):
    workers.append(w)
else:
    raise P2PConsistencyError(
hendrikmakait (Member) commented:

Why raise a P2PConsistencyError here? I think we should just propagate the original error instead.

if len(workers) == before:
    no_progress += 1
    if no_progress >= 3:
        raise P2PConsistencyError(
hendrikmakait (Member) commented:

I would raise a different type of error here. The P2P run is still internally consistent, it just failed.

fjetter (Member, Author) replied:

Are you fine with a generic RuntimeError?

hendrikmakait (Member) replied:

Yes, that would work. For context, I've imagined the P2PConsistencyError to be exclusively used for state/runtime inconsistencies. For transfer and unpack tasks, these actually cause us to reschedule the task instead of erring. (Come to think of it, I'm wondering if we should reschedule when encountering these in the barrier as well.)

fjetter (Member, Author) replied:

> (Come to think of it, I'm wondering if we should reschedule when encountering these in the barrier as well.)

I think the retry should be sufficient for now. Either way, this is out of scope for this PR

hendrikmakait (Member) replied:

I just checked and realized that we do not reschedule upon P2PConsistencyErrors but raise hard. We only reschedule on ShuffleClosedError.

@hendrikmakait (Member) left a review:

Thanks, @fjetter!

@hendrikmakait merged commit afa6e8d into dask:main on Oct 25, 2024
16 of 30 checks passed
@fjetter deleted the shuffle_broadcast_retry branch on November 7, 2024 09:39
Labels: None yet
Projects: None yet
2 participants