Add retry to shuffle broadcast #8900
Conversation
Unit Test Results: 25 files ±0, 25 suites ±0, 10h 29m 27s ⏱️ +7m 4s. See the test report for an extended history of previous test failures; this is useful for diagnosing flaky tests. Results for commit fc6095d, compared against base commit 1205a70.
Closes #8659
This got a little uglier than I wanted, but now we're also distinguishing real errors from OSErrors.
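Roughly, the shape of the change (a simplified sketch, not the actual diff; the helper name and signature are made up, and the exact exception types are what the review comments below discuss):

```python
import asyncio


async def _broadcast_with_retry(scheduler, msg, workers, attempts=3, delay=0.1):
    """Broadcast ``msg`` to ``workers``, retrying only those that failed with OSError."""
    remaining = list(workers)
    no_progress = 0
    while remaining:
        before = len(remaining)
        # Ask the broadcast to hand back per-worker exceptions instead of raising,
        # so we can decide which failures are retryable.
        out = await scheduler.broadcast(msg=msg, workers=remaining, on_error="return")
        remaining = []
        for worker, result in out.items():
            if isinstance(result, OSError):
                remaining.append(worker)  # likely a connect timeout; retry this worker
            elif isinstance(result, Exception):
                raise result  # a real error; surface it immediately
        if len(remaining) == before:
            no_progress += 1
            if no_progress >= attempts:
                raise RuntimeError(f"Broadcast failed for workers {remaining}")
        else:
            no_progress = 0
        if remaining:
            await asyncio.sleep(delay)  # fixed backoff, see discussion below
```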
b81591b to 04e74c9 (compare)
        ({0: (5, OSError), 1: (1, OSError)}, P2PConsistencyError),
    ],
)
@pytest.mark.slow
All included, this is about 2.5s, so not incredibly slow; the backoff is just 0.1s, after all. If that issue persists we can add configuration, an exponential backoff, etc., but I'd rather not increase complexity further (and would rather get rid of the broadcast entirely instead).
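(Purely for illustration, and not part of this PR: if we ever do need it, a configurable exponential backoff would only be a few lines, e.g.)

```python
import asyncio


async def _sleep_with_backoff(attempt, base=0.1, factor=2.0, cap=2.0):
    # attempt 0 -> 0.1s, 1 -> 0.2s, 2 -> 0.4s, ... capped at 2.0s
    await asyncio.sleep(min(cap, base * factor**attempt))
```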
if isinstance(r, OSError):
    workers.append(w)
else:
    raise P2PConsistencyError(
Why raise a P2PConsistencyError here? I think we should just propagate the original error instead.
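i.e., something along these lines (just a sketch of the suggestion):

```python
if isinstance(r, OSError):
    workers.append(w)
else:
    # propagate the original exception instead of wrapping it
    raise r
```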
if len(workers) == before:
    no_progress += 1
    if no_progress >= 3:
        raise P2PConsistencyError(
I would raise a different type of error here. The P2P run is still internally consistent; it just failed.
Are you fine with a generic RuntimeError?
Yes, that would work. For context, I've imagined the P2PConsistencyError to be exclusively used for state/runtime inconsistencies. For transfer and unpack tasks, these actually cause us to reschedule the task instead of erring. (Come to think of it, I'm wondering if we should reschedule when encountering these in the barrier as well.)
(Come to think of it, I'm wondering if we should reschedule when encountering these in the barrier as well.)
I think the retry should be sufficient for now. Either way, this is out of scope for this PR.
I just checked and realized that we do not reschedule upon P2PConsistencyErrors but raise hard. We only reschedule on ShuffleClosedError.
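For reference, the distinction looks roughly like this (heavily simplified; the call and import paths are approximate, not the actual code):

```python
from distributed import Reschedule

# Import path is approximate; these are the shuffle extension's exceptions.
from distributed.shuffle._exceptions import P2PConsistencyError, ShuffleClosedError


async def _transfer(shuffle_run, data):
    try:
        await shuffle_run.receive(data)
    except ShuffleClosedError:
        # The shuffle was closed underneath us; let the scheduler reschedule the task.
        raise Reschedule()
    except P2PConsistencyError:
        # Inconsistent internal state; raise hard instead of rescheduling.
        raise
```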
Thanks, @fjetter!
One of our P2P stress tests is failing pretty regularly: distributed.tests.test_stress.test_close_connections.
This failure is somewhat expected because the broadcast connects to all workers, and the connection attempts may time out if the workers are too busy.
This is not an ideal fix (removing the broadcast entirely would be the better solution).
I opted not to implement a retry in broadcast itself, since that would have more serious implications, and the broadcast API already provides everything callers need to implement retries themselves.
I'm still missing a test, but I confirmed the fix with some manual patches. @hendrikmakait, if you have an idea for how to put together an easy test, that would be helpful.
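One rough idea, in case it helps (an untested sketch only; it assumes that faking per-worker OSErrors from Scheduler.broadcast is enough to exercise the new retry path, and the shuffle API spelling may differ by dask version):

```python
from unittest import mock

import pytest
import pandas as pd

import dask.dataframe as dd
from distributed.scheduler import Scheduler
from distributed.utils_test import gen_cluster


@gen_cluster(client=True)
async def test_barrier_broadcast_retries_on_oserror(c, s, a, b):
    pytest.importorskip("pyarrow")  # P2P shuffling needs pyarrow
    calls = {"n": 0}
    original = Scheduler.broadcast

    async def flaky_broadcast(self, *args, **kwargs):
        # Pretend every targeted worker timed out for the first two broadcasts,
        # then defer to the real implementation.
        calls["n"] += 1
        if calls["n"] <= 2:
            workers = kwargs.get("workers") or list(self.workers)
            return {w: OSError("simulated connect timeout") for w in workers}
        return await original(self, *args, **kwargs)

    with mock.patch.object(Scheduler, "broadcast", flaky_broadcast):
        df = dd.from_pandas(
            pd.DataFrame({"x": range(100), "y": range(100)}), npartitions=10
        )
        out = df.shuffle("x", shuffle_method="p2p")
        await c.compute(out)

    assert calls["n"] > 2  # the retry path was exercised
```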