
Add retry to shuffle broadcast #8900

Merged: 5 commits from the shuffle_broadcast_retry branch into dask:main on Oct 25, 2024

Conversation

fjetter (Member) commented on Oct 18, 2024

One of our P2P stress tests, distributed.tests.test_stress.test_close_connections, is failing pretty regularly.

[screenshot of the test failure]

This failure is somewhat expected because the broadcast is connecting to all workers and the connection attempts may time out if the workers are too busy.

This is not an ideal fix (instead, removing broadcast would be terrific).

I opted not to implement a retry in broadcast itself, since that would have had more serious implications, and the broadcast API provides everything users need to implement this themselves.

I'm still missing a test but confirmed this with some manual patches. @hendrikmakait, if you have an idea for how to put together an easy test, that would be helpful.
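To make the approach concrete, here is a minimal sketch of what such a retry loop around the broadcast could look like, assuming broadcast's on_error="return" mode so that per-worker exceptions come back as values instead of raising. The helper name, the hard-coded three-strikes limit, and the 0.1 s backoff echo the discussion further down in this PR, but the code is illustrative rather than the PR's actual implementation:

import asyncio


async def broadcast_with_retry(scheduler, msg, workers, backoff=0.1, max_no_progress=3):
    """Illustrative retry wrapper around Scheduler.broadcast.

    Workers whose result is an OSError (e.g. a timed-out connection attempt
    to a busy worker) are retried; any other exception is treated as a real
    error and re-raised immediately.
    """
    out = {}
    workers = list(workers)
    no_progress = 0
    while workers:
        before = len(workers)
        results = await scheduler.broadcast(
            msg=msg, workers=workers, on_error="return"
        )
        retry = []
        for w, r in results.items():
            if isinstance(r, OSError):
                retry.append(w)  # transient failure; try this worker again
            elif isinstance(r, Exception):
                raise r  # real error: propagate the original exception
            else:
                out[w] = r
        workers = retry
        if len(workers) == before:
            no_progress += 1
            if no_progress >= max_no_progress:
                raise RuntimeError(
                    f"Broadcast repeatedly failed to reach workers: {workers}"
                )
        else:
            no_progress = 0
        if workers:
            await asyncio.sleep(backoff)
    return out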

github-actions bot (Contributor) commented on Oct 18, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    25 files ±0     25 suites ±0    10h 29m 27s ⏱️ +7m 4s
 4 128 tests  +7    4 013 ✅  +8      110 💤 ±0    4 ❌ -2   1 🔥 +1
47 669 runs  +71   45 567 ✅ +72    2 087 💤 ±0   14 ❌ -2   1 🔥 +1

For more details on these failures and errors, see this check.

Results for commit fc6095d. ± Comparison against base commit 1205a70.

♻️ This comment has been updated with latest results.

@hendrikmakait (Member) commented:

Closes #8659

@fjetter marked this pull request as ready for review on October 21, 2024 09:16
fjetter (Member, Author) commented on Oct 21, 2024

This got a little uglier than I wanted, but now we're also distinguishing real errors from OSErrors.

@fjetter force-pushed the shuffle_broadcast_retry branch from b81591b to 04e74c9 on October 21, 2024 09:17
        ({0: (5, OSError), 1: (1, OSError)}, P2PConsistencyError),
    ],
)
@pytest.mark.slow
fjetter (Member, Author) commented:

All included, this is about 2.5 s, so not incredibly slow; the backoff is just 0.1 s after all. If the issue persists we can add configuration, an exponential backoff, etc., but I'd rather not increase complexity further (I would rather get rid of the broadcast instead).
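As a purely illustrative aside on the exponential backoff mentioned above: if the fixed 0.1 s backoff ever needs to grow, a capped exponential backoff is a small addition. None of the names or defaults below come from this PR:

import asyncio


async def sleep_backoff(attempt, base=0.1, factor=2.0, cap=5.0):
    # Sleep base * factor**attempt seconds, but never longer than cap.
    await asyncio.sleep(min(cap, base * factor ** attempt))

The retry loop would then pass its attempt (or no-progress) counter to this helper instead of sleeping a fixed interval.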

if isinstance(r, OSError):
    workers.append(w)
else:
    raise P2PConsistencyError(
hendrikmakait (Member) commented:

Why raise a P2PConsistencyError here? I think we should just propagate the original error instead.

if len(workers) == before:
    no_progress += 1
    if no_progress >= 3:
        raise P2PConsistencyError(
hendrikmakait (Member) commented:

I would raise a different type of error here. The P2P run is still internally consistent, it just failed.

fjetter (Member, Author) replied:

Are you fine with a generic RuntimeError?

hendrikmakait (Member) replied:

Yes, that would work. For context, I've imagined the P2PConsistencyError to be exclusively used for state/runtime inconsistencies. For transfer and unpack tasks, these actually cause us to reschedule the task instead of erring. (Come to think of it, I'm wondering if we should reschedule when encountering these in the barrier as well.)

fjetter (Member, Author) replied:

> (Come to think of it, I'm wondering if we should reschedule when encountering these in the barrier as well.)

I think the retry should be sufficient for now. Either way, this is out of scope for this PR

hendrikmakait (Member) replied:

I just checked and realized that we do not reschedule upon P2PConsistencyErrors but raise hard. We only reschedule on ShuffleClosedError.

@hendrikmakait (Member) left a review:

Thanks, @fjetter!

@hendrikmakait merged commit afa6e8d into dask:main on Oct 25, 2024
16 of 30 checks passed
@fjetter deleted the shuffle_broadcast_retry branch on November 7, 2024 09:39
Labels: None yet
Projects: None yet
2 participants