
rebalance() resilience to computations #4968

Merged: 44 commits into dask:main on Jul 14, 2021

Conversation

crusaderky (Collaborator) commented Jun 24, 2021:

Partial fix for #4906

In scope

  • Let rebalance() gracefully handle all possible race conditions that could be caused by a computation running at the same time on the cluster
  • Thorough unit test coverage for all the above race condition cases
  • Code deduplication with replicate(), which also gains increased resiliency as a result

Out of scope, left to future PRs

  • Let computations gracefully handle having keys suddenly disappear due to a rebalance()
  • Proper resiliency review of replicate()
  • Preempt the most common race conditions between rebalance() and computation; namely, skip in-memory tasks that are a dependency of a queued or running task (see the sketch after this list)
  • Investigate bulk vs. ad-hoc scheduler<->worker comms. At the moment the code uses Server.rpc like it did before.
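
To make the third bullet concrete, here is a purely hypothetical sketch of such a filter, assuming scheduler-side TaskState objects that expose state and dependents (future work, not code from this PR):

    # Hypothetical sketch only: this filtering is future work, not part of this PR.
    def rebalance_candidates(tasks):
        """Yield keys of in-memory tasks that no queued or running task depends on."""
        for ts in tasks.values():
            if ts.state != "memory":
                continue
            if any(dep.state in ("waiting", "processing") for dep in ts.dependents):
                # Moving this key now could race with the ongoing computation.
                continue
            yield ts.key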

CC @mrocklin @jrbourbeau @fjetter @gjoseph92

distributed/scheduler.py (outdated, resolved)
"""
result = await retry_operation(
self.rpc(addr=worker_address).gather, who_has=who_has
)
Member:

What happens if the worker disconnects?

crusaderky (Collaborator, author):

I'm working on baking graceful handling for that into self.rpc (in scope for this PR but not implemented yet; the unit tests for it are already there, though).

crusaderky (Collaborator, author):

it's now in
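
For context, a rough sketch of the shape this handling takes on the scheduler side: treat a broken comm as "all requested keys failed" instead of letting the exception propagate. Method and reply-format details are assumptions, not the exact merged code:

    async def _gather_on_worker(self, worker_address, who_has):
        """Ask a worker to fetch keys from peers; return the set of keys that failed."""
        try:
            result = await retry_operation(
                self.rpc(addr=worker_address).gather, who_has=who_has
            )
        except OSError:
            # The worker disconnected or the comm broke mid-transfer: report every
            # requested key as failed so rebalance()/replicate() can skip them.
            return set(who_has)
        if result["status"] == "OK":
            return set()
        # Assumed reply shape on partial failure: the keys that could not be fetched.
        return set(result["keys"])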

distributed/scheduler.py (outdated, resolved)
return {"status": "OK"}
missing_keys = {k for r in failed_keys_by_recipient.values() for k in r}
if missing_keys:
return {"status": "missing-data", "keys": list(missing_keys)}
Member:

This is currently forwarded to the user/caller of replicate/rebalance, correct? We should probably document what exactly missing-data means in this context. IIUC it does not necessarily mean that the task is lost, merely that it was not where we expected it to be, for whatever reason.

crusaderky (Collaborator, author) commented Jun 28, 2021:

I've changed Client.rebalance to raise KeyError if it receives missing-data and the client explicitly listed futures to be rebalanced; if futures were not specified, the status message is ignored.

crusaderky (Collaborator, author):

Correction: the logic described above actually lives in Scheduler.rebalance.
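
A simplified sketch of the behaviour described above, with the explicit-keys check living in Scheduler.rebalance (the helper name is illustrative, not the PR's actual code):

    async def rebalance(self, comm=None, keys=None, workers=None):
        # `keys` is the optional set of keys the client explicitly asked to rebalance.
        missing_keys = await self._move_keys(keys, workers)  # illustrative helper
        if missing_keys and keys is not None:
            # The caller named specific futures; surface the problem so that
            # Client.rebalance can turn this status into a KeyError.
            return {"status": "missing-data", "keys": list(missing_keys)}
        # Opportunistic rebalance of whatever happens to be in memory:
        # keys that moved or were released mid-flight are silently skipped.
        return {"status": "OK"}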

Comment on lines 5953 to 5955
await asyncio.gather(
    *(self._delete_worker_data(r, v) for r, v in to_senders.items())
)
Member:

What happens to the dependents of the to-be-deleted keys on the worker? IIUC the worker state machine is currently not equipped to deal with the transitions required for something like this. The necessary transitions on the worker side would be

    transition(ts, "memory" -> "fetch")
    for dep in ts.dependents: transition(dep, "ready" -> "waiting")

The delete/free/release path on the worker side is not that sophisticated. I remember KeyErrors popping up while debugging the recent deadlocks when I removed keys too eagerly.
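
Purely to illustrate the transitions sketched above (hypothetical helper and method names, not the real Worker API, and not implemented in this PR):

    def on_key_removed_by_rebalance(worker, ts):
        # The data left this worker, so it must be fetched again before it can be used.
        worker.transition(ts, "fetch")
        for dep in ts.dependents:
            # Anything that was ready to run is now missing an input again.
            if dep.state == "ready":
                worker.transition(dep, "waiting")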

crusaderky (Collaborator, author):

Resilience of a computation to having needed keys removed from under its feet is out of scope for this PR. As of this PR, it is still not robust to run rebalance() during a compute: rebalance will no longer crash, but the compute still will.

@crusaderky crusaderky closed this Jul 1, 2021
@crusaderky crusaderky reopened this Jul 1, 2021
@@ -329,7 +329,6 @@ async def test_remove_worker_from_scheduler(s, a, b):
await s.remove_worker(address=a.address)
assert a.address not in s.nthreads
assert len(s.workers[b.address].processing) == len(dsk) # b owns everything
s.validate_state()
crusaderky (Collaborator, author):

Redundant - it's already called by the gen_cluster cleanup code.

crusaderky (Collaborator, author):

@fjetter @mrocklin @jrbourbeau ready for final review. Unit tests have been extensively stress-tested.

@crusaderky crusaderky marked this pull request as ready for review July 2, 2021 12:34
fjetter (Member) left a comment:

Apart from a few nitpicks, the big things I want to discuss and settle before merging are:

  • Error handling around asyncio.gather
  • Use of transitions instead of a plain del parent.tasks

Otherwise the changes LGTM.

distributed/tests/test_worker.py (outdated, resolved)
distributed/worker.py (outdated, resolved)
distributed/scheduler.py (outdated, resolved)
@@ -6073,19 +6146,14 @@ async def replicate(
wws._address for wws in ts._who_has
]

results = await asyncio.gather(
await asyncio.gather(
Member:

In case of an exception in one of the _gather_on_worker tasks, the exception is, by default, raised immediately. Even though an exception is raised, the not-yet-completed tasks will continue running and are not cancelled. Regardless of whether we cancel them or not, we always lose the results of the successful ones. If any of the gather_on_worker calls fail, we'd get no logs at all.

I would suggest either moving all the log_event calls into the coroutines themselves, so that each one logs its own event, or adding the argument return_exceptions=True to asyncio.gather and handling the exceptions explicitly here.

The latter would also reduce the chance of introducing subtle bugs once the logic below this gather becomes more complicated (if ever). I don't have a strong opinion about which way we go, but I think we should handle this case properly.
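
A minimal sketch of the second option, assuming a to_recipients mapping of worker address to who_has dict and _gather_on_worker returning the set of failed keys (illustrative, not the PR's code):

    results = await asyncio.gather(
        *(self._gather_on_worker(w, d) for w, d in to_recipients.items()),
        return_exceptions=True,
    )
    for worker, result in zip(to_recipients, results):
        if isinstance(result, Exception):
            # A failure in one coroutine no longer hides the outcome of the others.
            logger.error("gather_on_worker failed for %s", worker, exc_info=result)
        else:
            self.log_event(worker, {"action": "rebalance", "failed-keys": list(result)})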

crusaderky (Collaborator, author) commented Jul 13, 2021:

Neither _gather_on_worker nor _delete_worker_data ever raises exceptions, though. Literally the only case where they do is event loop shutdown.
I've added comments to highlight this.

Member:

Well, that's not 100% true. At the very least, _delete_worker_data performs state transitions, which may raise. That likely doesn't justify dedicated exception handling here, so I'm good.

distributed/scheduler.py (resolved)
distributed/scheduler.py (resolved)
distributed/tests/test_scheduler.py (outdated, resolved)
distributed/tests/test_scheduler.py (outdated, resolved)

await cc.close_rpc()


crusaderky (Collaborator, author):

This is old garbage. The 'delete_data' handler does not exist. Note the 'dont_' prefix in the function name: this was never executed.

crusaderky (Collaborator, author):

@fjetter all review comments have been addressed

crusaderky (Collaborator, author) commented Jul 14, 2021:

@fjetter synced with main and ready for merge as soon as tests pass

@fjetter fjetter merged commit 5f01fe6 into dask:main Jul 14, 2021
@crusaderky crusaderky deleted the rebalance_in_compute branch July 14, 2021 13:34