Race condition in work stealing resulting in deadlock #5370
Comments
There is another inconsistency, but I'm not entirely sure about the consequences yet. I'm looking at another deadlock with a different cause (without a processing->released transition), but something doesn't add up yet. Anyhow, this is what I know so far: If Edit: If another |
I need to correct the analysis above. The deadlock is not triggered by a |
We are still seeing deadlocks after merging #5379, but we haven't uncovered the root cause yet.
I think this was closed by #5379.
User observed behaviour: A key is shown as processing on a worker even though no progress is observed. If the worker is investigated directly, it is unaware of the key.
The following tries to outline the chain of events leading up to this deadlock. Workers are called `Alice`, `Bob` and `Chuck`. `K` is the key/task to be stolen. Other highlighted words are either data collections or methods of the work-stealing plugin or the TaskState object.

I was not able to reproduce it yet; this is purely theoretical and based on the observations in #5366.
Race condition:
- Scheduler: Transition K to processing
  - `K` is `processing_on` `Alice`
  - `K` is `stealable`
- Balance: K is stealable => Maybe steal
  - steal request to Alice ID:1
  - `stealable`
  - `in_flight` w/ victim: `Alice`, thief: `Chuck`
- Balance: K not in `stealable` => Skip
- Scheduler: Transition processing->released
  - `in_flight`
  - `stealable`
- Scheduler: Transition ...->processing
  - `processing_on` `Bob`
  - `stealable`
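The shared prefix of both scenarios can be replayed with a toy model. This is only a sketch under assumptions: the names `processing_on`, `stealable` and `in_flight` mirror the collections mentioned above, but the functions `to_processing`, `balance` and `to_released` and the `sent` list are hypothetical stand-ins, not the work-stealing plugin's real code. In particular, it assumes that releasing `K` wipes the scheduler-side bookkeeping while the steal request with ID 1 stays on the wire.

```python
import itertools

# Toy scheduler-side state; the logic below is assumed, not taken from the plugin.
processing_on = {}         # key -> worker the scheduler believes is running it
stealable = set()          # keys that a balance round may pick
in_flight = {}             # key -> details of the pending steal request
sent = []                  # requests already on the wire (cannot be recalled)
ids = itertools.count(1)   # steal request IDs


def to_processing(key, worker):
    processing_on[key] = worker
    stealable.add(key)


def balance(key, thief):
    if key not in stealable:
        return "skip"
    stealable.discard(key)
    victim = processing_on[key]
    request_id = next(ids)
    in_flight[key] = {"victim": victim, "thief": thief, "id": request_id}
    sent.append((victim, key, request_id))
    return "steal"


def to_released(key):
    # Assumption: release clears the bookkeeping but cannot recall `sent`.
    processing_on.pop(key, None)
    stealable.discard(key)
    in_flight.pop(key, None)


to_processing("K", "Alice")
first = balance("K", "Chuck")    # steal request to Alice, ID 1
second = balance("K", "Chuck")   # K not in stealable => skip
to_released("K")
to_processing("K", "Bob")

print(first, second, processing_on, stealable, in_flight, sent)
```

At the end of the replay, `K` is stealable again and assigned to Bob, while Alice still owes a response to request 1. That outstanding response is what the two scenarios diverge on.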
From here, two scenarios are possible. Both are buggy, although with different severity.
Scenario A:
- `steal-confirm` from Alice ID: 1
  - `WorkStealing.in_flight`
  - `move_task_confirm`
  - `in_flight_occupancy` never readjusted
- Everything afterwards works as expected with the exception of wrong occupancy
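A minimal sketch of how the occupancy drift in Scenario A could arise. The names `in_flight`, `in_flight_occupancy` and `move_task_confirm` come from the issue; their bodies here are assumptions, as are the helper names `send_steal_request` and `release` and the duration value. The sketch assumes that a steal request shifts a pending occupancy delta from victim to thief, that release drops the `in_flight` entry without undoing the delta, and that a confirm with no matching entry returns early.

```python
from collections import defaultdict

in_flight = {}
in_flight_occupancy = defaultdict(float)  # worker -> pending occupancy delta


def send_steal_request(key, victim, thief, duration, request_id):
    in_flight[key] = {
        "victim": victim,
        "thief": thief,
        "duration": duration,
        "id": request_id,
    }
    in_flight_occupancy[thief] += duration
    in_flight_occupancy[victim] -= duration


def release(key):
    # Assumption: the entry is dropped without reverting the occupancy delta.
    in_flight.pop(key, None)


def move_task_confirm(key):
    entry = in_flight.pop(key, None)
    if entry is None:
        return False  # nothing in flight for this key: the response is ignored
    in_flight_occupancy[entry["thief"]] -= entry["duration"]
    in_flight_occupancy[entry["victim"]] += entry["duration"]
    return True


send_steal_request("K", "Alice", "Chuck", duration=2.0, request_id=1)
release("K")
handled = move_task_confirm("K")  # steal-confirm from Alice, ID 1

print(handled, dict(in_flight_occupancy))
```

In this model the late confirm finds nothing to act on, so the delta recorded when the request was sent stays on both workers for good.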
Scenario B:
- Balance: K in `stealable` => `maybe_steal`
  - steal request to `Bob` ID: 2
  - `stealable`
  - `in_flight`; victim: `Bob`, thief: `Chuck`
- Response from Alice ID: 1
  - `in_flight`
- Response from Bob ID: 2
  - `in_flight`
  - `move_task_confirm`
- `Bob` confirmed the steal and forgot the task. `Chuck` is never assigned as `processing_on`
  - `Chuck` `already-computing` message
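Scenario B can be sketched the same way. Everything below is a toy model under stated assumptions: `move_task_confirm` is assumed to look up `in_flight` by key only, without comparing the request ID of the response; Alice is assumed to reject request 1 because `K` was released on her; and `worker_tasks`, `worker_confirms` and the state strings are hypothetical names introduced for illustration.

```python
# State right after the second steal request (ID 2) was sent to Bob.
processing_on = {"K": "Bob"}   # scheduler view
worker_tasks = {"Alice": set(), "Bob": {"K"}, "Chuck": set()}  # worker view
in_flight = {"K": {"victim": "Bob", "thief": "Chuck", "id": 2}}


def worker_confirms(worker, key):
    """Worker side: a victim that confirms the steal forgets the task."""
    worker_tasks[worker].discard(key)


def move_task_confirm(key, state):
    """Scheduler side. Assumption: the request ID is never checked."""
    entry = in_flight.pop(key, None)
    if entry is None:
        return "ignored"
    if state != "confirm":
        return "rejected"
    processing_on[key] = entry["thief"]
    worker_tasks[entry["thief"]].add(key)
    return "moved"


# Stale response from Alice (ID 1) consumes the entry that belongs to ID 2.
r1 = move_task_confirm("K", state="reject")

# Bob confirms request 2 and forgets K, but no entry is left to act on.
worker_confirms("Bob", "K")
r2 = move_task_confirm("K", state="confirm")

holders = [w for w, keys in worker_tasks.items() if "K" in keys]
print(r1, r2, processing_on, holders)
```

The end state matches the observed behaviour: the scheduler still lists `K` as processing on Bob, while no worker knows the key.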