Race condition in work stealing resulting in deadlock #5370

Closed
fjetter opened this issue Sep 29, 2021 · 4 comments

fjetter commented Sep 29, 2021

User-observed behaviour: a key is shown as processing on a worker even though no progress is observed. If the worker is inspected directly, it is unaware of the key.

The following tries to outline the chain of events leading up to this deadlock. Workers are called Alice, Bob and Chuck. K is the key/task to be stolen. Other highlighted words are either data collections or methods of the work-stealing plugin or of the TaskState object.

I have not been able to reproduce it yet; this is purely theoretical and based on the observations in #5366.

Race condition (a toy model of this interleaving is sketched right after the list):

  • Scheduler: Transition K to processing

    • K is processing_on Alice
    • K is stealable
  • Balance: K is in stealable => maybe steal

  • Steal request to Alice, ID: 1

    • K removed from stealable
    • K put in in_flight with victim: Alice, thief: Chuck
  • Balance: K not in stealable => skip

  • Scheduler: Transition processing->released

    • K removed from in_flight
    • K removed from stealable
  • Scheduler: Transition ...->processing

    • K processing_on Bob
    • K added to stealable

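To make the interleaving above easier to follow, here is a minimal toy model of the bookkeeping it touches, using plain Python containers named after the collections in this issue (stealable, in_flight, processing_on). This is an illustrative sketch, not the distributed implementation.

```python
# Toy model of the scheduler-side stealing bookkeeping, NOT the actual
# distributed source. It only replays the interleaving from the list above.

stealable = set()      # keys currently eligible for stealing
in_flight = {}         # key -> {"victim": ..., "thief": ..., "id": ...}
processing_on = {}     # worker the scheduler believes is running the key

K = "K"

# Scheduler: transition K to processing on Alice
processing_on[K] = "Alice"
stealable.add(K)

# Balance: K is in stealable => steal request ID 1 to Alice (thief: Chuck)
stealable.discard(K)
in_flight[K] = {"victim": "Alice", "thief": "Chuck", "id": 1}

# Scheduler: transition processing -> released
in_flight.pop(K, None)
stealable.discard(K)

# Scheduler: transition ... -> processing, now on Bob
processing_on[K] = "Bob"
stealable.add(K)

# At this point the scheduler thinks K runs on Bob and K is stealable
# again, but the steal request ID 1 to Alice is still pending and no
# longer has a matching in_flight entry.
print(processing_on, stealable, in_flight)
# {'K': 'Bob'} {'K'} {}
```
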
From here, two scenarios are possible. Both are buggy, although with different severity.

Scenario A:

  • Steal-confirm from Alice, ID: 1
    • K not in WorkStealing.in_flight
    • raise KeyError and abort move_task_confirm
    • BUG: in_flight_occupancy is never readjusted

Everything afterwards works as expected, with the exception of the wrong occupancy.
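
The sketch below shows where Scenario A loses the occupancy adjustment. It is a heavily simplified stand-in for the move_task_request/move_task_confirm pair (the real methods track much more state); the only point it makes is that an early return on a missing in_flight entry also skips undoing the tentative in_flight_occupancy shift that the request applied.

```python
from collections import defaultdict

# Simplified sketch, not the real WorkStealing plugin: illustrates how an
# aborted confirmation can leave in_flight_occupancy permanently skewed.

class ToyWorkStealing:
    def __init__(self):
        self.in_flight = {}                            # key -> request info
        self.in_flight_occupancy = defaultdict(float)  # worker -> tentative delta

    def move_task_request(self, key, victim, thief, duration):
        self.in_flight[key] = {"victim": victim, "thief": thief, "duration": duration}
        # a steal request tentatively shifts occupancy from victim to thief
        self.in_flight_occupancy[victim] -= duration
        self.in_flight_occupancy[thief] += duration

    def move_task_confirm(self, key, worker):
        try:
            info = self.in_flight.pop(key)
        except KeyError:
            # Scenario A: the confirmation from Alice arrives after K was
            # released and rescheduled, so there is no in_flight entry any
            # more. Returning here means the shift above is never undone.
            return
        # normal path: undo the tentative occupancy shift
        self.in_flight_occupancy[info["victim"]] += info["duration"]
        self.in_flight_occupancy[info["thief"]] -= info["duration"]

ws = ToyWorkStealing()
ws.move_task_request("K", victim="Alice", thief="Chuck", duration=1.0)
ws.in_flight.pop("K")                  # processing->released wipes the entry
ws.move_task_confirm("K", worker="Alice")
print(dict(ws.in_flight_occupancy))    # {'Alice': -1.0, 'Chuck': 1.0}, never corrected
```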

Scenario B (see the sketch after this list):

  • Balance: K is in stealable => maybe steal

  • Steal request to Bob, ID: 2

    • K removed from stealable
    • K put in in_flight with victim: Bob, thief: Chuck
  • Response from Alice, ID: 1

    • Pop K from in_flight
    • K not processing_on victim (Alice)
    • Recalculate occupancy
  • Response from Bob, ID: 2

    • K not in in_flight
    • raise KeyError and abort move_task_confirm
    • BUG
      • Bob confirmed the steal and forgot the task.
      • Scheduler never registers the steal confirmation
      • Chuck is never assigned as processing_on
      • Task is never sent to Chuck
      • BUG: Following attempts to steal return state: None, which results in an already-computing message
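
One way to read Scenario B: Alice's stale confirmation (ID: 1) consumes the in_flight entry that belongs to Bob's request (ID: 2), so Bob's genuine confirmation finds nothing and is dropped, and the scheduler never hands the task to Chuck. Below is a hedged sketch of one possible guard: tag each request with an id, echo it in the response, and ignore confirmations whose id does not match the current in_flight entry. This is purely illustrative and not claimed to be the actual change made in #5379.

```python
import itertools

# Hypothetical hardening sketch (not the fix in #5379): only act on a
# confirmation if it carries the id of the *current* in_flight entry.

_request_ids = itertools.count(1)

in_flight = {}  # key -> {"victim": ..., "thief": ..., "request_id": ...}

def move_task_request(key, victim, thief):
    request_id = next(_request_ids)
    in_flight[key] = {"victim": victim, "thief": thief, "request_id": request_id}
    return request_id  # sent along with the request and echoed in the reply

def move_task_confirm(key, worker, request_id):
    info = in_flight.get(key)
    if info is None or info["request_id"] != request_id:
        # stale or unknown confirmation -> ignore it instead of popping
        # and discarding state that belongs to a newer request
        return
    del in_flight[key]
    # ... reassign the key to info["thief"], adjust occupancy, etc.

# Scenario B interleaving:
id1 = move_task_request("K", victim="Alice", thief="Chuck")  # request ID 1
in_flight.pop("K")                        # K released and rescheduled
id2 = move_task_request("K", victim="Bob", thief="Chuck")    # request ID 2
move_task_confirm("K", "Alice", id1)      # stale: ignored, entry untouched
move_task_confirm("K", "Bob", id2)        # matches: handled normally
```
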
fjetter commented Sep 29, 2021

There is another inconsistency, but I'm not entirely sure about the consequences yet. I'm looking at another deadlock with a different cause (without a processing->released transition), but something doesn't add up yet. Anyhow, this is what I know so far:

If _reevaluate_occupancy_worker is called after a steal-request was submitted but before the response for that request was received, it repopulates stealable, allowing a second, concurrent steal-request for the same key. This second steal-request overwrites the in_flight entry with new information which no longer matches the information of the first request. However, since the victim is the same, this should merely cause the first response to confirm the second request, while the second response should simply be ignored. This should not deadlock.
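
A short sketch of that overwrite, in the same toy-dict style as above (again, not the distributed implementation):

```python
# First steal request to Alice (ID 1) is submitted and still pending.
stealable = set()
in_flight = {"K": {"victim": "Alice", "thief": "Chuck", "id": 1}}

# _reevaluate_occupancy_worker runs before the response arrives and
# repopulates stealable, so Balance can pick K up again.
stealable.add("K")

# A second, concurrent steal request (ID 2) to the *same* victim
# overwrites the entry; it no longer matches request ID 1.
stealable.discard("K")
in_flight["K"] = {"victim": "Alice", "thief": "Chuck", "id": 2}

# The first response then effectively confirms the second request, while
# the second response finds no entry left and is simply ignored.
print(in_flight)  # {'K': {'victim': 'Alice', 'thief': 'Chuck', 'id': 2}}
```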

Edit:

If another steal-request is issued to another victim before the second response is received, we'll get the same situation outlined in Scenario B above. Deadlock!

fjetter commented Oct 1, 2021

I need to correct the analysis above. The deadlock is not triggered by a processing_on != victim condition, since this is guaranteed to be working. It is rather a mix of an already-computing response, followed by a rescheduling and a subsequent confirm. At the very least this is the first described scenario. The fix in #5379 should protect us from any concurrency-related errors 🤞

fjetter commented Oct 15, 2021

We are still seeing deadlocks after merging #5379, but we haven't uncovered the root cause yet.

fjetter commented Oct 27, 2021

I think this was closed by #5379
