Race condition in work stealing resulting in deadlock #5370

Closed
fjetter opened this issue Sep 29, 2021 · 4 comments

fjetter commented Sep 29, 2021

User-observed behaviour: a key is shown as processing on a worker even though no progress is observed. If the worker is inspected directly, it is unaware of the key.

The following tries to outline the chain of events leading up to this deadlock. Workers are called Alice, Bob and Chuck. K is the key/task to be stolen. Other highlighted words are either data collections or methods of the work-stealing plugin or of the TaskState object.

I have not been able to reproduce it yet; this is purely theoretical and based on the observations in #5366.

Race condition (a toy model of this interleaving is sketched right after the list):

  • Scheduler: Transition K to processing

    • K is processing_on Alice
    • K is stealable
  • Balance: K is in stealable => maybe steal

  • Steal request to Alice, ID: 1

    • K removed from stealable
    • K put in in_flight with victim: Alice, thief: Chuck
  • Balance: K not in stealable => skip

  • Scheduler: Transition processing->released

    • K removed from in_flight
    • K removed from stealable
  • Scheduler: Transition ...->processing

    • K processing_on Bob
    • K added to stealable

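To make the interleaving above easier to follow, here is a minimal toy model of the bookkeeping it touches, using plain Python containers named after the collections in this issue (stealable, in_flight, processing_on). This is an illustrative sketch, not the distributed implementation.

```python
# Toy model of the scheduler-side stealing bookkeeping, NOT the actual
# distributed source. It only replays the interleaving from the list above.

stealable = set()      # keys currently eligible for stealing
in_flight = {}         # key -> {"victim": ..., "thief": ..., "id": ...}
processing_on = {}     # worker the scheduler believes is running the key

K = "K"

# Scheduler: transition K to processing on Alice
processing_on[K] = "Alice"
stealable.add(K)

# Balance: K is in stealable => steal request ID 1 to Alice (thief: Chuck)
stealable.discard(K)
in_flight[K] = {"victim": "Alice", "thief": "Chuck", "id": 1}

# Scheduler: transition processing -> released
in_flight.pop(K, None)
stealable.discard(K)

# Scheduler: transition ... -> processing, now on Bob
processing_on[K] = "Bob"
stealable.add(K)

# At this point the scheduler thinks K runs on Bob and K is stealable
# again, but the steal request ID 1 to Alice is still pending and no
# longer has a matching in_flight entry.
print(processing_on, stealable, in_flight)
# {'K': 'Bob'} {'K'} {}
```
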
From here, two scenarios are possible. Both are buggy, although with different severity.

Scenario A:

  • Steal-confirm from Alice, ID: 1
    • K not in WorkStealing.in_flight
    • raise KeyError and abort move_task_confirm
    • BUG: in_flight_occupancy is never readjusted

Everything afterwards works as expected, with the exception of the wrong occupancy.
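
The sketch below shows where Scenario A loses the occupancy adjustment. It is a heavily simplified stand-in for the move_task_request/move_task_confirm pair (the real methods track much more state); the only point it makes is that an early return on a missing in_flight entry also skips undoing the tentative in_flight_occupancy shift that the request applied.

```python
from collections import defaultdict

# Simplified sketch, not the real WorkStealing plugin: illustrates how an
# aborted confirmation can leave in_flight_occupancy permanently skewed.

class ToyWorkStealing:
    def __init__(self):
        self.in_flight = {}                            # key -> request info
        self.in_flight_occupancy = defaultdict(float)  # worker -> tentative delta

    def move_task_request(self, key, victim, thief, duration):
        self.in_flight[key] = {"victim": victim, "thief": thief, "duration": duration}
        # a steal request tentatively shifts occupancy from victim to thief
        self.in_flight_occupancy[victim] -= duration
        self.in_flight_occupancy[thief] += duration

    def move_task_confirm(self, key, worker):
        try:
            info = self.in_flight.pop(key)
        except KeyError:
            # Scenario A: the confirmation from Alice arrives after K was
            # released and rescheduled, so there is no in_flight entry any
            # more. Returning here means the shift above is never undone.
            return
        # normal path: undo the tentative occupancy shift
        self.in_flight_occupancy[info["victim"]] += info["duration"]
        self.in_flight_occupancy[info["thief"]] -= info["duration"]

ws = ToyWorkStealing()
ws.move_task_request("K", victim="Alice", thief="Chuck", duration=1.0)
ws.in_flight.pop("K")                  # processing->released wipes the entry
ws.move_task_confirm("K", worker="Alice")
print(dict(ws.in_flight_occupancy))    # {'Alice': -1.0, 'Chuck': 1.0}, never corrected
```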

Scenario B (see the sketch after this list):

  • Balance: K is in stealable => maybe steal

  • Steal request to Bob, ID: 2

    • K removed from stealable
    • K put in in_flight with victim: Bob, thief: Chuck
  • Response from Alice, ID: 1

    • Pop K from in_flight
    • K not processing_on victim (Alice)
    • Recalculate occupancy
  • Response from Bob, ID: 2

    • K not in in_flight
    • raise KeyError and abort move_task_confirm
    • BUG
      • Bob confirmed the steal and forgot the task.
      • Scheduler never registers the steal confirmation
      • Chuck is never assigned as processing_on
      • Task is never sent to Chuck
      • BUG: Following attempts to steal return state: None, which results in an already-computing message
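
One way to read Scenario B: Alice's stale confirmation (ID: 1) consumes the in_flight entry that belongs to Bob's request (ID: 2), so Bob's genuine confirmation finds nothing and is dropped, and the scheduler never hands the task to Chuck. Below is a hedged sketch of one possible guard: tag each request with an id, echo it in the response, and ignore confirmations whose id does not match the current in_flight entry. This is purely illustrative and not claimed to be the actual change made in #5379.

```python
import itertools

# Hypothetical hardening sketch (not the fix in #5379): only act on a
# confirmation if it carries the id of the *current* in_flight entry.

_request_ids = itertools.count(1)

in_flight = {}  # key -> {"victim": ..., "thief": ..., "request_id": ...}

def move_task_request(key, victim, thief):
    request_id = next(_request_ids)
    in_flight[key] = {"victim": victim, "thief": thief, "request_id": request_id}
    return request_id  # sent along with the request and echoed in the reply

def move_task_confirm(key, worker, request_id):
    info = in_flight.get(key)
    if info is None or info["request_id"] != request_id:
        # stale or unknown confirmation -> ignore it instead of popping
        # and discarding state that belongs to a newer request
        return
    del in_flight[key]
    # ... reassign the key to info["thief"], adjust occupancy, etc.

# Scenario B interleaving:
id1 = move_task_request("K", victim="Alice", thief="Chuck")  # request ID 1
in_flight.pop("K")                        # K released and rescheduled
id2 = move_task_request("K", victim="Bob", thief="Chuck")    # request ID 2
move_task_confirm("K", "Alice", id1)      # stale: ignored, entry untouched
move_task_confirm("K", "Bob", id2)        # matches: handled normally
```
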
fjetter commented Sep 29, 2021

There is another inconsistency, but I'm not entirely sure about the consequences yet. I'm looking at another deadlock with a different cause (without a processing->released transition), but something doesn't add up yet. Anyhow, this is what I know so far:

If _reevaluate_occupancy_worker is called after a steal-request was submitted but before the response for that request was received, it repopulates stealable, allowing a second, concurrent steal-request for the same key. This second steal-request overwrites the in_flight entry with new information which no longer matches the information of the first request. However, since the victim is the same, this should merely cause the first response to confirm the second request, while the second response should simply be ignored. This should not deadlock.
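
A short sketch of that overwrite, in the same toy-dict style as above (again, not the distributed implementation):

```python
# First steal request to Alice (ID 1) is submitted and still pending.
stealable = set()
in_flight = {"K": {"victim": "Alice", "thief": "Chuck", "id": 1}}

# _reevaluate_occupancy_worker runs before the response arrives and
# repopulates stealable, so Balance can pick K up again.
stealable.add("K")

# A second, concurrent steal request (ID 2) to the *same* victim
# overwrites the entry; it no longer matches request ID 1.
stealable.discard("K")
in_flight["K"] = {"victim": "Alice", "thief": "Chuck", "id": 2}

# The first response then effectively confirms the second request, while
# the second response finds no entry left and is simply ignored.
print(in_flight)  # {'K': {'victim': 'Alice', 'thief': 'Chuck', 'id': 2}}
```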

Edit:

If another steal-request is issued to another victim before the second response is received, we'll get the same situation outlined in Scenario B above. Deadlock!

fjetter commented Oct 1, 2021

I need to correct the analysis above. The deadlock is not triggered by a processing_on != victim condition, since this is guaranteed to be working. It is rather a mix of an already-computing response, followed by a rescheduling and a subsequent confirm. At the very least this is the first described scenario. The fix in #5379 should protect us from any concurrency-related errors 🤞

fjetter commented Oct 15, 2021

We are still seeing deadlocks after merging #5379, but we haven't uncovered the root cause yet.

fjetter commented Oct 27, 2021

I think this was closed by #5379
