Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlocks in condition_variable #689

Merged
merged 2 commits into from
May 17, 2023

Conversation

msimberg
Copy link
Contributor

@msimberg msimberg commented May 17, 2023

This fixes deadlocks that started appearing after #672 which disabled yielding for spinlock.

(One of) the deadlock(s) was the following scenario:

thread 1 thread 2
wait_until
take lock
add self to cv queue
release lock
timed suspend
notify_all
take lock
timed resume attempt to set thread 1 to pending
attempt to take lock fail because thread 1 is active
spin trying to take lock spin waiting for thread 1 to not be active
deadlock deadlock

This PR changes notify_all to not hold the lock while resuming threads that need to be woken up. I see no reason to keep the lock for that time since there is anyway a delay between setting a thread to pending and the thread actually being run by a worker thread, with the latter not happening under a lock already right now. This PR just relaxes that constraint further. It also significantly reduces the time the lock is held in notify_all. I'm quite sure this change is safe but we'll need to continue looking out for failures in CI in case I've missed something.

I've also reverted the change in set_thread_state to never yield from #672. Since the lock in notify_all is no longer held while resuming threads it's again safe to yield in set_thread_state.

I think spurious wakeups were probably possible before this change, but if they weren't they're now definitely possible with pika's condition_variable.

msimberg added 2 commits May 17, 2023 10:09
Only hold lock while swapping the queue of waiting threads into a local
queue. Retake lock if there's an error and the queue has to be spliced
back into the member variable queue.
This change was only introduced to never yield while holding a lock in
condition_variable::notify_all. However, that was relaxed to not hold a
lock while resuming threads so we can again allow yielding in
set_thread_state.
@msimberg msimberg added this to the 0.16.0 milestone May 17, 2023
@msimberg msimberg self-assigned this May 17, 2023
@msimberg msimberg requested review from biddisco and aurianer May 17, 2023 08:49
@msimberg
Copy link
Contributor Author

bors try

bors bot added a commit that referenced this pull request May 17, 2023
@bors
Copy link
Contributor

bors bot commented May 17, 2023

try

Build failed:

@msimberg msimberg force-pushed the condition-variable-deadlocks branch from 73f49cc to 17028a4 Compare May 17, 2023 12:30
@msimberg
Copy link
Contributor Author

bors try

bors bot added a commit that referenced this pull request May 17, 2023
@msimberg
Copy link
Contributor Author

bors try-

@msimberg msimberg force-pushed the condition-variable-deadlocks branch from 17028a4 to 449c0ad Compare May 17, 2023 12:32
@msimberg
Copy link
Contributor Author

bors try

bors bot added a commit that referenced this pull request May 17, 2023
@bors
Copy link
Contributor

bors bot commented May 17, 2023

try

Build failed:

@msimberg msimberg force-pushed the condition-variable-deadlocks branch from 449c0ad to a46a7ce Compare May 17, 2023 13:33
@msimberg
Copy link
Contributor Author

There were zero hangs in condition_variable2 in the last CI run (which repeated tests 10 times), so that is fixed. There was one failure like this though: https://cdash.cscs.ch/test/83488889, which was there already before #672, so we're back to "normal".

@msimberg msimberg marked this pull request as ready for review May 17, 2023 13:35
@msimberg
Copy link
Contributor Author

bors try

bors bot added a commit that referenced this pull request May 17, 2023
@bors
Copy link
Contributor

bors bot commented May 17, 2023

try

Build succeeded!

The publicly hosted instance of bors-ng is deprecated and will go away soon.

If you want to self-host your own instance, instructions are here.
For more help, visit the forum.

If you want to switch to GitHub's built-in merge queue, visit their help page.

@msimberg
Copy link
Contributor Author

bors merge

bors bot added a commit that referenced this pull request May 17, 2023
689: Fix deadlocks in `condition_variable` r=msimberg a=msimberg

This fixes deadlocks that started appearing after #672 which disabled yielding for `spinlock`.

(One of) the deadlock(s) was the following scenario:

| thread 1 | thread 2 |
|-|-|
| wait_until | |
| take lock | |
| add self to cv queue | | 
| release lock | |
| timed suspend | |
| | notify_all |
| | take lock |
| timed resume | attempt to set thread 1 to pending |
| attempt to take lock | fail because thread 1 is active |
| spin trying to take lock | spin waiting for thread 1 to not be active |
| deadlock | deadlock |

This PR changes `notify_all` to not hold the lock while resuming threads that need to be woken up. I see no reason to keep the lock for that time since there is anyway a delay between setting a thread to `pending` and the thread actually being run by a worker thread, with the latter _not_ happening under a lock already right now. This PR just relaxes that constraint further. It also significantly reduces the time the lock is held in `notify_all`. I'm quite sure this change is safe but we'll need to continue looking out for failures in CI in case I've missed something.

I've also reverted the change in `set_thread_state` to never yield from #672. Since the lock in `notify_all` is no longer held while resuming threads it's again safe to yield in `set_thread_state`.

I think spurious wakeups were probably possible before this change, but if they weren't they're now definitely possible with pika's `condition_variable`. 

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
@bors
Copy link
Contributor

bors bot commented May 17, 2023

Build failed:

@msimberg
Copy link
Contributor Author

bors merge

bors bot added a commit that referenced this pull request May 17, 2023
689: Fix deadlocks in `condition_variable` r=msimberg a=msimberg

This fixes deadlocks that started appearing after #672 which disabled yielding for `spinlock`.

(One of) the deadlock(s) was the following scenario:

| thread 1 | thread 2 |
|-|-|
| wait_until | |
| take lock | |
| add self to cv queue | | 
| release lock | |
| timed suspend | |
| | notify_all |
| | take lock |
| timed resume | attempt to set thread 1 to pending |
| attempt to take lock | fail because thread 1 is active |
| spin trying to take lock | spin waiting for thread 1 to not be active |
| deadlock | deadlock |

This PR changes `notify_all` to not hold the lock while resuming threads that need to be woken up. I see no reason to keep the lock for that time since there is anyway a delay between setting a thread to `pending` and the thread actually being run by a worker thread, with the latter _not_ happening under a lock already right now. This PR just relaxes that constraint further. It also significantly reduces the time the lock is held in `notify_all`. I'm quite sure this change is safe but we'll need to continue looking out for failures in CI in case I've missed something.

I've also reverted the change in `set_thread_state` to never yield from #672. Since the lock in `notify_all` is no longer held while resuming threads it's again safe to yield in `set_thread_state`.

I think spurious wakeups were probably possible before this change, but if they weren't they're now definitely possible with pika's `condition_variable`. 

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
@bors
Copy link
Contributor

bors bot commented May 17, 2023

Build failed:

@msimberg
Copy link
Contributor Author

This also happened: https://cdash.cscs.ch/test/83546786. More investigation needed after this is merged.

bors merge

bors bot added a commit that referenced this pull request May 17, 2023
689: Fix deadlocks in `condition_variable` r=msimberg a=msimberg

This fixes deadlocks that started appearing after #672 which disabled yielding for `spinlock`.

(One of) the deadlock(s) was the following scenario:

| thread 1 | thread 2 |
|-|-|
| wait_until | |
| take lock | |
| add self to cv queue | | 
| release lock | |
| timed suspend | |
| | notify_all |
| | take lock |
| timed resume | attempt to set thread 1 to pending |
| attempt to take lock | fail because thread 1 is active |
| spin trying to take lock | spin waiting for thread 1 to not be active |
| deadlock | deadlock |

This PR changes `notify_all` to not hold the lock while resuming threads that need to be woken up. I see no reason to keep the lock for that time since there is anyway a delay between setting a thread to `pending` and the thread actually being run by a worker thread, with the latter _not_ happening under a lock already right now. This PR just relaxes that constraint further. It also significantly reduces the time the lock is held in `notify_all`. I'm quite sure this change is safe but we'll need to continue looking out for failures in CI in case I've missed something.

I've also reverted the change in `set_thread_state` to never yield from #672. Since the lock in `notify_all` is no longer held while resuming threads it's again safe to yield in `set_thread_state`.

I think spurious wakeups were probably possible before this change, but if they weren't they're now definitely possible with pika's `condition_variable`. 

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
@bors
Copy link
Contributor

bors bot commented May 17, 2023

Build failed:

@msimberg
Copy link
Contributor Author

bors merge

bors bot added a commit that referenced this pull request May 17, 2023
689: Fix deadlocks in `condition_variable` r=msimberg a=msimberg

This fixes deadlocks that started appearing after #672 which disabled yielding for `spinlock`.

(One of) the deadlock(s) was the following scenario:

| thread 1 | thread 2 |
|-|-|
| wait_until | |
| take lock | |
| add self to cv queue | | 
| release lock | |
| timed suspend | |
| | notify_all |
| | take lock |
| timed resume | attempt to set thread 1 to pending |
| attempt to take lock | fail because thread 1 is active |
| spin trying to take lock | spin waiting for thread 1 to not be active |
| deadlock | deadlock |

This PR changes `notify_all` to not hold the lock while resuming threads that need to be woken up. I see no reason to keep the lock for that time since there is anyway a delay between setting a thread to `pending` and the thread actually being run by a worker thread, with the latter _not_ happening under a lock already right now. This PR just relaxes that constraint further. It also significantly reduces the time the lock is held in `notify_all`. I'm quite sure this change is safe but we'll need to continue looking out for failures in CI in case I've missed something.

I've also reverted the change in `set_thread_state` to never yield from #672. Since the lock in `notify_all` is no longer held while resuming threads it's again safe to yield in `set_thread_state`.

I think spurious wakeups were probably possible before this change, but if they weren't they're now definitely possible with pika's `condition_variable`. 

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
@bors
Copy link
Contributor

bors bot commented May 17, 2023

Build failed:

@msimberg msimberg merged commit 026e9a8 into pika-org:main May 17, 2023
@msimberg msimberg deleted the condition-variable-deadlocks branch May 17, 2023 16:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

1 participant