Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handle spurious wake ups in BarrierSemaphore #82

Closed
wants to merge 1 commit into from

Conversation

rkaminsk
Copy link
Member

This PR updates the BarrierSemaphore class to correctly handle spurious wake ups of the wait function of the condition variable. One could combine the two counters into one but this probably does not matter performance wise because there is just one object per clasp instance.

TODO: I have seen that there are further usages of conditional variables in the code which might exhibit the same problem. I'ld could have a look to but would like to know what you think about this patch first.

@@ -127,21 +134,29 @@ class BarrierSemaphore {
void unsafe_reset(uint32 semCount) {
int prev = counter_;
counter_ = semCount;
if (prev < 0) { semCond_.notify_all(); }
if (prev < 0) {
notify_ = wait_;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the mutex be unlocked before notification?

@BenKaufmann
Copy link
Contributor

@rkaminsk Thanks for the PR. Looks like the BarrierSemaphoreclass is indeed broken. However, I don't think your change is a complete fix.
Assume the following: 3 Threads, Counter = 0

  • Thread 0 calls down -> counter = -1, wait = 1, notify = 0
  • Thread 1 calls wait -> counter = -2, wait = 2, notify = 0
  • Thread 2 calls up -> counter = -1, notify = 1, notify_one

This could wake up Thread 1, which would now return from the wait although the barrier has never been cleared.

Anyhow, now that I have looked at this again, I think the whole class is unnecessary and misguided. So, instead of fixing BarrierSemaphore, I'd prefer dropping it in favor of a less ambitious approach:

I drafted something here: https://github.com/potassco/clasp/tree/issue-80

Could you maybe check whether this works for you?

@rkaminsk
Copy link
Member Author

@rkaminsk Thanks for the PR. Looks like the BarrierSemaphoreclass is indeed broken. However, I don't think your change is a complete fix. Assume the following: 3 Threads, Counter = 0

* Thread 0 calls `down` -> counter = -1, wait = 1, notify = 0

* Thread 1 calls `wait` -> counter = -2, wait = 2, notify = 0

* Thread 2 calls `up` -> counter = -1, notify = 1, `notify_one`

This could wake up Thread 1, which would now return from the wait although the barrier has never been cleared.

I only have a general idea how the parallel solving works. I don't know what the points are when synchronization requests are happening. My assumptions was that spurious wake ups were not handled correctly. This sounds rather like a logical error. It did not happen in my tests.

Anyhow, now that I have looked at this again, I think the whole class is unnecessary and misguided. So, instead of fixing BarrierSemaphore, I'd prefer dropping it in favor of a less ambitious approach:

I drafted something here: https://github.com/potassco/clasp/tree/issue-80

Could you maybe check whether this works for you?

It runs fine in my virtual machine with your patch.

@BenKaufmann
Copy link
Contributor

I only have a general idea how the parallel solving works. I don't know what the points are when synchronization requests are happening. My assumptions was that spurious wake ups were not handled correctly. This sounds rather like a logical error. It did not happen in my tests.

I'm sorry if my answer sounded like criticism. It was not meant like this. Your investigation is spot on and greatly appreciated as it would probably have taken me another couple of weeks to dig deep enough. I'm also sure that your suggested fix would work in this context. It's just that your investigation made me realize that the existing BarrierSemaphore class is a failed abstraction that is too specific to be usable in a general context and too general for this particular context, which actually doesn't need a semaphore at all.

I'll close this PR now but thanks again for solving the actual issue 👍

@rkaminsk
Copy link
Member Author

rkaminsk commented Jul 24, 2022

I'm sorry if my answer sounded like criticism. It was not meant like this. Your investigation is spot on and greatly appreciated as it would probably have taken me another couple of weeks to dig deep enough. I'm also sure that your suggested fix would work in this context. It's just that your investigation made me realize that the existing BarrierSemaphore class is a failed abstraction that is too specific to be usable in a general context and too general for this particular context, which actually doesn't need a semaphore at all.

No worries. No criticism or upset feelings. I just did not expect a logical error from you (meant as high praise). But then parallel code is super hard. 😉

Thanks for taking the time to fix this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants