Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: keep all tasks_total member value changes safely wrapped inside the same mutex-protected zone. #70

Closed
wants to merge 1 commit into from

Conversation

GerHobbelt
Copy link

@GerHobbelt GerHobbelt commented Aug 21, 2022

Describe the changes

fix: keep all tasks_total member value changes safely wrapped inside the same mutex-protected zone. The design assumes tasks_total is always in sync with the actual state of the queue+running tasks (or APIs like get_tasks_running() would be lying to the caller at some point in time) and having this one inside the mutex-protected zone keeps that assumption intact.


From: SHA-1: 12253b6

  • fixes:
  • push_task(): document which (member) variables are under which mutex' overwatch and make sure the code matches this.
    • Case in point: the tasks_total counter MUST be kept in sync with the actual tasks queue size hence it must be managed by the same mutex, or you will have situations where get_tasks_running() is lying to you and we CANNOT afford that.
    • Second case in point: [to-be-submitted]

Testing

This was tested as part of a larger work (other PRs are forthcoming shortly) after hunting down shutdown issues (application lockups, etc.) in a large application.

Tested this code via your provided test code rig; see my own fork and the referenced commits which point into there.

Tested on AMD Ryzen 3700X, 128GB RAM, latest Win10/64, latest MSVC2019 dev environment. Using in-house project files which use a (in-house) standardized set of optimizations.

Additional information

TBD

The patches are hopefully largely self-explanatory. Where deemed useful, the original commit messages from the dev fork have been referenced and included.

…e the same mutex-protected zone. The design assumes `tasks_total` is always in sync with the actual state of the queue+running tasks (or APIs like `get_tasks_running()` would be lying to the caller at some point in time) and having this one inside the mutex-protected zone keeps that assumption intact.

---

From: SHA-1: 12253b6

* fixes:
- push_task(): document which (member) variables are under which mutex' overwatch and make sure the code matches this.
  + Case in point: the `tasks_total` counter MUST be kept in sync with the actual `tasks` queue size hence it must be managed by the same mutex, or you will have situations where `get_tasks_running()` is lying to you and we CANNOT afford that.
  + Second case in point: [to-be-sumbmitted]
@bshoshany
Copy link
Owner

Thanks for this pull request. So it looks like the only change you're making is moving the line ++tasks_total; into the lock block in push_task(). I don't see any issues with making this change, but I'm also hesitant to change something this fundamental to the library without making 100% sure that the change is necessary. I have done extensive testing and never found any synchronization issues between the the actual number of tasks in the queue and the value of tasks_total. Are you able to provide a minimal working example demonstrating that such synchronization issues exist?

@GerHobbelt
Copy link
Author

GerHobbelt commented Aug 24, 2022

That would be extremely hard to trigger; this came out of code review: it's at least theoretically possible to "interrupt" (task switch & change state) between the tasks_total update and the queue update itself as the two statements -- as they were -- are a non-atomic operation and therefore [*] cause a (temporary) out-of-synchronization state where tasks_total is not exactly representing the state of the tasks queue (+ number of executing tasks), thus producing [*] slightly incorrect numbers elsewhere in the code where tasks_total is used for condition checks, etc., thus making that part of code [*] misbehave as the .wait conditions implicitly assume tasks_total is always correct about the current state of the queue, i.e. tasks_count + tasks is an atomic unit.

Everywhere where I wrote '[*]' above, it should read '[potentially]'; without that it's more readable describing the fault situation. timing, thus circumstances, decides whether this fault situation actually occurs. Given the distance between the parts, changes are slim to nearly none (but NOT ZERO) for the fault situation to occur.

Making the above happen in practice is a statistics game as the above faulty behaviour is highly timing dependent. If you can follow and accept the reasoning as proof of correctness, that saves me a metric ton of additional effort, because making this happen dependably when you want it is hard.

Pulling that one inside the mutex makes the operation "updating the tasks queue and all that represent its precise actual state" a guaranteed atomic operation.

As you know, when you're actively looking for trouble, it can be damn hard to trigger so this is the sort of thing where software proofs (reasoning) have their use and testing alone does not suffice. (Software testing is needed and useful to make sure the software and hardware elements are actually acting according to spec, but a thread pool or anything that incorporates mutexes, etc. ideally come with both a theoretical software proof and a tested working implementation. Because no matter how much you test, you cannot dependably hit 100% of timing scenarios in reasonable time. Regrettably I don't have affordable access to software proof people and machinery, so 'brain' is what's available and has to cope & suffice for the daily job. 😉 And here we are.)

This is also why I wrote down which variables were managed by which mutex: the assumption of the thread pool design is that tasks & tasks_total are an atomic unit, i.e. all changes to them act like they're an atomic operation. The mutex takes care of that, task_available_cv is the condition variable that drives this atomic change of system state and is therefor managed under the same mutex (task_available_cv.wait(lock)).

Having tasks_total updated outside the mutex-protected atomic operation breaks the atomicity and thus the fundamentally correct operation. The atomicity is correct (in this regard) with this change.

Another theoretical bug is using the same mutex for the counter-traveling CV, hence the need for two mutexes in total: task_done_cv (rather: task_done_cv.wait()) should have its own mutex. But that's not in this PR.

If I don't explain myself inteligibly enough for you, it might be handy for us to find a CS major who is better at formulating proper software proofs on this subject. I'm living this sort of stuff, but I'd have to really brush up on my math notation and correct academic jargon for this as nobody around me ever required their answers from me in that 'language', so I've not exercised that ability (except for reading papers) in about 30 years. My apologies if that makes me sound 'slightly off'. If anyone else likes to chime in, I love to stand corrected on my analysis about tasks_count (or any other part I touched); I can ill afford mistakes so if I'm wrong about something here, that implies I must learn and improve immediately.

Summary: yes, that the only code change. The comment/ddoc bit is as important AFAIAC, because that bit is there to assist in approaching a software proof of correctness as close as I can get it with reasonable (private personal) effort.


So here I am, wondering if you're willing to accept reasoning instead of coded PoC as that spares me a lot of additional effort that, for me at least, is a one-off expenditure. Do we have a trade? 😉

@bshoshany
Copy link
Owner

bshoshany commented Aug 24, 2022

Thanks for your reply. I don't think a mathematical proof is needed here. This code is simple enough to follow.

After thinking about this for a bit, I believe you are correct that it is theoretically better if the number of tasks stored in tasks_total is updated while the mutex is still locked. The situation this avoids is one where the number of running tasks is changed after the lock is unlocked but before the line ++tasks_total; executes. This would happen if another task finishes in the few microseconds between these two lines, which should be very rare but still possible.

The reason I put the line ++tasks_total; outside the lock is performance. When tasks_mutex is locked, the workers cannot pop additional tasks from the queue, so I try to keep the mutexes locked for as little time as possible.

However, moving ++tasks_total; into the lock shouldn't make much of a difference, and it guarantees that everything is perfectly synchronized 100% of the time, so I will make that change in the next release. Thanks for the suggestion!

As for the comments, I don't believe they are necessary because there is only one mutex, so obviously that one mutex is responsible for locking everything that needs to be locked. If the class had two mutexes, it would be a different matter, but with only one mutex, such comments are redundant.

@bshoshany bshoshany closed this Aug 24, 2022
@GerHobbelt
Copy link
Author

Thank you!

On the "there's only one mutex":

Another theoretical bug is using the same mutex for the counter-traveling CV, hence the need for two mutexes in total: task_done_cv (rather: task_done_cv.wait()) should have its own mutex. But that's not in this PR.

^^^ That one is also result from top to bottom code review: I gave task_done_cv its own mutex as a general rule, signals traveling in opposite directions must have independent mutexes or you risk deadlock, where both parties wait for each other's .notify. I did that while going through your code (and looking for any kind of risk that could possibly contribute to the chaos I was debugging in the bigger system then). Here I couldn't counter-argue the general rule, so I kept the second mutex.

I cannot, currently, reason/argue the separate/second mutex for task_done.cv.wait() can be optimized out and still provide a 100% deadlock-riskfree rig for every use, so I myself then choose to stick with the safer option and doublecheck that one for correctness; I cannot come up with a fault situation for that, so currently I'm still in favor of the 2-mutexes setup.

IIRC, I dumped that bit of change in #76, so that's where those comments come in. They helped me review the code, at least, also when it was still a single mutex, because I had a checklist alongside while inspecting and evaluating every line of code, so let's say they've become dear to me. 😉 Now, I think they may assist the next one coming through here like me, but that's all I can argue for them. A matter of taste, 's all. 🤷

Thank you for your perseverance, by the way. Much appreciated. (I've had conversations like these before, some ended in a stallmate or exit before closure (by either). Glad that didn't happen.)

You said:

The reason I put the line ++tasks_total; outside the lock is performance. When tasks_mutex is locked, the workers cannot pop additional tasks from the queue, so I try to keep the mutexes locked for as little time as possible.

😄 I've learned through very hard (and painful/stressful) experience that nobody I've met can ultimately get away with that choice (choosing performance over theoretical correctness) with anything related to semaphores / multitasking. If it doesn't bite them in the ass soon enough, it'll at least bite the ass of their successor. Painfully. Too often I've arrived at the place where such a choice was made (or they failed to see they were working in this setting and thus made "stupid" (in 20/20 hindsight) design mistakes, for they simply didn't realize they had created a parallel system where now some kind of synchronization was required to complete it all. Been around the block, pharma, military, banking, everybody screws up everywhere. 😅 I try hard every day not to join the club. 😄

@bshoshany
Copy link
Owner

Another theoretical bug is using the same mutex for the counter-traveling CV, hence the need for two mutexes in total: task_done_cv (rather: task_done_cv.wait()) should have its own mutex. But that's not in this PR.

I don't understand why this is an issue. In wait_for_tasks(), task_done_cv.wait() uses tasks_lock, but note that std::condition_variable only acquires the lock when it is notified, the lock is released while it's waiting. The reason I used the same mutex is that I don't want additional tasks to be added or removed from the queue while the condition variable is being notified, since it needs to check the condition tasks_total == 0 and I don't want that condition to change between the point where the condition variable awakened and the point where the condition is checked. So I think this is the correct way to do this, but if I'm wrong, please explain why.

I've learned through very hard (and painful/stressful) experience that nobody I've met can ultimately get away with that choice (choosing performance over theoretical correctness) with anything related to semaphores / multitasking.

Haha, that's a good lesson indeed. To be clear, when I wrote that code, I didn't realize that it was not theoretically correct; if I did, I wouldn't have done this. I only realized this was incorrect as a result of our discussion.

Thank you for your perseverance, by the way. Much appreciated.

Sure thing!

@bshoshany
Copy link
Owner

Update: this is now implemented in v3.4.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants