fix: keep all `tasks_total` member value changes safely wrapped inside the same mutex-protected zone. #70
…e the same mutex-protected zone.
The design assumes `tasks_total` is always in sync with the actual state of the queue+running tasks (or APIs like `get_tasks_running()` would be lying to the caller at some point in time) and having this one inside the mutex-protected zone keeps that assumption intact.
---
From: SHA-1: 12253b6
* fixes:
  - push_task(): document which (member) variables are under which mutex's overwatch and make sure the code matches this.
    + Case in point: the `tasks_total` counter MUST be kept in sync with the actual `tasks` queue size, hence it must be managed by the same mutex, or you will have situations where `get_tasks_running()` is lying to you and we CANNOT afford that.
    + Second case in point: [to-be-submitted]
Thanks for this pull request. So it looks like the only change you're making is moving the line […]
That would be extremely hard to trigger; this came out of code review: it's at least theoretically possible to "interrupt" (task switch & change state) between the […]
Making the above happen in practice is a statistics game, as the faulty behaviour is highly timing dependent. If you can follow and accept the reasoning as proof of correctness, that saves me a metric ton of additional effort, because making this happen dependably when you want it is hard.
Pulling that one inside the mutex makes the operation "updating the tasks queue and everything that represents its precise actual state" a guaranteed atomic operation.
As you know, when you're actively looking for trouble, it can be damn hard to trigger, so this is the sort of thing where software proofs (reasoning) have their use and testing alone does not suffice. (Software testing is needed and useful to make sure the software and hardware elements are actually acting according to spec, but a thread pool, or anything that incorporates mutexes etc., ideally comes with both a theoretical software proof and a tested working implementation, because no matter how much you test, you cannot dependably hit 100% of timing scenarios in reasonable time. Regrettably I don't have affordable access to software proof people and machinery, so 'brain' is what's available and has to cope & suffice for the daily job. 😉 And here we are.)
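To make the reasoning above concrete, here is a minimal sketch of the "one lock for the queue and its counter" idea. It is not the library's actual code: the class name `TaskQueueSketch`, the helpers `pop_task()` and `task_done()`, and the plain `std::size_t` counter are assumptions made for illustration; only the member names `tasks`, `tasks_total`, `push_task()` and `get_tasks_running()` come from this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical minimal queue, for illustration only (not the library's class).
class TaskQueueSketch
{
public:
    // The queue and the counter change under the same lock, so no observer
    // can see a state where one has been updated and the other has not.
    void push_task(std::function<void()> task)
    {
        const std::lock_guard<std::mutex> lock(queue_mutex);
        tasks.push(std::move(task));
        ++tasks_total;
    }

    // Move one task out of the queue; from now on it counts as "running".
    bool pop_task(std::function<void()>& task)
    {
        const std::lock_guard<std::mutex> lock(queue_mutex);
        if (tasks.empty())
            return false;
        task = std::move(tasks.front());
        tasks.pop();
        return true;
    }

    // Called by a worker once it has finished executing a popped task.
    void task_done()
    {
        const std::lock_guard<std::mutex> lock(queue_mutex);
        assert(tasks_total > tasks.size()); // at least one task must be running
        --tasks_total;
    }

    std::size_t get_tasks_total() const
    {
        const std::lock_guard<std::mutex> lock(queue_mutex);
        return tasks_total;
    }

    std::size_t get_tasks_queued() const
    {
        const std::lock_guard<std::mutex> lock(queue_mutex);
        return tasks.size();
    }

    // Both operands are read under one lock, so the difference is consistent.
    std::size_t get_tasks_running() const
    {
        const std::lock_guard<std::mutex> lock(queue_mutex);
        return tasks_total - tasks.size();
    }

private:
    mutable std::mutex queue_mutex;          // guards `tasks` AND `tasks_total`
    std::queue<std::function<void()>> tasks; // guarded by queue_mutex
    std::size_t tasks_total = 0;             // guarded by queue_mutex: queued + running
};
```

If the increment of `tasks_total` sat outside the locked region, a caller of `get_tasks_running()` could run between the increment and the push and be told that one task is running when none is. Inside the lock, that window does not exist.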
If I don't explain myself intelligibly enough for you, it might be handy for us to find a CS major who is better at formulating proper software proofs on this subject. I'm living this sort of stuff, but I'd have to really brush up on my math notation and correct academic jargon for this, as nobody around me ever required their answers from me in that 'language', so I've not exercised that ability (except for reading papers) in about 30 years. My apologies if that makes me sound 'slightly off'. If anyone else likes to chime in, I love to stand corrected on my analysis about […]
Summary: yes, that's the only code change. The comment/ddoc bit is as important AFAIAC, because that bit is there to assist in approaching a software proof of correctness as closely as I can get with reasonable (private personal) effort.
So here I am, wondering if you're willing to accept reasoning instead of a coded PoC, as that spares me a lot of additional effort that, for me at least, is a one-off expenditure. Do we have a trade? 😉
Thanks for your reply. I don't think a mathematical proof is needed here. This code is simple enough to follow. After thinking about this for a bit, I believe you are correct that it is theoretically better if the number of tasks stored in […]
The reason I put the line […]
However, moving […]
As for the comments, I don't believe they are necessary because there is only one mutex, so obviously that one mutex is responsible for locking everything that needs to be locked. If the class had two mutexes, it would be a different matter, but with only one mutex, such comments are redundant.
Thank you! On the "there's only one mutex":
^^^ That one is also the result of a top-to-bottom code review: I gave […]
I cannot, currently, reason/argue the separate/second mutex for […]
IIRC, I dumped that bit of change in #76, so that's where those comments come in. They helped me review the code, at least, also when it was still a single mutex, because I had a checklist alongside while inspecting and evaluating every line of code, so let's say they've become dear to me. 😉 Now, I think they may assist the next one coming through here like me, but that's all I can argue for them. A matter of taste, 's all. 🤷
Thank you for your perseverance, by the way. Much appreciated. (I've had conversations like these before; some ended in a stalemate or an exit before closure (by either). Glad that didn't happen.)
You said:
😄 I've learned through very hard (and painful/stressful) experience that nobody I've met can ultimately get away with that choice (choosing performance over theoretical correctness) with anything related to semaphores / multitasking. If it doesn't bite them in the ass soon enough, it'll at least bite the ass of their successor. Painfully. Too often I've arrived at the place where such a choice was made (or they failed to see they were working in this setting and thus made "stupid" (in 20/20 hindsight) design mistakes, for they simply didn't realize they had created a parallel system where now some kind of synchronization was required to complete it all). Been around the block: pharma, military, banking, everybody screws up everywhere. 😅 I try hard every day not to join the club. 😄
I don't understand why this is an issue. In […]
Haha, that's a good lesson indeed. To be clear, when I wrote that code, I didn't realize that it was not theoretically correct; if I did, I wouldn't have done this. I only realized this was incorrect as a result of our discussion.
Sure thing!
Update: this is now implemented in v3.4.0.
Describe the changes
fix: keep all `tasks_total` member value changes safely wrapped inside the same mutex-protected zone.
The design assumes `tasks_total` is always in sync with the actual state of the queue+running tasks (or APIs like `get_tasks_running()` would be lying to the caller at some point in time) and having this one inside the mutex-protected zone keeps that assumption intact.
From: SHA-1: 12253b6
Case in point: the `tasks_total` counter MUST be kept in sync with the actual `tasks` queue size, hence it must be managed by the same mutex, or you will have situations where `get_tasks_running()` is lying to you and we CANNOT afford that.
Testing
This was tested as part of a larger work (other PRs are forthcoming shortly) after hunting down shutdown issues (application lockups, etc.) in a large application.
Tested this code via your provided test code rig; see my own fork and the referenced commits which point into there.
Tested on AMD Ryzen 3700X, 128GB RAM, latest Win10/64, latest MSVC2019 dev environment, using in-house project files which use an (in-house) standardized set of optimizations.
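For readers who want to poke at the invariant themselves, a stress check along the following lines is one option. This is only a sketch and not part of the test rig mentioned above: `GuardedCounterQueue` and `count_violations()` are hypothetical names, and the queue is a stripped-down stand-in for the real pool. Producers only push and nothing ever pops, so the true number of running tasks is always zero; an observer thread counts every time `get_tasks_running()` claims otherwise.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the pool's bookkeeping (not the library's class).
class GuardedCounterQueue
{
public:
    void push_task(std::function<void()> task)
    {
        const std::lock_guard<std::mutex> lock(queue_mutex);
        tasks.push(std::move(task));
        ++tasks_total; // updated inside the same locked region as the queue
    }

    std::size_t get_tasks_running() const
    {
        const std::lock_guard<std::mutex> lock(queue_mutex);
        return tasks_total - tasks.size();
    }

private:
    mutable std::mutex queue_mutex;          // guards both members below
    std::queue<std::function<void()>> tasks;
    std::size_t tasks_total = 0;
};

// Returns how often the observer saw a non-zero "running" count while
// `producers` threads each pushed `pushes_each` tasks and nothing was popped.
std::size_t count_violations(unsigned producers, unsigned pushes_each)
{
    GuardedCounterQueue queue;
    std::atomic<bool> done{false};
    std::size_t violations = 0; // written only by the observer thread

    std::thread observer([&] {
        while (!done.load())
        {
            if (queue.get_tasks_running() != 0)
                ++violations;
        }
        if (queue.get_tasks_running() != 0) // one final check after the pushes
            ++violations;
    });

    std::vector<std::thread> workers;
    for (unsigned p = 0; p < producers; ++p)
    {
        workers.emplace_back([&queue, pushes_each] {
            for (unsigned i = 0; i < pushes_each; ++i)
                queue.push_task([] {});
        });
    }
    for (auto& worker : workers)
        worker.join();

    done.store(true);
    observer.join(); // after this join, reading `violations` is safe
    return violations;
}
```

With the counter updated under the lock, `count_violations(4, 2000)` returns 0 on every run. Note the limits of such a check: a clean run of a variant that increments the counter outside the lock would not prove that variant correct, since the bad interleaving is rare and timing dependent, which is why the argument in this PR rests on reasoning rather than on a reproduction.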
Additional information
TBD
The patches are hopefully largely self-explanatory. Where deemed useful, the original commit messages from the dev fork have been referenced and included.