[FEATURE][ML] Fix possible deadlock in thread pool shutdown #343
Conversation
lib/core/CStaticThreadPool.cc
Outdated
// execute its own shutdown message.
auto notIsEmpty = [](TTaskQueue& queue) { return queue.size() > 0; };
while (std::find_if(m_TaskQueues.begin(), m_TaskQueues.end(), notIsEmpty) !=
       m_TaskQueues.end()) {
There's currently nothing to stop new tasks being scheduled after this test succeeds.
This is only called in the destructor.
lib/core/CStaticThreadPool.cc
Outdated
auto notIsEmpty = [](TTaskQueue& queue) { return queue.size() > 0; };
while (std::find_if(m_TaskQueues.begin(), m_TaskQueues.end(), notIsEmpty) !=
       m_TaskQueues.end()) {
    std::this_thread::sleep_for(std::chrono::microseconds{50});
This seems like a code smell to me. We could commit it as a temporary fix to stop CI breaking, but I imagine there is a pattern for shutting down thread pools that doesn't involve sleeping.
I thought about this a bit...
There is one, if we modify the queue: I could add special-case handling to the queue which allows me to pass a message that notifies all consumer condition variables (blocked popping an empty queue) to wake up and exit.
I kind of like the idea of waiting until we've drained the queues. It means we don't complicate the queue and we don't have any extra logic running every time we push onto or pop from the queue. I don't think we'll need to start and stop thread pools for our use case (other than on main entry and exit). I could probably do this more cleanly with a condition variable (see the sketch below), but this was a quick fix for the tests on the branch. I'll leave a TODO to tidy this up.
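For reference, here is a minimal sketch of what a condition-variable-based drain might look like, assuming the pool tracked a count of outstanding tasks. The names (PendingTaskTracker, onTaskQueued, onTaskDone, waitUntilDrained) are hypothetical and not part of CStaticThreadPool; the sketch just illustrates the trade-off mentioned above: the wait no longer spins, but every push and every completed task pays for the extra bookkeeping.

#include <condition_variable>
#include <cstddef>
#include <mutex>

class PendingTaskTracker {
public:
    // Called by the producer when a task is pushed onto any queue.
    void onTaskQueued() {
        std::lock_guard<std::mutex> lock{m_Mutex};
        ++m_Pending;
    }

    // Called by a worker after it has finished executing a task.
    void onTaskDone() {
        std::lock_guard<std::mutex> lock{m_Mutex};
        if (--m_Pending == 0) {
            m_AllDone.notify_all();
        }
    }

    // Called from the destructor before enqueuing shutdown tasks: blocks
    // until every previously queued task has run, without busy waiting.
    void waitUntilDrained() {
        std::unique_lock<std::mutex> lock{m_Mutex};
        m_AllDone.wait(lock, [this] { return m_Pending == 0; });
    }

private:
    std::mutex m_Mutex;
    std::condition_variable m_AllDone;
    std::size_t m_Pending{0};
};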
@droberts195 further to your comments, I reworked this by adding the capability to bind a task to a specific queue. This means each queue will execute exactly one shutdown message, independent of scheduling, so will definitely shut down. I kept the "drain tasks" step before starting to shut down for the reason mentioned above. I also made a small optimisation which I noticed whilst testing. Can you take another look?
LGTM
There was an error in the shutdown of the thread pool. In particular, although threads process tasks from their own queue by preference, if there is a context switch while the main thread is adding shutdown messages, a thread which isn't waiting to pop its own queue and hasn't yet had its shutdown message added can steal another queue's shutdown task. The thread whose shutdown task is stolen will then deadlock waiting for its last message if and only if its queue was already empty.

This PR changes shutdown to wait until all tasks have been processed before enqueuing the shutdown tasks. This ensures that every thread is blocked waiting to pop its own queue and will process only its own shutdown message and exit. This also means we get better concurrency draining the queues while shutting down, and I tightened up one of the test thresholds as a result; see below. This issue is difficult to test without inserting waits into both the add-shutdown-task loop and the steal-task loop, but I reproduced it manually and it has been occurring intermittently in the integration tests.
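To make the description above concrete, here is a minimal, self-contained sketch of the shutdown ordering, not the actual CStaticThreadPool code: all names (SketchThreadPool, schedule, the nested Queue) are hypothetical, and task stealing is omitted for brevity. The destructor first waits for the queues to drain, then pushes exactly one shutdown sentinel directly onto each queue so it cannot land anywhere else, and finally joins the workers.

#include <algorithm>
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class SketchThreadPool {
public:
    using TTask = std::function<void()>;

    explicit SketchThreadPool(std::size_t threads) : m_Queues(threads) {
        for (std::size_t i = 0; i < threads; ++i) {
            m_Workers.emplace_back([this, i] { this->worker(i); });
        }
    }

    ~SketchThreadPool() {
        // 1) Drain: wait until every queue is empty, mirroring the
        //    sleep-based drain in the PR.
        auto notEmpty = [](Queue& queue) { return queue.size() > 0; };
        while (std::any_of(m_Queues.begin(), m_Queues.end(), notEmpty)) {
            std::this_thread::sleep_for(std::chrono::microseconds{50});
        }
        // 2) Bind one shutdown sentinel to each specific queue: worker i
        //    pops only from queue i here, so it sees exactly one sentinel.
        for (auto& queue : m_Queues) {
            queue.push(TTask{}); // empty task acts as the shutdown sentinel
        }
        // 3) Join the workers.
        for (auto& thread : m_Workers) {
            thread.join();
        }
    }

    // Round-robin scheduling of ordinary tasks.
    void schedule(TTask task) {
        m_Queues[m_Next++ % m_Queues.size()].push(std::move(task));
    }

private:
    // A tiny blocking queue; the real pool uses its own TTaskQueue type.
    class Queue {
    public:
        void push(TTask task) {
            std::lock_guard<std::mutex> lock{m_Mutex};
            m_Tasks.push_back(std::move(task));
            m_NotEmpty.notify_one();
        }
        TTask pop() {
            std::unique_lock<std::mutex> lock{m_Mutex};
            m_NotEmpty.wait(lock, [this] { return !m_Tasks.empty(); });
            TTask task{std::move(m_Tasks.front())};
            m_Tasks.pop_front();
            return task;
        }
        std::size_t size() {
            std::lock_guard<std::mutex> lock{m_Mutex};
            return m_Tasks.size();
        }

    private:
        std::mutex m_Mutex;
        std::condition_variable m_NotEmpty;
        std::deque<TTask> m_Tasks;
    };

    void worker(std::size_t i) {
        for (;;) {
            TTask task{m_Queues[i].pop()};
            if (!task) { // shutdown sentinel received: exit the worker
                return;
            }
            task();
        }
    }

    std::vector<Queue> m_Queues;
    std::vector<std::thread> m_Workers;
    std::atomic<std::size_t> m_Next{0};
};

In the real pool, where idle workers can also steal from other queues, the drain step is what ensures every worker is popping its own queue by the time the sentinels are pushed, and binding each sentinel to a specific queue ensures every worker receives exactly one of them.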