
[FEATURE][ML] Fix possible deadlock in thread pool shutdown #343

Merged

Conversation

@tveasey (Contributor) commented Dec 14, 2018

There was an error in the shutdown of the thread pool. Threads process tasks from their own queue by preference, but they can steal work from other queues. If a context switch occurs while the main thread is adding shutdown messages, a thread that isn't blocked waiting to pop its own queue and hasn't yet had its shutdown message added can steal another queue's shutdown task. The thread whose shutdown task was stolen then deadlocks waiting for its last message, if and only if its own queue was already empty.

This PR changes shutdown to wait until all tasks have been processed before enqueuing the shutdown tasks. This ensures that every thread is blocked waiting to pop its own queue, so each thread processes only its own shutdown message and exits. It also means we get better concurrency draining the queues while shutting down, and I tightened up one of the test thresholds as a result. See below.

This issue is difficult to test without inserting waits into both the add-shutdown-task loop and the steal-task loop, but I reproduced it manually and it has been occurring intermittently in integration tests.
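For reference, here is a minimal, self-contained sketch of the drain-then-shutdown sequence described above. This is not the ml-cpp implementation: CSimplePool and all of its members are invented for illustration, and the work-stealing path that triggered the original deadlock is omitted.

```cpp
#include <algorithm>
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class CSimplePool {
public:
    using TTask = std::function<void()>;

    explicit CSimplePool(std::size_t threads) : m_Queues(threads) {
        for (std::size_t i = 0; i < threads; ++i) {
            m_Threads.emplace_back([this, i] { this->worker(i); });
        }
    }

    ~CSimplePool() {
        // 1) Wait until every queue is empty, so each worker is blocked
        //    popping its *own* queue and cannot steal anything.
        auto notEmpty = [](SQueue& queue) {
            std::lock_guard<std::mutex> lock{queue.s_Mutex};
            return queue.s_Tasks.empty() == false;
        };
        while (std::any_of(m_Queues.begin(), m_Queues.end(), notEmpty)) {
            std::this_thread::sleep_for(std::chrono::microseconds{50});
        }
        // 2) Now add exactly one shutdown message per queue: each worker pops
        //    only its own shutdown message and exits, so no thread is left
        //    waiting forever for a message that was stolen.
        for (auto& queue : m_Queues) {
            push(queue, [this] { m_Shutdown.store(true); });
        }
        for (auto& thread : m_Threads) {
            thread.join();
        }
    }

    // Schedule a task on queue i (round-robin selection is omitted here).
    void schedule(std::size_t i, TTask task) {
        push(m_Queues[i], std::move(task));
    }

private:
    struct SQueue {
        std::mutex s_Mutex;
        std::condition_variable s_Ready;
        std::deque<TTask> s_Tasks;
    };

    static void push(SQueue& queue, TTask task) {
        {
            std::lock_guard<std::mutex> lock{queue.s_Mutex};
            queue.s_Tasks.push_back(std::move(task));
        }
        queue.s_Ready.notify_one();
    }

    void worker(std::size_t i) {
        SQueue& queue = m_Queues[i];
        while (m_Shutdown.load() == false) {
            std::unique_lock<std::mutex> lock{queue.s_Mutex};
            queue.s_Ready.wait(lock, [&queue] { return queue.s_Tasks.empty() == false; });
            TTask task = std::move(queue.s_Tasks.front());
            queue.s_Tasks.pop_front();
            lock.unlock();
            task();
        }
    }

    std::atomic_bool m_Shutdown{false}; // set by the per-queue shutdown message
    std::vector<SQueue> m_Queues;
    std::vector<std::thread> m_Threads;
};
```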

```cpp
// execute its own shutdown message.
auto notIsEmpty = [](TTaskQueue& queue) { return queue.size() > 0; };
while (std::find_if(m_TaskQueues.begin(), m_TaskQueues.end(), notIsEmpty) !=
       m_TaskQueues.end()) {
```
Contributor

There's currently nothing to stop new tasks being scheduled after this test succeeds.

Contributor Author

This is only called in the destructor.

```cpp
auto notIsEmpty = [](TTaskQueue& queue) { return queue.size() > 0; };
while (std::find_if(m_TaskQueues.begin(), m_TaskQueues.end(), notIsEmpty) !=
       m_TaskQueues.end()) {
    std::this_thread::sleep_for(std::chrono::microseconds{50});
```
Contributor

This seems like a code smell to me. We could commit it as a temporary fix to stop CI breaking, but I imagine there is a pattern for shutting down threadpools that doesn't involve sleeping.

@tveasey (Contributor Author) commented Dec 14, 2018

I thought about this a bit...

There is, if we modify the queue: I could add special-case handling which passes a message that notifies all consumers blocked popping an empty queue to wake up and exit.

I kind of like the idea of waiting until we've drained the queues. It means we don't complicate the queue and we don't add extra logic to every push and pop. I don't think we'll need to start and stop thread pools for our use case (other than on main entry and exit). I could probably do this more cleanly with a condition variable, but this was a quick fix for the tests on the branch. I'll leave a TODO to tidy this up.
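For what it's worth, a rough sketch of the condition-variable shape this alludes to, assuming no new work is scheduled once shutdown has started. The class and method names (CPendingTasks, taskScheduled, taskCompleted, waitUntilAllCompleted) are hypothetical and not code from this PR: the pool counts outstanding tasks and shutdown waits for the count to hit zero instead of polling with sleep_for.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical sketch only: a pending-task counter that shutdown could wait
// on instead of polling the queues with a sleep loop.
class CPendingTasks {
public:
    // Called whenever a task is pushed onto any queue.
    void taskScheduled() {
        std::lock_guard<std::mutex> lock{m_Mutex};
        ++m_Pending;
    }

    // Called by a worker after it has finished executing a task.
    void taskCompleted() {
        std::lock_guard<std::mutex> lock{m_Mutex};
        if (--m_Pending == 0) {
            m_AllDone.notify_all();
        }
    }

    // Called by shutdown: blocks until every scheduled task has completed,
    // replacing the 50 microsecond polling loop.
    void waitUntilAllCompleted() {
        std::unique_lock<std::mutex> lock{m_Mutex};
        m_AllDone.wait(lock, [this] { return m_Pending == 0; });
    }

private:
    std::mutex m_Mutex;
    std::condition_variable m_AllDone;
    std::size_t m_Pending{0};
};
```

The tradeoff is exactly the one mentioned above: the counter has to be touched on every push and every completed task, which adds a little work to the hot path that the polling version avoids.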

@tveasey (Contributor Author) commented Dec 16, 2018

@droberts195 further to your comments, I reworked this by adding the capability to bind a task to a specific queue. This means each queue will execute exactly one shutdown message, independent of scheduling, so will definitely shut down. I kept the "drain tasks" step before starting to shut down, for the reason mentioned above. I also made a small optimisation which I noticed whilst testing. Can you take another look?
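For readers following along, a rough sketch of what binding a task to a specific queue might look like in a work-stealing design with a deque per thread. STask, popOwn and steal are hypothetical names, locking is omitted, and this is not the code added in the PR; the point is only that a bound task is invisible to stealing, so each queue's shutdown message can be executed solely by its owning thread.

```cpp
#include <deque>
#include <functional>
#include <optional>
#include <utility>

// A task optionally bound to the queue it was pushed onto, e.g. a shutdown
// message. (Synchronisation is omitted for brevity.)
struct STask {
    std::function<void()> s_Function;
    bool s_BoundToQueue{false};
};

// The owning thread may pop anything from the front of its own queue,
// including bound tasks such as its shutdown message.
std::optional<STask> popOwn(std::deque<STask>& queue) {
    if (queue.empty()) {
        return std::nullopt;
    }
    STask task{std::move(queue.front())};
    queue.pop_front();
    return task;
}

// A stealing thread takes from the back of another queue, but never takes a
// task which is bound to that queue's owner.
std::optional<STask> steal(std::deque<STask>& queue) {
    if (queue.empty() || queue.back().s_BoundToQueue) {
        return std::nullopt;
    }
    STask task{std::move(queue.back())};
    queue.pop_back();
    return task;
}
```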

@droberts195 (Contributor) left a comment

LGTM

@tveasey tveasey merged commit f60c19d into elastic:feature/analysis-pipeline Dec 16, 2018
@tveasey tveasey deleted the bug/thread-pool-shutdown branch May 1, 2019 15:00