Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Deadlock on pool exit #93

Closed
alugowski opened this issue Jan 10, 2023 · 5 comments
Closed

[BUG] Deadlock on pool exit #93

alugowski opened this issue Jan 10, 2023 · 5 comments
Assignees
Labels
bug Something isn't working

Comments

@alugowski
Copy link

I'm seeing a deadlock upon thread pool destruction:

Screenshot 2023-01-09 at 8 37 07 PM

Screenshot 2023-01-09 at 8 38 00 PM

My workload creates a thread pool for one expensive operation. Tasks are not created all at once, instead more are created as old ones finish. wait_for_tasks() makes no difference.

There appears to be one worker thread left waiting for the condition variable as the thread is being joined.

I haven't managed to reproduce it with a small program yet. This is one actual use that suffers the issue:
https://github.com/alugowski/fast_matrix_market/blob/main/include/fast_matrix_market/write_body_threads.hpp

That code is exercised by my unit tests. If I loop a unit test a few thousand times then it nearly always deadlocks.

Workaround:
The following hack works around the issue. Replace the worker thread's wait() with a wait_until and a duration. This way it won't deadlock for more than about 50ms. The break condition must be rechecked afterwards.

task_available_cv.wait_until(tasks_lock, std::chrono::system_clock::now() + 50ms, [this] { return !tasks.empty() || !running; });
if (running)
{
    if (tasks.empty()) {
        continue;
    }

Sounds related to #76.

  • CPU model, architecture, # of cores and threads: M1 Pro, 8 cores
  • Operating system: macOS 13
  • Name and version of C++ compiler: Clang 15 from homebrew
  • Thread pool library version: 3.3.0 (light version)
@alugowski alugowski added the bug Something isn't working label Jan 10, 2023
@alugowski
Copy link
Author

The likely cause is that there is a gap in worker() between checking the exit condition and waiting on task_available_cv. If destroy_threads() happens to call task_available_cv.notify_all() at that time then the worker will have missed the notification and will deadlock.

while (running)
{
    std::function<void()> task;
/////// if running set to false and task_available_cv.notify_all() called here then deadlock
    std::unique_lock<std::mutex> tasks_lock(tasks_mutex);
    task_available_cv.wait(tasks_lock, [this] { return !tasks.empty() || !running; });
    if (running)

@bshoshany
Copy link
Owner

Hi @alugowski, thanks for opening this issue. Sorry it took me so long to reply. I'm currently on hiatus from developing this package, since I'm too busy teaching. However, I plan to release a new version in the summer.

It would be very useful if I could reproduce this issue on my own system so I can see what exactly is going on. Can you tell me exactly which test in your fast_matrix_market repository results in a deadlock?

Like I said, I'm currently on hiatus, but will try to take a look if I can find the time. Thanks!

@alugowski
Copy link
Author

@bshoshany no worries, I have a working workaround so I'm not blocked.

The issue is very intermittent and any test can trigger it. I saw it somewhat regularly when I ran the entire test suite, mostly because each run of the test suite would create and destroy several hundred thread pools. With the number of tests I have now, maybe you'd expect to see a freeze half the time? I mean load the project in CLion and "Run All Tests". Note that fast_matrix_market includes my workaround, so if you'd like to experiment there you'll have to restore your stock code in /include/fast_matrix_market/3rdparty/BS_thread_pool_light.hpp.

Before you dig in there, do you have any tests that just create a pool, load it with trivial dummy work, and destroy? And just loop that tens of thousands of times? I imagine that would show the issue as well. My code isn't doing anything weird.

@bshoshany
Copy link
Owner

Thanks for the information. I'll look into it, but like I said, it might take some time. I'll be in touch!

@bshoshany
Copy link
Owner

Closed as resolved by PR #108 (will be included in v3.4.0).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants