Fix #2641, #2535, and `shutdown_timeout` blocking for no reason #2649
Conversation
The `JoinHandle`s of threads created by the blocking pool are now tracked and properly joined at shutdown. If a thread does not return within the timeout, it is not joined and is left to the OS for cleanup.
In the threaded runtime, the unparker now owns a weak reference to the inner data. This breaks the `Arc` cycle and properly releases the IO driver and its worker threads.
Oh well, I just found through the CI that …
I can't review all parts of this PR, but I have two comments:
As I understand it, this PR is blocked on loom getting a …
@Darksonn Indeed. Also, there's still one test that blocks on Windows that I wasn't able to fix (…)
Overall, this looks correct to me, although adding a mock `Weak` to loom is, of course, a blocker. I had a couple of minor suggestions.
Hi! Thanks for this work. I apologize for the delay in responding; I've been heads down in a big chunk of work. Regarding the time driver cycle bug: I would prefer avoiding the weak ref, as it would add a bunch of upgrades in a hot path. Instead, we can add a …
@emgre Can you look into @carllerche's suggestion? If we don't need the weak ref, then loom isn't an issue and we're unblocked.
…sue-2641 (conflicts: tokio/src/runtime/blocking/shutdown.rs)
I implemented what @carllerche suggested and it fixes the leak issues we have in our codebase. However, I'm still having trouble with …
cc @jadamcrain
@emgre Ah, it looks like there was a … It's a little weird that it does compile for you locally --- you wouldn't happen to have a git dependency on …
@hawkw I had …
I think the new version of loom is out now.
The CI is now green ✔. The only remaining thing to fix is this test that currently deadlocks on Windows: `tokio/tests/rt_common.rs`, line 873 at d5507b3.
I worked on it this morning without success 😓
I was able to reproduce the issue with the thread-local storage destructor not returning from a …
Looking good 👍 thanks for sticking with this. I added some comments, but I think this is getting close.
To fix formatting, run: …
Thanks! One last tweak and I think we should be good to go 👍
Thanks for sticking with it 👍
Summary: Updating to a newer version of Tokio breaks hgcli. This seems to be happening because Tokio fixed a bug where shutting down your runtime could leak threads (tokio-rs/tokio#2649), which results in the runtime refusing to shut down if you have a thread waiting to read on stdin (note: I didn't try to back that out specifically to confirm). Unfortunately, we have that in hgcli, because we usually kick off a stdin read before we know whether we'll need the data. Here's the backtrace for a blocked shutdown: P163996251. We were already trying to shut down the runtime and ignore all outstanding work (because we ensure earlier that there is nothing outstanding), so let's just try even harder.

Reviewed By: StanislavGlebik
Differential Revision: D25952430
fbshipit-source-id: ae3a1413790cf81a5d990220f2dc5d4599219f73
Previously, the pool did not keep the `JoinHandle` of its threads. Now, they are kept in a vector and properly joined at shutdown. If the thread does not return within the timeout specified, the thread is not joined and left to cleanup by the OS.

⚠ The tests `rt_common threaded_scheduler_1_thread::runtime_in_thread_local` and `threaded_scheduler_4_threads::runtime_in_thread_local` now block indefinitely on Windows for some unknown reason. Upon investigation, it seems like the thread properly returns, so my only guess is that there is a deadlock in one of the destructors of the variables in the thread-local storage. That would explain why it only affects Windows, since on Unix systems the destructors of TLS variables do not seem to get called.

There was a cycle of `Arc` that prevented the IO driver and the worker threads from being freed. There is a `RefCell` of `Arc` in that part of tokio, and it worries me.

`shutdown_timeout` would unnecessarily block and leave the threads running when no job was dispatched to the queue. It only needed to notify the threads to wake them up. When dropping the runtime, the notification is sent by the `Drop` of `ThreadPool`, so it properly shuts down.

⚠ The test `tcp_into_split drop_write` blocks indefinitely on Windows for unknown reasons. However, this was not introduced by my changes; I have the same behaviour on the `master` branch. (This was fixed in chore: fix windows CI #2597.)

I divided my changes into three commits for easier reading.