
Fix #2641, #2535, and shutdown_timeout blocking for no reason #2649

Merged (18 commits into tokio-rs:master) on Jul 29, 2020

Conversation

@emgre (Contributor) commented on Jul 8, 2020

  • Issue Hello world examples non-deterministically leak under multi-threaded runtime #2641 was caused by the fact that the thread pool didn't keep the JoinHandles of its threads. Now they are kept in a vector and properly joined at shutdown; see the sketch at the end of this description. If a thread does not return within the specified timeout, it is not joined and is left to the OS for cleanup.
    • ⚠ The tests rt_common threaded_scheduler_1_thread::runtime_in_thread_local and threaded_scheduler_4_threads::runtime_in_thread_local now block indefinitely on Windows for some unknown reason. Upon investigation, it seems like the thread properly returns, so my only guess is that there is a deadlock in one of the destructors of the variables in thread-local storage. That would explain why it only affects Windows, since on unix systems the destructors of TLS variables do not seem to get called.
  • Issue Dropping threaded runtime with time and IO enabled results in memory leaks #2535 was caused by a cyclic Arc that prevented the IO driver and the worker threads from being freed.
    • ⚠ My solution was the easiest fix I could find, but a second look from someone with more knowledge of the codebase is needed. There are a lot of RefCells of Arcs in that part of tokio, and it worries me.
  • There was another issue I found while making these changes, but I don't think it's in the issue tracker. shutdown_timeout would unnecessarily block and leave the threads running when no job had been dispatched to the queue; it only needed to notify the threads to wake them up. When dropping the runtime, that notification is sent by the Drop of ThreadPool, so it shuts down properly.
  • ⚠ The test tcp_into_split drop_write blocks indefinitely on Windows for unknown reasons. However, this was not introduced by my changes; I see the same behaviour on the master branch. This was fixed in chore: fix windows CI #2597.

I divided my changes into three commits for easier reading.
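
To make the JoinHandle point concrete, here is a rough, hypothetical sketch of the idea (illustrative names only, not tokio's actual implementation): keep the worker JoinHandles in a vector and, at shutdown, join them only until a deadline passes. The real change also notifies the workers before joining, which is what fixes the shutdown_timeout point above.

```rust
use std::thread::JoinHandle;
use std::time::{Duration, Instant};

// Illustrative pool that owns the JoinHandles of its worker threads.
struct Pool {
    workers: Vec<JoinHandle<()>>,
}

impl Pool {
    // Join workers until the deadline passes; any thread that has not
    // finished by then is not joined and is left for the OS to clean up.
    fn shutdown(&mut self, timeout: Duration) {
        let deadline = Instant::now() + timeout;
        for handle in self.workers.drain(..) {
            if Instant::now() >= deadline {
                break;
            }
            // std's JoinHandle has no join-with-timeout, so this sketch
            // only checks the deadline between joins rather than bounding
            // each individual join.
            let _ = handle.join();
        }
    }
}
```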

emgre added 4 commits July 8, 2020 10:17
The JoinHandles of threads created by the pool are now tracked and properly joined at shutdown. If a thread does not return within the timeout, it is not joined and is left to the OS for cleanup.
In the threaded runtime, the unparker now owns a weak reference to the inner data. This breaks the Arc cycle and properly releases the IO driver and its worker threads.
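
Purely to illustrate that commit message (generic, hypothetical types, not tokio's actual ones), this is how holding a Weak on one side breaks an Arc cycle so the allocation can be freed; note that the review below ends up preferring a different fix to avoid upgrade costs on a hot path.

```rust
use std::sync::{Arc, Mutex, Weak};

// Hypothetical shared pool state.
struct Inner {
    unparker: Mutex<Option<Unparker>>,
}

// Hypothetical unparker handle: it points back at the shared state with a
// Weak instead of an Arc, so Inner -> Unparker -> Inner is no longer a
// strong cycle.
struct Unparker {
    inner: Weak<Inner>,
}

impl Unparker {
    fn unpark(&self) {
        // Upgrading costs an atomic operation; if the pool is already gone,
        // there is nothing to wake up.
        if let Some(_inner) = self.inner.upgrade() {
            // wake a worker here
        }
    }
}

fn main() {
    let inner = Arc::new(Inner { unparker: Mutex::new(None) });
    let unparker = Unparker { inner: Arc::downgrade(&inner) };
    *inner.unparker.lock().unwrap() = Some(unparker);
    if let Some(u) = inner.unparker.lock().unwrap().as_ref() {
        u.unpark();
    }
    // With a Weak back-reference, dropping the last strong Arc actually
    // frees Inner instead of leaking it through the cycle.
    drop(inner);
}
```
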
@emgre (Contributor, Author) commented on Jul 8, 2020

Oh well, I just found out through the CI that loom doesn't have Weak... Also, some CI jobs failed because crates.io returned a 504.

@Darksonn (Contributor) left a review comment

I can't review all parts of this PR, but I have two comments:

Review threads (resolved): tokio/src/runtime/blocking/pool.rs, tokio/tests/tcp_into_split.rs
@Darksonn added the A-tokio (Area: The main tokio crate), C-enhancement (Category: A PR with an enhancement or bugfix), and M-runtime (Module: tokio/runtime) labels on Jul 20, 2020
@Darksonn (Contributor) commented:

As I understand it, this PR is blocked on loom getting a Weak type?

@emgre (Contributor, Author) commented on Jul 20, 2020

@Darksonn Indeed. Also, there's still one test that blocks on Windows that I wasn't able to fix (rt_common threaded_scheduler_1_thread::runtime_in_thread_local).

@Darksonn added the S-blocked (Status: marked as blocked ❌ on something else such as a PR or other implementation work) label on Jul 20, 2020
@hawkw (Member) left a review comment

Overall, this looks correct to me, although adding a mock Weak to loom is, of course, a blocker.

I had a couple minor suggestions.

Review threads (resolved): tokio/tests/tcp_into_split.rs, tokio/src/runtime/blocking/shutdown.rs, tokio/src/runtime/blocking/pool.rs
@carllerche (Member) commented:

Hi! Thanks for this work. I apologize for the delay in responding. I've been heads down in a big chunk of work.

Regarding the time driver cycle bug: I would prefer to avoid the weak ref, as it would add a bunch of upgrades in a hot path. Instead, we can add a shutdown(&mut self) fn to Park (here). For the time driver, this would be implemented by dropping all entries in the ring, which in turn will drop the wakers. This should break the cycle.
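
A rough sketch of what that suggestion could look like (simplified and hypothetical; not tokio's actual Park trait or time driver internals):

```rust
use std::time::Duration;

// Simplified stand-in for tokio's internal Park trait, with the suggested
// shutdown hook added.
trait Park {
    type Unpark;
    type Error;

    fn unpark(&self) -> Self::Unpark;
    fn park(&mut self) -> Result<(), Self::Error>;
    fn park_timeout(&mut self, duration: Duration) -> Result<(), Self::Error>;

    /// Suggested addition: called once when the runtime shuts down.
    fn shutdown(&mut self);
}

// Hypothetical time driver state: shutdown drops every entry in the ring,
// which drops the registered wakers and breaks the reference cycle without
// needing a Weak upgrade on the hot path.
struct TimeDriverState {
    ring: Vec<std::task::Waker>,
}

impl TimeDriverState {
    fn shutdown(&mut self) {
        self.ring.clear();
    }
}
```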

@jadamcrain commented:

@emgre Can you look into @carllerche's suggestion? If we don't need the weak ref, then loom isn't an issue and we're unblocked.

@emgre (Contributor, Author) commented on Jul 23, 2020

I implemented what @carllerche suggested and it fixes the leak issues we have in our codebase. However, I'm still having trouble with loom. I think it's missing an unsafe impl<T> Send for JoinHandle<T> {}, but I was not able to reproduce what the CI does. Running RUSTFLAGS="--cfg loom --cfg tokio_unstable" cargo test --lib --release --features "full" -- --nocapture loom_pool::group_a on my machine compiles successfully...

cc @jadamcrain

@hawkw (Member) commented on Jul 23, 2020

> I implemented what @carllerche suggested and it fixes the leak issues we have in our codebase. However, I'm still having trouble with loom. I think it's missing an unsafe impl<T> Send for JoinHandle<T> {}, but I was not able to reproduce what the CI does. Running RUSTFLAGS="--cfg loom --cfg tokio_unstable" cargo test --lib --release --features "full" -- --nocapture loom_pool::group_a on my machine compiles successfully...
>
> cc @jadamcrain

@emgre Ah, it looks like there was a loom PR to add the missing Send + Sync impls for the loom version of JoinHandle (tokio-rs/loom#145) that has been merged but hasn't been released yet. I'll put together a new loom release, which should hopefully fix that so we can get this PR merged!

It's a little weird that it does compile for you locally --- you wouldn't happen to have a git dependency on loom or something, would you?

@emgre (Contributor, Author) commented on Jul 23, 2020

@hawkw I had loom checked out locally on the master branch when I tried it, so it had the changes you pointed to 🤦‍♂️

@Darksonn (Contributor) commented:

I think the new version of loom is out now.

@emgre changed the title from "[WIP] Fix #2641, #2535, and shutdown_timeout blocking for no reason" to "Fix #2641, #2535, and shutdown_timeout blocking for no reason" on Jul 27, 2020
@emgre (Contributor, Author) commented on Jul 27, 2020

The CI is now green ✔

The only remaining thing to fix is this test that currently deadlocks on Windows: fn runtime_in_thread_local()

I worked on it this morning without success 😓

@emgre (Contributor, Author) commented on Jul 28, 2020

I was able to reproduce the issue of the thread-local storage destructor not returning from a thread::join(), and I'm now convinced that it's a bug in libstd, not tokio. I opened this issue: rust-lang/rust#74875
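
For reference, a minimal sketch of the kind of pattern involved (hypothetical code, not the exact reproduction from the linked issue): a value in thread-local storage whose destructor calls thread::join(), which is roughly what dropping a runtime stored in a thread_local! does when it joins its worker threads.

```rust
use std::cell::RefCell;
use std::thread;

// Hypothetical TLS value whose destructor joins another thread when the
// owning thread exits.
struct JoinOnDrop(Option<thread::JoinHandle<()>>);

impl Drop for JoinOnDrop {
    fn drop(&mut self) {
        // Runs inside the TLS destructor during thread exit.
        if let Some(handle) = self.0.take() {
            let _ = handle.join();
        }
    }
}

thread_local! {
    static GUARD: RefCell<Option<JoinOnDrop>> = RefCell::new(None);
}

fn main() {
    let t = thread::spawn(|| {
        let helper = thread::spawn(|| {});
        GUARD.with(|g| *g.borrow_mut() = Some(JoinOnDrop(Some(helper))));
        // The thread body returns here; the TLS destructor runs afterwards.
    });
    // The symptom described above was this join never returning on Windows.
    let _ = t.join();
    println!("joined");
}
```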

@carllerche (Member) left a review comment

Looking good 👍 thanks for sticking with this. I added some comments, but I think this is getting close.

Review threads (resolved): tokio/tests/rt_common.rs, tokio/src/runtime/blocking/pool.rs (two threads)
@carllerche (Member) commented:

To fix formatting, run:

rustfmt --check --edition 2018 $(find . -name '*.rs' -print)

@carllerche (Member) left a review comment

Thanks! One last tweak and I think we should be good to go 👍

Review thread (resolved): tokio/src/runtime/blocking/pool.rs
@carllerche (Member) left a review comment

Thanks for sticking with it 👍

@carllerche merged commit 646fbae into tokio-rs:master on Jul 29, 2020
@Darksonn removed the S-blocked (Status: marked as blocked ❌ on something else such as a PR or other implementation work) label on Nov 10, 2020
@Darksonn mentioned this pull request on Nov 10, 2020
facebook-github-bot pushed a commit to facebook/sapling that referenced this pull request Jan 20, 2021
Summary:
Updating to a newer version of Tokio breaks hgcli.

This seems to be happening because Tokio fixed a bug where shutting down your runtime could leak threads (tokio-rs/tokio#2649), which results in the runtime refusing to shut down if you have a thread waiting to read on stdin (note: I didn't try to back that out specifically to confirm).

Unfortunately, we have that in hgcli, because we usually kick off a stdin read
before we know if we'll need the data or not.

Here's the backtrace for a blocked shutdown P163996251.

We were already trying to shut down the runtime and ignore any outstanding work (because we ensure earlier that there is nothing outstanding), so let's just try even harder.

Reviewed By: StanislavGlebik

Differential Revision: D25952430

fbshipit-source-id: ae3a1413790cf81a5d990220f2dc5d4599219f73