Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tokio LongRunningTaskDetector. (https://github.com/tokio-rs/console/issues/150) #57

Closed
wants to merge 15 commits into from

Conversation

zolyfarkas
Copy link

@zolyfarkas zolyfarkas commented Jan 2, 2024

Motivation

The motivation for this pull request is to provide a solution for the problem described at: tokio-rs/console#150

Blocking calls in the async workers lead to scalability issues, being able to detect them so that code can be fixed is very important for the scalability of an async-io service.

Lrtd running in production will detect blocked runtime workers and report them. This allows to emit thread stack trace details that will enable owners to quickly identify the issue and push out a fix, further enhancing the scalability and responsiveness of their services. The blocking event counts can be used to test the health of the canary tier, etc...

Solution

LongRunningTaskDetector uses a sampling black box approach to detect when processing in the tokio runtime in blocked for more than a configurable amount. The approach is simple and reliable, and has very low/configurable overhead, as such is is ideal for continuous running in production environments.

The LongRunningTaskDetector can be started like:

    use std::sync::Arc;
    use tokio_metrics::detectors::LongRunningTaskDetector;

    let (lrtd, mut builder) = LongRunningTaskDetector::new_multi_threaded(
      std::time::Duration::from_millis(10),
      std::time::Duration::from_millis(100)
    );
    let runtime = builder.worker_threads(2).enable_all().build().unwrap();
    let runtime_ref = Arc::new(runtime);
    let lrtd_runtime_ref = arc_runtime.clone();
    lrtd.start(lrtd_runtime_ref);
    runtime_ref.block_on(async {
     print!("my async code")
    });

When blocking is detected the stack traces of the runtime worker threads can be dumped (action is plugable), to allow us to identify what is happening in the worker threads.

On detection, details can contain precise info about the blocking culprit and will look like:

Detected worker blocking, signaling SIGUSR1 worker threads: [123145546047488]
Stack trace for thread "test_blocking_detection_current":123145546047488
   0: std::backtrace_rs::backtrace::libunwind::trace
             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1: std::backtrace_rs::backtrace::trace_unsynchronized
             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2: std::backtrace::Backtrace::create
             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/backtrace.rs:331:13
   3: std::backtrace::Backtrace::force_capture
             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/backtrace.rs:313:9
   4: tokio_util::lrtd::lrtd::signal_handler
             at ./src/lrtd/lrtd.rs:221:21
   5: __sigtramp
   6: ___semwait_signal
   7: <unknown>
   8: std::sys::unix::thread::Thread::sleep
             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/sys/unix/thread.rs:241:20
   9: std::thread::sleep
             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/thread/mod.rs:872:5
  10: lrtd::run_blocking_stuff::{{closure}}
             at ./tests/lrtd.rs:12:5
  11: lrtd::test_blocking_detection_current::{{closure}}
             at ./tests/lrtd.rs:57:30
  12: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/future/future.rs:125:9
  13: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}::{{closure}}
....

This is the output from the unit tests (run_blocking_stuff).

Not sure I put this in the right place, and there is likely some polishing left to do, but this should be good enough for some initial feedback.

for context see the previous PR.

src/lib.rs Outdated Show resolved Hide resolved
src/lrtd.rs Outdated Show resolved Hide resolved
src/lrtd.rs Outdated Show resolved Hide resolved
@zolyfarkas zolyfarkas marked this pull request as ready for review January 2, 2024 20:23
src/detectors.rs Outdated Show resolved Hide resolved
src/detectors.rs Outdated Show resolved Hide resolved
@zolyfarkas
Copy link
Author

@hawkw anything else that needs to be done on this PR?

@hawkw
Copy link
Member

hawkw commented Jan 24, 2024

@hawkw anything else that needs to be done on this PR?

I'll give it another review shortly! I think the use of pthread_t in APIs was my biggest concern, and that's been addressed, but I'd like to do a quick pass looking at docs, naming, API polish, etc.

It would be nice to get a second review from another maintainer involved in tokio-metrics, perhaps @Noah-Kennedy or @LucioFranco, if they have the time?

@zolyfarkas
Copy link
Author

@hawkw I fixed I think the check issues in the PR, if you can approve another run that would be great! Also if there is anything else to do let me know. Thanks!

@zolyfarkas
Copy link
Author

@hawkw fixed the rust doc errors, can you kick off the signals workflow? thanks!

@zolyfarkas
Copy link
Author

@hawkw or anyone else, let meknow if there is anything else to do. thank you.

@zolyfarkas
Copy link
Author

@hawkw gentle nudge :-)

@zolyfarkas
Copy link
Author

another gentle nudge, @hawkw , anyone?

Copy link
Contributor

@Darksonn Darksonn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't normally review changes to tokio-metrics, but here is a review.

I mostly have a bunch of comments to your documentation. Have you tried rendering the documentation and seeing how it looks in the browser? That can be really helpful.

You probably need this command to render it:

RUSTDOCFLAGS="--cfg docsrs" RUSTFLAGS="--cfg docsrs" cargo +nightly doc --all-features

Cargo.toml Outdated Show resolved Hide resolved
src/detectors.rs Show resolved Hide resolved
src/detectors.rs Outdated Show resolved Hide resolved
src/detectors.rs Outdated Show resolved Hide resolved
src/detectors.rs Outdated Show resolved Hide resolved
src/detectors.rs Outdated Show resolved Hide resolved
src/detectors.rs Outdated Show resolved Hide resolved
src/detectors.rs Outdated Show resolved Hide resolved
@zolyfarkas
Copy link
Author

@Darksonn Thanks a lot for your feedback! I hope I addressed all your concerns in the latest update.

@zolyfarkas
Copy link
Author

Looks like there is a flaky doc test in runtime.rs:312. (unrelated to this PR)

@zolyfarkas
Copy link
Author

@Noah-Kennedy, @LucioFranco would you also be able to take a look? Thanks!

@zolyfarkas
Copy link
Author

@Darksonn if there is anything else I need to do let me know. thanks!

@zolyfarkas
Copy link
Author

@hawkw or anyone else, let me know if there is anything else to do on this PR. Thanks!

@Darksonn
Copy link
Contributor

I have asked around with the other maintainers, and unfortunately I don't think there is anyone who wants to merge this into tokio-metrics. There are various concerns with the API design being rather invasive for the user, but ultimately I think that this approach is just not the long-term solution we want for detection of stuck tasks.

As for myself, remember that when I closed your previous PR, I said that submitting something to tokio-metrics would "require an entirely different implementation". Carl also said something to the same effect.

I apologize for us starting to review a change that we were not going to accept in the end.

I recommend that you create a new crate of this utility. If you do that, then we are willing to add links in Tokio's documentation to your crate so that other people can more easily discover it.

@Darksonn Darksonn closed this Mar 14, 2024
@zolyfarkas
Copy link
Author

@Darksonn Thank you for your feedback on this work, it made this implementation a lot better.

Regarding "various concerns with the API design being rather invasive for the user" I would be grateful for some specifics on this, and maybe some suggestions on how to make this better. The design I used was limited to current tokio's public API which I did not want to touch.

I will post a followup comment once I publish this in its own crate.

facebook-github-bot pushed a commit to facebookexperimental/rust-shed that referenced this pull request Mar 20, 2024
Summary:
It is time for tokio to have a long running task detector, and help prevent issues like: S292815
The equivalent for python helped eliminate 100s if not thousands of issues and scale 50+ services.

See included README.md for details.
This is the next step after an attempt to make this part of tokio. see [tokio-metrics#57](tokio-rs/tokio-metrics#57) for more historical context

Reviewed By: hodgesds

Differential Revision: D55062668

fbshipit-source-id: 02df88e1215867a7fa2b36346eb046949332b33d
@zolyfarkas
Copy link
Author

For anyone interested in this component, it's next home is: https://github.com/facebookexperimental/rust-shed/tree/main/shed/tokio-detectors

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants