
The tokio::time::timeout() does not return after timeout. #6682

Closed
jtt opened this issue Jul 10, 2024 · 11 comments · Fixed by #6683
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. M-runtime Module: tokio/runtime M-time Module: tokio/time

Comments

@jtt

jtt commented Jul 10, 2024

Version

We have been able to reproduce this issue with tokio 1.38.0 and main as of c8f3539.
Tokio 1.37.0 does not seem to be affected (but since this is really sensitive to timing, we cannot be sure).

Platform

  • Linux 6.1.0-22-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.94-1 (2024-06-21) x86_64 GNU/Linux
  • 23.5.0 Darwin Kernel Version 23.5.0: Wed May 1 20:19:05 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T8112 arm64

Description

Tokio 1.38.0 seems to have introduced a bug which affects at least tokio::time::timeout(), but is likely related to tokio::time::sleep::Sleep.

We have a program which calls tokio::time::timeout() in a loop:

    loop {
        let mut wait_time = { ... }; // calculate delay
        if let Some(delay) = wait_time {
            match tokio::time::timeout(delay, rx.recv()).await {
                Ok(val) => match val {
                    None => {
                        return;
                    }
                    Some((r, cnt)) => {
                        wait_time = { ... }; // recalculate delay
                    }
                },
                Err(_e) => {
                    wait_time = { ... }; // recalculate delay
                }
            }
        } else {
            return;
        }
    }

where rx is a tokio::sync::mpsc::Receiver. After updating to tokio 1.38.0 we observed that sometimes the program would get stuck on timeout(), that is, the timeout().await never returned.
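
For reference, a self-contained sketch of this pattern looks roughly like the following. The message type, channel capacity, and delay values are made up for illustration, and this sketch does not reliably reproduce the hang:

    use std::time::Duration;
    use tokio::sync::mpsc;

    async fn recv_loop(mut rx: mpsc::Receiver<(u64, u64)>) {
        // Hypothetical initial delay; the real program computes it from its own state.
        let mut wait_time: Option<Duration> = Some(Duration::from_millis(500));
        loop {
            let Some(delay) = wait_time else { return };
            match tokio::time::timeout(delay, rx.recv()).await {
                // Channel closed: all senders dropped.
                Ok(None) => return,
                // Got a value before the timeout; recalculate the delay.
                Ok(Some((_r, _cnt))) => wait_time = Some(Duration::from_millis(750)),
                // The timeout elapsed; recalculate the delay.
                Err(_elapsed) => wait_time = Some(Duration::from_secs(2)),
            }
        }
    }

    #[tokio::main]
    async fn main() {
        let (tx, rx) = mpsc::channel(16);
        let receiver = tokio::spawn(recv_loop(rx));
        // Send a burst so rx.recv() returns (dropping the inner Sleep), then
        // leave the channel idle so the next timeout() actually has to fire.
        for i in 0..10u64 {
            tx.send((i, i)).await.unwrap();
        }
        tokio::time::sleep(Duration::from_secs(5)).await;
        drop(tx);
        receiver.await.unwrap();
    }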

The bug seems to trigger when

a) there has been data on the channel and rx.recv() has returned a value, so the underlying Sleep has been dropped, and
b) we call timeout() at just the right time and the rx channel is empty. The Duration we call timeout() with varies from a few hundred milliseconds to a few seconds.

It looks like some kind of race where we call timeout() just after the previous call has returned Ok(). Reproducing this sometimes happens after running for a few minutes, and sometimes the program needs to run for hours before getting stuck. Trying to debug this by adding println!() etc. makes the bug harder to trigger; the same applies to adding debug output to the internals of tokio::runtime::time.

What I have observed is that tokio::runtime::time::Driver::park_internal() calculates the expiration_time as None just after we call timeout(); for some reason the newly added Sleep does not affect that calculation, and thus the timeout().await never returns.
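
As a way to reason about this, here is a deliberately simplified model (not tokio's actual driver code) of how computing a park deadline by inspecting sharded timer state one lock at a time can miss a timer that is registered concurrently, leaving the driver parked with no deadline:

    use std::sync::Mutex;

    struct Shard {
        // Earliest registered deadline in this shard, in driver ticks.
        next_expiration: Mutex<Option<u64>>,
    }

    fn park_deadline(shards: &[Shard]) -> Option<u64> {
        let mut earliest: Option<u64> = None;
        for shard in shards {
            let next = *shard.next_expiration.lock().unwrap();
            // <-- a concurrent timeout() may register a deadline in a shard that
            //     was already unlocked above; this pass will never see it.
            earliest = match (earliest, next) {
                (Some(a), Some(b)) => Some(a.min(b)),
                (a, b) => a.or(b),
            };
        }
        // If every shard looked empty, the caller parks without a wake-up even
        // though a timer may now be pending.
        earliest
    }

    fn main() {
        let shards = vec![
            Shard { next_expiration: Mutex::new(None) },
            Shard { next_expiration: Mutex::new(Some(42)) },
        ];
        assert_eq!(park_deadline(&shards), Some(42));
    }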

Unfortunately I have not been able to minimise the program into something that would reliably (or at all) trigger the bug, but please let me know if any additional information is needed and I'll try to provide that.

@jtt jtt added A-tokio Area: The main tokio crate C-bug Category: This is a bug. labels Jul 10, 2024
@wathenjiang wathenjiang added M-time Module: tokio/time M-runtime Module: tokio/runtime labels Jul 11, 2024
@wathenjiang
Contributor

There is more than one reason why the runtime can get stuck; see #6315.
Minimising the program would be a great help in finding the true cause.

@wathenjiang
Contributor

Can we get the stacks of the worker threads when the runtime gets stuck?

@jtt
Author

jtt commented Jul 11, 2024

Attached is the backtrace of all threads. This one is from an M2 MacBook Air with tokio main HEAD. At this point we have only a few tasks running, all of them awaiting either a read from or a write to an mpsc channel, except one task which is awaiting the timeout() that never returns.

thread-backtrace.txt

We've also observed that on Linux, suspending the program with ctrl + z in the terminal and then continuing it with fg %1 causes the timeout() to return, and the program runs fine again (until it gets stuck on timeout() again). On macOS this does not seem to happen.

@oherrala

I've been hunting this issue with @jtt.

Today I ran a test between two consecutive commits of tokio: 1st: e62c3e9 and 2nd: 1914e1e:

[patch.crates-io]
# tokio = { git = "https://github.com/tokio-rs/tokio", rev = "e62c3e92692f795c407beadff580fa1380df5a26" }
tokio = { git = "https://github.com/tokio-rs/tokio", rev = "1914e1e4b9bfe6ea2d61970ec3fcf2b5d7bb0210" }

The hang happens really fast on the 2nd commit, but doesn't appear on the 1st commit (though it might of course just need more running time). The 2nd commit is the merge of #6534, which implemented the sharded locks on Timers. We will keep running with the 1st commit to see if we can reproduce the hang on it as well.

@wathenjiang
Contributor

> I've been hunting this issue with @jtt.
>
> Today I ran a test between two consecutive commits of tokio: 1st: e62c3e9 and 2nd: 1914e1e:
>
> [patch.crates-io]
> # tokio = { git = "https://github.com/tokio-rs/tokio", rev = "e62c3e92692f795c407beadff580fa1380df5a26" }
> tokio = { git = "https://github.com/tokio-rs/tokio", rev = "1914e1e4b9bfe6ea2d61970ec3fcf2b5d7bb0210" }
>
> The hang happens really fast on the 2nd commit, but doesn't appear on the 1st commit (though it might of course just need more running time). The 2nd commit is the merge of #6534, which implemented the sharded locks on Timers. We will keep running with the 1st commit to see if we can reproduce the hang on it as well.

Can you change the tokio dependency to https://github.com/wathenjiang/tokio/tree/update-next-wake-hold-locks? I hope it will help us find out whether there is some data race issue here.

@wathenjiang
Contributor

If I understand it correctly, this hang should not be reproducible on that commit.

@oherrala

> Can you change the tokio dependency to https://github.com/wathenjiang/tokio/tree/update-next-wake-hold-locks? I hope it will help us find out whether there is some data race issue here.

Thanks! I'll do a test run with this branch.

[patch.crates-io]
# tokio = { git = "https://github.com/tokio-rs/tokio", rev = "e62c3e92692f795c407beadff580fa1380df5a26" }
# tokio = { git = "https://github.com/tokio-rs/tokio", rev = "1914e1e4b9bfe6ea2d61970ec3fcf2b5d7bb0210" }
tokio = { git = "https://github.com/wathenjiang/tokio", branch = "update-next-wake-hold-locks" }

@jtt
Author

jtt commented Jul 12, 2024

> Can you change the tokio dependency to https://github.com/wathenjiang/tokio/tree/update-next-wake-hold-locks? I hope it will help us find out whether there is some data race issue here.

We've now been running with this branch for ~20 hours on Windows, Linux and macOS, and so far we have not observed any issues; things have been running smoothly. We'll keep testing for at least a few days and let you know of any issues, but it is looking good so far.

@wathenjiang
Contributor

> We've now been running with this branch for ~20 hours on Windows, Linux and macOS, and so far we have not observed any issues; things have been running smoothly. We'll keep testing for at least a few days and let you know of any issues, but it is looking good so far.

Thank you for your feedback! It seems that there is some data race here. I will try to fix it.

kodiakhq bot pushed a commit to pdylanross/fatigue that referenced this issue Jul 17, 2024
Bumps tokio from 1.38.0 to 1.38.1.

Release notes
Sourced from tokio's releases.

Tokio v1.38.1
1.38.1 (July 16th, 2024)
This release fixes the bug identified as (#6682), which caused timers not to fire when they should.
Fixed

time: update wake_up while holding all the locks of sharded time wheels (#6683)

#6682: tokio-rs/tokio#6682
#6683: tokio-rs/tokio#6683



Commits

14b9f71 chore: release Tokio v1.38.1 (#6688)
24344df time: fix race condition leading to lost timers (#6683)
@viciousstar

Is 1.37.0 affected? Which version introduced this bug?

@oherrala

oherrala commented Aug 8, 2024

@viciousstar My understanding is that 1.37.0 is not affected. The bug was introduced in 1.38.0 (with #6534) and then fixed in 1.38.1.
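
For anyone else hitting this, pinning to the patched release in Cargo.toml is enough to pick up the fix (the feature list below is illustrative):

[dependencies]
# 1.38.0 is the affected release; 1.38.1 contains the fix from #6683.
tokio = { version = "1.38.1", features = ["rt-multi-thread", "time", "sync", "macros"] }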
