
Fix for flaky worker_steal_count test #6932

Merged: 1 commit into tokio-rs:master on Oct 28, 2024

Conversation

@jofas (Contributor) commented Oct 23, 2024

Fixes #6470. Same approach as in #6916 for a similar flaky test that also relies on blocking the runtime. I tried to corroborate the validity of this fix by running the test ~250k times on my local machine, but all of those runs passed without triggering even a single retry.

Comment on lines 326 to 328
let n: u64 = (0..metrics.num_workers())
    .map(|i| metrics.worker_steal_count(i))
    .sum();
Contributor:

We should recreate the runtime on failure. Otherwise a failed round could result in this number being wrong.

Member:

Ah, I guess this could be the case for #6916 too.

@jofas (author), Oct 23, 2024:

> Ah, I guess this could be the case for #6916 too.

I don't think the global_queue_depth in #6916 could be tainted by a task from a previous try to block the runtime. That would require both tasks from the next try to be scheduled before the one unscheduled (but already queued) task of the previous try, which I don't believe can happen. If it could, the test would be racy. I'm not exactly sure what guarantees Tokio gives for task execution order, though. Even if my assumption holds, we would be relying on an implementation detail, so I think we should recreate the runtime in that test too, to bulletproof it against future changes.

Member:

If the deadlock scenario is what I mentioned here, then one injection queue task will remain in the queue. So I think starting another iteration without refreshing the runtime could taint the queue.

@jofas (author):

> So I think starting another iteration without refreshing the runtime could taint the queue.

That would require both tasks from the current try at blocking the runtime to be executed before the one remaining task from the previous try. Is that possible, given that a task blocks the worker it is executed on?

(Sorry for my ignorance of the implementation details; I've yet to fully grasp the execution model of the multi-threaded runtime.)

Contributor:

Yes, but it's probably pretty unlikely.

@jofas (author):

Okay, then I should definitely revisit #6916 and recreate the runtime on each try. Should I make a new PR or do the changes here?

Contributor:

Up to you. I would probably make a new PR.

@jofas (author):

#6936 for reference.
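
(Editor's note: below is a minimal sketch of the structure the reviewers converge on, recreating the runtime for every attempt so that a task left over from a failed round cannot taint the next round's metrics. It is an illustration, not the merged test code; the worker count, attempt limit, and the elided workload are assumptions, and worker_steal_count requires tokio's unstable metrics behind cfg(tokio_unstable).)

```rust
// Editorial sketch only: retry the test, building a fresh runtime per attempt.
// Assumes cfg(tokio_unstable) so that worker_steal_count is available.
fn steal_count_with_fresh_runtime_per_attempt() -> u64 {
    for attempt in 0..3 {
        // A new runtime per attempt: nothing queued during a failed round can
        // leak into the metrics of the next one.
        let rt = tokio::runtime::Builder::new_multi_thread()
            .worker_threads(2)
            .enable_all()
            .build()
            .unwrap();
        let metrics = rt.metrics();

        // ... drive the workload that is expected to force a steal here ...

        // Same summation as in the diff hunk under discussion.
        let n: u64 = (0..metrics.num_workers())
            .map(|i| metrics.worker_steal_count(i))
            .sum();

        // Dropping the runtime discards anything still sitting in its queues.
        drop(rt);

        if n > 0 {
            return n;
        }
        eprintln!("attempt {attempt}: no steal observed, retrying with a fresh runtime");
    }
    panic!("no steal observed after 3 attempts");
}
```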

@jofas force-pushed the fix-flaky-worker-steal-count-test branch from a47cc64 to a06fa00 on October 24, 2024 13:50
@Darksonn added the A-tokio (Area: The main tokio crate) and M-metrics (Module: tokio/runtime/metrics) labels on Oct 25, 2024
@jofas force-pushed the fix-flaky-worker-steal-count-test branch from a06fa00 to f44896c on October 27, 2024 14:23
@Darksonn left a comment:

Thanks.

@Darksonn merged commit 4468f27 into tokio-rs:master on Oct 28, 2024
81 checks passed
@jofas deleted the fix-flaky-worker-steal-count-test branch on October 28, 2024 08:28
Labels
A-tokio (Area: The main tokio crate), M-metrics (Module: tokio/runtime/metrics)

Successfully merging this pull request may close these issues: Test worker_steal_count hangs sometimes (#6470)

3 participants