Fix for flaky worker_steal_count test #6932
Conversation
tokio/tests/rt_unstable_metrics.rs (outdated)
let n: u64 = (0..metrics.num_workers())
    .map(|i| metrics.worker_steal_count(i))
    .sum();
We should recreate the runtime on failure. Otherwise a failed round could result in this number being wrong.
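For concreteness, here is a minimal sketch of the retry shape being suggested: rebuild the runtime on every attempt so steal counts from a failed round cannot leak into the next measurement. The helper name, worker count, and placeholder workload are illustrative assumptions rather than the actual test code, and `worker_steal_count` is only available under the `tokio_unstable` cfg.

```rust
use tokio::runtime::Builder;

// Hypothetical helper: measure total steals, rebuilding the runtime on
// every attempt so a failed round cannot inflate the next measurement.
fn total_steals_with_retries(tries: usize) -> u64 {
    for _ in 0..tries {
        // Fresh runtime per attempt: its steal counters start at zero.
        let rt = Builder::new_multi_thread()
            .worker_threads(2)
            .enable_all()
            .build()
            .unwrap();

        rt.block_on(async {
            // ... spawn work arranged so that at least one steal happens ...
        });

        let metrics = rt.metrics();
        let n: u64 = (0..metrics.num_workers())
            .map(|i| metrics.worker_steal_count(i))
            .sum();

        if n > 0 {
            return n;
        }
        // No steal observed this round: drop the runtime and try again.
    }
    panic!("no steal observed after {tries} attempts");
}
```

The important part is that the runtime (and therefore its counters) is rebuilt inside the loop rather than reused across attempts.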
Ah, I guess this could be the case for #6916 too.
> Ah, I guess this could be the case for #6916 too.
I don't think the global_queue_depth in #6916 could be tainted by a task left over from a previous attempt to block the runtime. That would require both tasks from the next attempt to be scheduled before the one unscheduled (but already queued) task of the previous attempt, which I don't believe can happen? If it could, the test would be racy. I'm not exactly sure what guarantees Tokio gives for task execution order, though. Even if my assumption holds, we would be relying on an implementation detail, so I think we should recreate the runtime in that test, too, to bulletproof it against future changes.
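To make the scenario under discussion concrete, here is a rough sketch of blocking both workers and reading `global_queue_depth`, with a fresh runtime per attempt so a leftover task from an earlier attempt cannot inflate the reading. This is not the actual #6916 test; the function name, two-worker setup, and barrier trick are assumptions for illustration.

```rust
use std::sync::{Arc, Barrier};
use tokio::runtime::Builder;

// Hypothetical sketch: park both workers on a std Barrier so that any
// further spawn can only sit in the global (injection) queue.
fn queue_depth_while_blocked() -> usize {
    // Fresh runtime per attempt, so nothing from an earlier attempt lingers.
    let rt = Builder::new_multi_thread()
        .worker_threads(2)
        .build()
        .unwrap();
    let metrics = rt.metrics();

    let started = Arc::new(Barrier::new(3));
    let release = Arc::new(Barrier::new(3));
    for _ in 0..2 {
        let (started, release) = (started.clone(), release.clone());
        rt.spawn(async move {
            started.wait();
            // Blocking wait on purpose: it pins the worker this task runs on.
            release.wait();
        });
    }
    // Only returns once both blocking tasks are actually running, one per
    // worker. In the problematic case discussed in this thread, an attempt
    // can instead stall with a task still queued; the real test guards
    // against that with retries, and rebuilding the runtime keeps any
    // leftover task from tainting the next attempt.
    started.wait();

    // With both workers blocked, this task must still be in the global
    // queue when the depth is read.
    rt.spawn(async {});
    let depth = metrics.global_queue_depth();

    // Unblock the workers again before the runtime is dropped.
    release.wait();
    depth
}
```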
If the deadlock scenario is what I mentioned here, then one injection queue task will remain in the queue. So I think starting another iteration without refreshing the runtime could taint the queue.
> So I think starting another iteration without refreshing the runtime could taint the queue.
That would require both tasks from the current attempt to block the runtime to be executed before the one remaining task from the previous attempt. Is that possible, given that a task blocks the worker it is executed on?
(Sorry for my ignorance of the implementation details; I've yet to fully grasp the execution model of the multi-threaded runtime.)
Yes, but it's probably pretty unlikely.
Okay, then I should definitely revisit #6916 and recreate the runtime on each try. Should I make a new PR or do the changes here?
Up to you. I would probably make a new PR.
#6936 for reference.
Force-pushed from a47cc64 to a06fa00, then from a06fa00 to f44896c.
Thanks.
Fixes #6470. Same approach as in #6916, which addressed a similar flaky test that also relies on blocking the runtime. I tried to corroborate the fix by running this test ~250k times on my local machine; every run passed, but I was unable to hit even a single retry, so I could not reproduce the original failure.