Fix batched queue starvation #2264

Merged — 8 commits merged into typelevel:series/3.x on Aug 27, 2021

Conversation

@vasilmkd (Member)

Fixes #2263 (The LocalQueue can starve the batchedQueue).

@djspiewak (Member) commented on Aug 26, 2021

The second commit (edit: now removed) trades away some fairness for some throughput (about 12% more). With that commit, the bug fix results in an 8% reduction in the scheduler benchmark; without the fairness tradeoff, it's about 15%. Not too bad, honestly. I could go either way on the overflow ticks. It's always possible to improve throughput by bumping up that coefficient, but it's probably safer to keep it lower.
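
To make the tradeoff concrete, here is a minimal sketch of the kind of knob being discussed; it is not the actual WorkStealingThreadPool code, and the names and the `OverflowTicks` value are hypothetical. A worker only consults the batched (overflow) queue once every `OverflowTicks` iterations, so raising that coefficient buys throughput at the cost of how quickly batched fibers get to run.

```scala
import scala.collection.mutable

// Hypothetical sketch: a worker loop that visits the batched queue
// only once every OverflowTicks iterations.
object OverflowTicksSketch {
  final val OverflowTicks = 64 // hypothetical coefficient; higher = more throughput, less fairness

  def run(local: mutable.Queue[Runnable], batched: mutable.Queue[Runnable]): Unit = {
    var ticks = 0
    while (local.nonEmpty || batched.nonEmpty) {
      if (ticks % OverflowTicks == 0 && batched.nonEmpty) {
        // Periodically promote batched work so it cannot be starved indefinitely.
        local.enqueue(batched.dequeue())
      }
      if (local.nonEmpty) local.dequeue().run()
      ticks += 1
    }
  }
}
```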

@vasilmkd (Member, Author) commented on Aug 26, 2021

I pushed a bogus timeout for the test just to see if it passes at all.

Edit: reverted.

@djspiewak (Member) commented on Aug 27, 2021

This represents a roughly 21% reduction in performance from series/3.x on the scheduler benchmark (there was more thermal throttling when I ran against this PR than there was when I ran against series/3.x, so take those numbers as more of an upper bound; Vasil got a 17.5% reduction in his testing). Lots of ways it can be improved though. Let's get the fix in first and then work on the refinements.

@vasilmkd (Member, Author)

Before:

Benchmark                          (size)   Mode  Cnt   Score   Error    Units
WorkStealingBenchmark.scheduling  1000000  thrpt   20  13.567 ± 0.360  ops/min

After:

Benchmark                          (size)   Mode  Cnt   Score   Error    Units
WorkStealingBenchmark.scheduling  1000000  thrpt   20  11.165 ± 0.293  ops/min
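
For reference, the drop between those two scores works out to roughly 17.7%, consistent with the 17.5% figure mentioned above (assuming "Before" is series/3.x and "After" is this branch). Quick arithmetic check, not part of the benchmark itself:

```scala
// Relative throughput reduction implied by the scores above.
val before = 13.567 // ops/min, before the fix
val after  = 11.165 // ops/min, with the fix
val reductionPct = (before - after) / before * 100 // ≈ 17.7%
```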

@vasilmkd (Member, Author)

I willingly sacrificed some performance here for a quicker turnaround of the fix. We're working on tuning certain runtime parameters (necessary tuning we never really found the time for, but which we'll now do properly), and that should recover some of the performance lost in this PR.

@vasilmkd marked this pull request as ready for review on August 27, 2021 16:01
@vasilmkd requested a review from @djspiewak on August 27, 2021 16:02
@djspiewak (Member) left a review comment

Random note: batch size is 128, and we check the batched queue once every 128 iterations. However, we may progress as few as 57 fibers every 64 iterations, and we may have also taken from the external queue, meaning that we could have evaluated only 57 + 57 - 1 = 113 fibers by the time we go back to the batched queue, which may in turn result in 128 of our local fibers getting dumped to the tail of the batched queue. The overlap here is 128 - 113 = 15 fibers. These fibers experience unfairness (though not indefinite unfairness, since they'll be at the front of the batch the next time through, meaning that they are guaranteed to run).

This is a pathological edge case and will be improved by subsequent iterations. It's fine for now.
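
Spelled out, the arithmetic in that note looks like this (all values are quoted from the comment itself, not read from the cats-effect source):

```scala
// Worked version of the numbers in the review note above.
val batchedCheckInterval = 128                               // iterations between batched-queue checks
val minFibersPer64Iters  = 57                                // fewest local fibers progressed per 64 iterations
val externalQueueTakes   = 1                                 // one slot spent taking from the external queue
val evaluated = 2 * minFibersPer64Iters - externalQueueTakes // 113 fibers evaluated
val overlap   = batchedCheckInterval - evaluated             // 15 fibers that may be dumped and delayed
```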

@djspiewak merged commit 0b12f95 into typelevel:series/3.x on Aug 27, 2021
@senia-psm

There might be an issue with this fix. Please see #2269

@vasilmkd deleted the batched-starvation branch on August 28, 2021 07:12