
Cache now in WorkerThread in states 4-63 #3690

Merged

Conversation

djspiewak (Member)

I still want to see numbers which demonstrate the impact of these sorts of changes in realistic scenarios, but here's a quick draft that takes inspiration from libuv and minimizes the nanoTime() syscalls. In particular, it caches the value of nanoTime() for states 4-63 (so, most of the time) and only refreshes it when stealing, when polling the external queue, or when the user evaluates monotonic. This does trade off timer granularity a bit, but it's still strictly better than what we had pre-3.5 (where timers would re-enter through the external queue).

Fixes #3677
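
A rough sketch of the caching idea described above (illustrative only: CachedClock, cachedNow, refreshNow, and the state threshold are placeholder names, not the actual WorkerThread internals):

    // Illustrative sketch only; not the real WorkerThread fields or state machine.
    final class CachedClock {
      // last observed value of System.nanoTime(), reused on the fast path
      private[this] var cachedNow: Long = System.nanoTime()

      // slow path: stealing, polling the external queue, or the user evaluating monotonic
      def refreshNow(): Long = {
        cachedNow = System.nanoTime()
        cachedNow
      }

      // fast path (states 4-63 in the description above): reuse the cached value, no syscall
      // (the `state < 4` threshold is a placeholder for "the slow-path states")
      def now(state: Int): Long =
        if (state < 4) refreshNow() else cachedNow
    }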

Comment on lines 706 to 710

    case _ =>
      // Call all of our expired timers:
      val now = System.nanoTime()
      var cont = true
      while (cont) {
        val cb = sleepers.pollFirstIfTriggered(now)
Member
The optimization discussed in #3544 (comment) would become even more meaningful after this change, if now may remain constant for several iterations. So crossing a read-barrier on every iteration is pointless 😅
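
(For illustration, one possible shape of the optimization being referenced, under the assumption that now stays constant across several passes; the parameter names below are hypothetical stand-ins, not the real SleepersQueue API.)

    // Illustration with hypothetical stand-ins: cache the earliest known trigger time
    // so that, while `now` has not reached it, we can skip the concurrent timer
    // structure (and its read-barrier) entirely on subsequent passes.
    final class TriggerCache {
      // earliest trigger time seen on the last queue check; MinValue = must re-check
      private[this] var cachedNextTrigger: Long = Long.MinValue

      def checkTimers(
          now: Long,
          pollFirstIfTriggered: Long => Runnable, // hypothetical stand-in
          peekFirstTriggerTime: () => Long // hypothetical stand-in; Long.MaxValue when empty
      ): Unit = {
        // real code would use overflow-safe subtraction for nanoTime comparisons
        if (cachedNextTrigger <= now) {
          // only on this path do we cross into the concurrent structure
          var cb = pollFirstIfTriggered(now)
          while (cb != null) {
            cb.run() // fire the expired timer
            cb = pollFirstIfTriggered(now)
          }
          cachedNextTrigger = peekFirstTriggerTime() // reuse on subsequent passes
        }
        // else: `now` has not advanced past the cached trigger time, so nothing
        // can have expired and no barrier crossing is needed
      }
    }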

djspiewak (Member, Author)

Yeah, I thought about that. It's not horribly complicated to do something like that here; it just introduces a bit more state and a bit more branching. I wanted to start small.

armanbilge linked an issue (Jun 14, 2023) that may be closed by this pull request
durban (Contributor) left a comment

I've left a comment; otherwise this looks reasonable. The hard part is (as you say) knowing what the effect of this is in real life...

djspiewak (Member, Author) commented Jun 14, 2023

I used a t2.large instance to measure three scenarios: series/3.4.x, series/3.5.x, and this PR (as of the current HEAD). I ran the WorkStealingThreadPool benchmarks (-wi 10 -i 10 -f 1) as well as the timer drift measurements (so we can measure the granularity impact both loaded and unloaded). The results are as follows; the TL;DR is at the bottom.
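
(For reference, the JMH runs above correspond to an sbt-jmh invocation along the lines of the following; the module name and class filter are assumptions and may differ in the repo.)

    sbt "benchmarks/Jmh/run -wi 10 -i 10 -f 1 .*WorkStealingBenchmark.*"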

series/3.4.x

Granularity

Warming up...
Measuring unloaded...
Unloaded overhead: 0.1542258647750001
Measuring heavily loaded...
Loaded overhead: 1.5743448329583334
Measuring unloaded 100x...
Unloaded overhead 100x: 0.11572495005833328
Measuring heavily loaded 100x...
Killed

Benchmarks

[info] Benchmark                                              (size)   Mode  Cnt    Score   Error    Units
[info] WorkStealingBenchmark.alloc                           1000000  thrpt   10    0.917 ± 0.003  ops/min
[info] WorkStealingBenchmark.manyThreadsSchedulingBenchmark  1000000  thrpt   10    3.862 ± 0.225  ops/min
[info] WorkStealingBenchmark.runnableScheduling              1000000  thrpt   10  249.878 ± 3.349  ops/min
[info] WorkStealingBenchmark.runnableSchedulingScalaGlobal   1000000  thrpt   10  227.841 ± 0.476  ops/min
[info] WorkStealingBenchmark.scheduling                      1000000  thrpt   10    4.423 ± 0.136  ops/min

series/3.5.x

Granularity

Warming up...
Measuring unloaded...
Unloaded overhead: 0.11346367834166671
Measuring heavily loaded...
Loaded overhead: 2.6429889117916665
Measuring unloaded 100x...
Unloaded overhead 100x: 0.1209033832166666
Measuring heavily loaded 100x...
Loaded overhead: 2.912850151016667

Benchmarks

[info] Benchmark                                              (size)   Mode  Cnt    Score   Error    Units
[info] WorkStealingBenchmark.alloc                           1000000  thrpt   10    0.858 ± 0.001  ops/min
[info] WorkStealingBenchmark.manyThreadsSchedulingBenchmark  1000000  thrpt   10    0.832 ± 0.013  ops/min
[info] WorkStealingBenchmark.runnableScheduling              1000000  thrpt   10   17.278 ± 0.180  ops/min
[info] WorkStealingBenchmark.runnableSchedulingScalaGlobal   1000000  thrpt   10  225.010 ± 0.537  ops/min
[info] WorkStealingBenchmark.scheduling                      1000000  thrpt   10    0.883 ± 0.006  ops/min

HEAD of this PR

Granularity

Warming up...
Measuring unloaded...
Unloaded overhead: 0.11598775258333327
Measuring heavily loaded...
Loaded overhead: 0.5139900563083333
Measuring unloaded 100x...
Unloaded overhead 100x: 0.12293377584999998
Measuring heavily loaded 100x...
Loaded overhead: 0.7911155032249999

Benchmarks

[info] Benchmark                                              (size)   Mode  Cnt    Score     Error    Units
[info] WorkStealingBenchmark.alloc                           1000000  thrpt   10    0.922 ±   0.002  ops/min
[info] WorkStealingBenchmark.manyThreadsSchedulingBenchmark  1000000  thrpt   10    1.923 ±   1.738  ops/min
[info] WorkStealingBenchmark.runnableScheduling              1000000  thrpt   10   64.816 ±   0.578  ops/min
[info] WorkStealingBenchmark.runnableSchedulingScalaGlobal   1000000  thrpt   10  135.160 ± 121.636  ops/min
[info] WorkStealingBenchmark.scheduling                      1000000  thrpt   10    2.322 ±   2.148  ops/min

TL;DR

This PR reduces scheduling overhead by around 2-3x in most fiber-heavy workloads, and by around 3.5-4x in pure Runnable workloads. It also improves timer granularity in heavily loaded scenarios by about 5x (which is quite a surprise, since we expected a regression here).

Additionally, we saw some interesting edge effects in some of the benchmark runs, most clearly seen in the WorkStealingBenchmark.runnableSchedulingScalaGlobal run on HEAD, which exhibits a vast standard deviation and a significantly lower mean. The fun thing is that this particular benchmark is entirely unaffected by the changes in any of these branches, since it simply benchmarks ExecutionContext.global and bypasses Cats Effect altogether! In other words, we're measuring contention within the AWS EC2 hypervisor layer. I'm almost entirely certain that we wouldn't see these kinds of effects on an EC2 metal instance, but I'm not paying for that just to validate a hypothesis. :-P

TL;DR: this PR improves everything pretty dramatically (at least as far as synthetic benchmarks go; the jury's still out on real-world workloads). We can make things even better, but I think we should absolutely merge and release this once I have addressed @durban's comment.

durban (Contributor) commented Jun 15, 2023

I think there is something wrong with CI: some Cirrus jobs haven't been scheduled(?) for more than 12 hours...

armanbilge (Member)

> I think there is something wrong with CI

Cirrus CI uses pre-emptible VMs to be cost-efficient, so under high demand machines may not be available.

djspiewak (Member, Author)

> So under high demand, machines may not be available.

This is one concern I have with expanding things to fs2, unless it's more of a global Cirrus demand issue and not GitHub org-specific?

armanbilge (Member) commented Jun 15, 2023

> unless it's more of a global Cirrus demand

It's this one, I'm afraid.

Edit: to clarify, besides global demand there are some other limits for OSS. But in FS2 I have only added four very short-running jobs; it's nowhere near comparable to the CE matrix.

wjoel (Contributor) commented Jun 17, 2023

Baseline (@armanbilge's 3.6 pre-release): https://www.techempower.com/benchmarks/#section=test&shareid=59f51434-e97e-4b76-91bb-78ba913a676e&test=plaintext
This PR (but rebased on series/3.x to include the JVM polling): https://www.techempower.com/benchmarks/#section=test&shareid=8e948333-cd1d-4e35-b03b-50d91b9b1491&test=plaintext

tl;dr: 15% more reqs/sec on plaintext, and a small improvement (1-2%) on JSON.

armanbilge (Member) left a comment

Really nice stuff here 😃

djspiewak merged commit 55d3faf into typelevel:series/3.5.x on Jun 22, 2023

Successfully merging this pull request may close these issues.

Possible performance regression in 3.5.0