
Cache now in WorkerThread in states 4-63 #3690

Merged

Conversation

djspiewak (Member)

I still want to see numbers which demonstrate the impact of these sorts of changes in realistic scenarios, but here's a quick draft that takes inspiration from libuv and minimizes the nanoTime() syscalls. In particular, it caches the value of nanoTime() for states 4-63 (so, most of the time) and only refreshes it when stealing, when polling the external queue, or when the user evaluates monotonic. This does trade off timer granularity a bit, but it's still strictly better than what we had pre-3.5 (where timers would re-enter through the external queue).

Fixes #3677
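
A rough sketch of the caching idea described above (illustrative only: CachedClock, cachedNow, refreshNow, and the state threshold are placeholder names, not the actual WorkerThread internals):

    // Illustrative sketch only; not the real WorkerThread fields or state machine.
    final class CachedClock {
      // last observed value of System.nanoTime(), reused on the fast path
      private[this] var cachedNow: Long = System.nanoTime()

      // slow path: stealing, polling the external queue, or the user evaluating monotonic
      def refreshNow(): Long = {
        cachedNow = System.nanoTime()
        cachedNow
      }

      // fast path (states 4-63 in the description above): reuse the cached value, no syscall
      // (the `state < 4` threshold is a placeholder for "the slow-path states")
      def now(state: Int): Long =
        if (state < 4) refreshNow() else cachedNow
    }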

Comment on lines 706 to 710

    case _ =>
      // Call all of our expired timers:
      val now = System.nanoTime()
      var cont = true
      while (cont) {
        val cb = sleepers.pollFirstIfTriggered(now)
Member
The optimization discussed in #3544 (comment) would become even more meaningful after this change, if now may remain constant for several iterations. So crossing a read-barrier on every iteration is pointless 😅
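
(For illustration, one possible shape of the optimization being referenced, under the assumption that now stays constant across several passes; the parameter names below are hypothetical stand-ins, not the real SleepersQueue API.)

    // Illustration with hypothetical stand-ins: cache the earliest known trigger time
    // so that, while `now` has not reached it, we can skip the concurrent timer
    // structure (and its read-barrier) entirely on subsequent passes.
    final class TriggerCache {
      // earliest trigger time seen on the last queue check; MinValue = must re-check
      private[this] var cachedNextTrigger: Long = Long.MinValue

      def checkTimers(
          now: Long,
          pollFirstIfTriggered: Long => Runnable, // hypothetical stand-in
          peekFirstTriggerTime: () => Long // hypothetical stand-in; Long.MaxValue when empty
      ): Unit = {
        // real code would use overflow-safe subtraction for nanoTime comparisons
        if (cachedNextTrigger <= now) {
          // only on this path do we cross into the concurrent structure
          var cb = pollFirstIfTriggered(now)
          while (cb != null) {
            cb.run() // fire the expired timer
            cb = pollFirstIfTriggered(now)
          }
          cachedNextTrigger = peekFirstTriggerTime() // reuse on subsequent passes
        }
        // else: `now` has not advanced past the cached trigger time, so nothing
        // can have expired and no barrier crossing is needed
      }
    }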

djspiewak (Member, Author)

Yeah, I thought about that. It's not horribly complicated to do something like that here; it just introduces a bit more state and a bit more branching. I wanted to start small.

armanbilge linked an issue (Jun 14, 2023) that may be closed by this pull request
durban (Contributor) left a comment

I've left a comment; otherwise this looks reasonable. The hard part is (as you say) knowing what the effect of this is in real life...

djspiewak (Member, Author) commented Jun 14, 2023

I used a t2.large instance to measure three scenarios: series/3.4.x, series/3.5.x, and this PR (as of the current HEAD). I ran the WorkStealingThreadPool benchmarks (-wi 10 -i 10 -f 1) as well as the timer drift measurements (so we can measure the granularity impact both loaded and unloaded). The results are as follows; the TL;DR is at the bottom.
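
(For reference, the JMH runs above correspond to an sbt-jmh invocation along the lines of the following; the module name and class filter are assumptions and may differ in the repo.)

    sbt "benchmarks/Jmh/run -wi 10 -i 10 -f 1 .*WorkStealingBenchmark.*"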

series/3.4.x

Granularity

Warming up...
Measuring unloaded...
Unloaded overhead: 0.1542258647750001
Measuring heavily loaded...
Loaded overhead: 1.5743448329583334
Measuring unloaded 100x...
Unloaded overhead 100x: 0.11572495005833328
Measuring heavily loaded 100x...
Killed

Benchmarks

[info] Benchmark                                              (size)   Mode  Cnt    Score   Error    Units
[info] WorkStealingBenchmark.alloc                           1000000  thrpt   10    0.917 ± 0.003  ops/min
[info] WorkStealingBenchmark.manyThreadsSchedulingBenchmark  1000000  thrpt   10    3.862 ± 0.225  ops/min
[info] WorkStealingBenchmark.runnableScheduling              1000000  thrpt   10  249.878 ± 3.349  ops/min
[info] WorkStealingBenchmark.runnableSchedulingScalaGlobal   1000000  thrpt   10  227.841 ± 0.476  ops/min
[info] WorkStealingBenchmark.scheduling                      1000000  thrpt   10    4.423 ± 0.136  ops/min

series/3.5.x

Granularity

Warming up...
Measuring unloaded...
Unloaded overhead: 0.11346367834166671
Measuring heavily loaded...
Loaded overhead: 2.6429889117916665
Measuring unloaded 100x...
Unloaded overhead 100x: 0.1209033832166666
Measuring heavily loaded 100x...
Loaded overhead: 2.912850151016667

Benchmarks

[info] Benchmark                                              (size)   Mode  Cnt    Score   Error    Units
[info] WorkStealingBenchmark.alloc                           1000000  thrpt   10    0.858 ± 0.001  ops/min
[info] WorkStealingBenchmark.manyThreadsSchedulingBenchmark  1000000  thrpt   10    0.832 ± 0.013  ops/min
[info] WorkStealingBenchmark.runnableScheduling              1000000  thrpt   10   17.278 ± 0.180  ops/min
[info] WorkStealingBenchmark.runnableSchedulingScalaGlobal   1000000  thrpt   10  225.010 ± 0.537  ops/min
[info] WorkStealingBenchmark.scheduling                      1000000  thrpt   10    0.883 ± 0.006  ops/min

HEAD of this PR

Granularity

Warming up...
Measuring unloaded...
Unloaded overhead: 0.11598775258333327
Measuring heavily loaded...
Loaded overhead: 0.5139900563083333
Measuring unloaded 100x...
Unloaded overhead 100x: 0.12293377584999998
Measuring heavily loaded 100x...
Loaded overhead: 0.7911155032249999

Benchmarks

[info] Benchmark                                              (size)   Mode  Cnt    Score     Error    Units
[info] WorkStealingBenchmark.alloc                           1000000  thrpt   10    0.922 ±   0.002  ops/min
[info] WorkStealingBenchmark.manyThreadsSchedulingBenchmark  1000000  thrpt   10    1.923 ±   1.738  ops/min
[info] WorkStealingBenchmark.runnableScheduling              1000000  thrpt   10   64.816 ±   0.578  ops/min
[info] WorkStealingBenchmark.runnableSchedulingScalaGlobal   1000000  thrpt   10  135.160 ± 121.636  ops/min
[info] WorkStealingBenchmark.scheduling                      1000000  thrpt   10    2.322 ±   2.148  ops/min

TL;DR

This PR reduces scheduling overhead by around 2-3x in most fiber-heavy workloads, and by around 3.5-4x in pure Runnable workloads. It also improves timer granularity in heavily loaded scenarios by about 5x (which is quite a surprise, since we expected a regression here).

Additionally, we saw some interesting edge effects in some of the benchmark runs, most clearly seen in the WorkStealingBenchmark.runnableSchedulingScalaGlobal run on HEAD, which exhibits a vast standard deviation and a significantly lower mean. The fun thing is that this particular benchmark is entirely unaffected by the changes in any of these branches, since it simply benchmarks ExecutionContext.global and bypasses Cats Effect altogether! In other words, we're measuring contention within the AWS EC2 hypervisor layer. I'm almost entirely certain that we wouldn't see these kinds of effects on an EC2 metal instance, but I'm not paying for that just to validate a hypothesis. :-P

TL;DR: this PR improves everything pretty dramatically (at least as far as synthetic benchmarks go; the jury's still out on real-world workloads). We can make things even better, but I think we should absolutely merge and release this once I have addressed @durban's comment.

durban (Contributor) commented Jun 15, 2023

I think there is something wrong with CI: some Cirrus jobs haven't been scheduled(?) for more than 12 hours...

armanbilge (Member)

> I think there is something wrong with CI

Cirrus CI uses pre-emptible VMs to be cost-efficient, so under high demand machines may not be available.

djspiewak (Member, Author)

> So under high demand, machines may not be available.

This is one concern I have with expanding things to fs2, unless it's more of a global Cirrus demand issue and not GitHub org-specific?

armanbilge (Member) commented Jun 15, 2023

> unless it's more of a global Cirrus demand

It's this one, I'm afraid.

Edit: to clarify, besides global demand there are some other limits for OSS. But in FS2 I have only added four very short-running jobs; it's nowhere near comparable to the CE matrix.

wjoel (Contributor) commented Jun 17, 2023

Baseline (@armanbilge's 3.6 pre-release): https://www.techempower.com/benchmarks/#section=test&shareid=59f51434-e97e-4b76-91bb-78ba913a676e&test=plaintext
This PR (but rebased on series/3.x to include the JVM polling): https://www.techempower.com/benchmarks/#section=test&shareid=8e948333-cd1d-4e35-b03b-50d91b9b1491&test=plaintext

tl;dr: 15% more reqs/sec on plaintext, and a small improvement (1-2%) on JSON.

armanbilge (Member) left a comment

Really nice stuff here 😃

djspiewak merged commit 55d3faf into typelevel:series/3.5.x on Jun 22, 2023

Successfully merging this pull request may close these issues.

Possible performance regression in 3.5.0