
Shared timers #3499

Merged · 34 commits · Apr 19, 2023
Conversation

@durban durban (Contributor) commented Mar 15, 2023

What is this

This PR implements timer stealing. Instead of each WorkerThread having a thread-local SleepersQueue (as in current series/3.x), with this PR each worker has a thread-safe concurrent skip list of sleepers (TimerSkipList, based on the JSR-166 ConcurrentSkipListMap). By default each worker uses its own skip list, but workers can steal from each other. The skip lists also simplify external timer insertion (we can just choose a skip list and insert the callback). Finally, a cancelled timer is now actually removed from the skip list (unlike in the current SleepersQueue).

Why not reuse java.util.concurrent.ConcurrentSkipListMap from the JDK? As far as I can tell, it has no "remove first if its key is smaller than x" operation, which is essential for executing the timers.
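
For illustration, here is a minimal sketch of the API shape this implies, using the insert/pollFirstIfTriggered names that appear later in this PR (the signatures are approximated, not the actual code):

```scala
// a sketch of the skip-list API this PR needs; signatures approximated
trait TimerSkipList {

  /**
   * Inserts `callback` to fire `delayNanos` after `now`; returns a
   * Runnable that cancels the timer (actually removing it from the list).
   */
  def insert(now: Long, delayNanos: Long, callback: Runnable): Runnable

  /**
   * Atomically removes and returns the first (earliest) callback, but
   * only if its trigger time is <= `now`; returns null otherwise. This
   * is the "remove first if its key is smaller than x" operation that
   * ConcurrentSkipListMap lacks.
   */
  def pollFirstIfTriggered(now: Long): Runnable
}
```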

Additional changes

The skip list has some JCStress tests added. They're run in CI on Cirrus (GitHub Actions is probably too slow and has too few cores for that).

I've also added another sleep microbenchmark which has some cancelled timers. (In theory, this PR is "better" at cleaning up on cancellation, but this doesn't show up, at least in these microbenchmarks; see below.)

Benchmarks

Microbenchmark results are... worse. The concurrent skip list obviously has some overhead. I don't really understand the SleepDrift changes.

7bbe009489f156a84142f2bf76aeb721df68e41f (series/3.x):

[info] Benchmark                 (size)   Mode  Cnt    Score    Error    Units
[info] SleepBenchmark.sleep       10000  thrpt   25  471.704 ± 56.688  ops/min
[info] SleepBenchmark.sleepRace   10000  thrpt   25  109.338 ±  1.736  ops/min

[info] running cats.effect.benchmarks.SleepDrift 
Warming up...
Measuring unloaded...
Unloaded overhead: 0.3028073013000001
Measuring heavily loaded...
Loaded overhead: 0.39144714067499997
Measuring unloaded 100x...
Unloaded overhead 100x: 0.16530405503333334
Measuring heavily loaded 100x...
Loaded overhead: 0.38024248699166674


c327ed6e45c3e2db9b4605377b03c8f2f1267756 (this PR):

[info] Benchmark                 (size)   Mode  Cnt    Score    Error    Units
[info] SleepBenchmark.sleep       10000  thrpt   25  405.332 ± 91.998  ops/min
[info] SleepBenchmark.sleepRace   10000  thrpt   25   76.371 ± 15.325  ops/min

[info] running cats.effect.benchmarks.SleepDrift 
Warming up...
Measuring unloaded...
Unloaded overhead: 0.30705119793333324
Measuring heavily loaded...
Loaded overhead: 0.3892570766333334
Measuring unloaded 100x...
Unloaded overhead 100x: 0.268112889225
Measuring heavily loaded 100x...
Loaded overhead: 0.362987559075

Comment on lines +195 to +196
val seqNo = {
val sn = sequenceNumber.getAndIncrement()
Member

Now that each thread has its own TimerSkipList, does that mean that insert() is called only by a single thread? In which case I suppose sequenceNumber can become an ordinary var?

Contributor Author

I think that's a good idea. Although, I want to experiment with adding a shared TimerSkipList (analogous to the "external" task queue). But even in that case, there could be a separate insert method, which obtains a seqNo thread-safely.

Contributor Author

Except that won't work, because ancient JVMs have no API for that...

Member

Contributor Author

Huh, thanks. I was thinking about VarHandle (which is JVM 9+), but this could work instead.

Contributor Author

So yeah, it seems the updater wants a volatile field. (This is how VarHandle is better.) So it won't help in this case.

Doing a volatile get followed by a volatile set is almost certainly much worse than just doing getAndIncrement.

Two different fields: meh... that seems like a convoluted workaround for a problem which already has a perfectly good solution.

@armanbilge armanbilge (Member) commented Mar 19, 2023

Doing a volatile get followed by a volatile set is almost certainly much worse than just doing getAndIncrement.

Is it? I'm no expert, but the logic is certainly simpler (no loop) and it doesn't involve an atomic operation, which I believe ARM doesn't even have primitive instructions for.

Edit: oh, and on x86 a volatile read is the same as an ordinary read.

@durban durban (Contributor Author) commented Mar 19, 2023

getAndIncrement on x86 is one CPU instruction (+ maybe a memory fence?); there should be no loop.

On ARM, there is probably a loop, so it's less clear there. (But it won't actually loop in the single-threaded case.)

About volatile reads being the same as plain ones (on x86): I found this; it seems they might not exactly be the same.
Edit: although in this specific case, they could be the same

Contributor Author

A simple benchmark on x86 showed that getAndIncrement is slightly faster than get + set. So it seems that my "much worse" assumption was too pessimistic; but it's still slightly worse. I don't know about ARM; I don't have a machine I could run benchmarks on.
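
For reference, a sketch of the two variants being compared (illustrative code, not from this PR; the get + set variant is safe only with a single writer):

```scala
import java.util.concurrent.atomic.AtomicLong

final class SeqNo {
  private[this] val sequenceNumber = new AtomicLong

  // one atomic read-modify-write; on x86 this JITs to a single `lock xadd`
  def nextGetAndIncrement(): Long =
    sequenceNumber.getAndIncrement()

  // volatile get followed by volatile set; correct only if exactly one
  // thread ever writes, otherwise increments can be lost
  def nextGetSet(): Long = {
    val sn = sequenceNumber.get()
    sequenceNumber.set(sn + 1)
    sn
  }
}
```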

Member

Thanks for trying that! Ok, good to know. I actually didn't realize getAndIncrement is a single instruction on x86, my bad 😇

@durban durban changed the title WIP: Shared timers Shared timers Mar 27, 2023
@durban durban marked this pull request as ready for review March 27, 2023 23:44
@djspiewak djspiewak (Member)

Heyo! Poking in quickly just to ACK this. Impressive and monumental work, @durban. I'm going to try to chip away at reviewing this over the next week or so.

@He-Pin He-Pin commented Mar 30, 2023

@durban So you want to reduce the latency of timers being triggered, right? This seems like it will help http4s and fs2 when large timeouts are set.

@durban durban (Contributor Author) commented Apr 1, 2023

@He-Pin I'm not sure this helps latency in the normal case (when workers can keep up with their timers). However, it should help in pathological cases, e.g., if a worker has an unusually long task (which should be avoided, but I'm sure it happens in practice), others can steal (and trigger) its timers. Cases when there are a lot of timeouts could possibly also be helped (again, when a worker can't keep up, but others have a chance to steal), but it's hard to say without actually trying.

@wjoel wjoel (Contributor) commented Apr 15, 2023

Fantastic work, well done!

It does better on TechEmpower Framework Benchmarks, at least.
Baseline: https://www.techempower.com/benchmarks/#section=test&shareid=3bb6d870-31bf-4149-8abe-75021cf91489
This PR: https://www.techempower.com/benchmarks/#section=test&shareid=eac550c8-352f-4534-b8b2-ad653ae99725

This was just one run on each, but it's 5-14% faster on most tests, 2-6% slower on a couple (could be noise). Notably it does even better than the baseline at higher levels of concurrency, which is great.

I wanted to try it together with the JVM polling PR by @armanbilge, but the rebasing was too much of a challenge for me (so far, anyway).

@durban durban (Contributor Author) commented Apr 17, 2023

@wjoel Thanks for trying that, it's good to hear.

it does even better than the baseline at higher levels of concurrency

That sounds good, as that's exactly when it should do better (in theory) :-)

djspiewak previously approved these changes Apr 18, 2023

@djspiewak djspiewak (Member) left a comment

Okay, this is amazing work. Also thank you, @wjoel, for checking the performance numbers on your rig!

I reviewed this relatively carefully, but what I haven't looked at are the following two things:

  • The exact detailed flow of the data structure
  • The precise modifications to the WorkerThread state machine

The former is heavily tested by JCStress, and it's a port of a widely-used implementation, so I'm inclined to trust it. The latter is more subtle, but we have all the same tests in play and we'll be going through an RC cycle, so I think we're going to be okay.

I have a couple minor comments but they can be addressed in follow-ups. I'm inclined to merge this as-is, but I'll leave it open for 24 hours in case anyone else has any follow-up concerns with merging it.

/**
* Computes the trigger time in an overflow-safe manner. The trigger time is essentially `now
* + delay`. However, we must constrain all trigger times in the skip list to be within
* `Long.MaxValue` of each other (otherwise there will be overflow when comparing in `cpr`).
Member

If someone sets a sleep for more than 292 years, that's on them.
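
The comparison trick itself (the reason for the Long.MaxValue constraint) works as in JSR-166's ScheduledThreadPoolExecutor: as long as two trigger times are within Long.MaxValue of each other, their difference cannot wrap, so its sign orders them correctly even when the absolute nanoTime values overflow. A sketch:

```scala
// overflow-safe ordering of trigger times; valid whenever
// |x - y| <= Long.MaxValue, which is exactly what the constraint in
// the doc comment above guarantees
def compareTriggerTimes(x: Long, y: Long): Int =
  java.lang.Long.signum(x - y)
```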

Comment on lines +235 to +236
while (cont) {
val cb = tsl.pollFirstIfTriggered(now)
Member

I wonder if a longer-term optimization could involve batched polling. In theory, skip lists should support a relatively efficient batch poll, though it would require some work to implement.

Another possibility is that we might not actually want to steal all expired timers from other workers. It might be better just to steal a single one. Needs some thought.

Contributor Author

Makes sense (both of those possibilities). In any case, this could be tweaked easily later. (Okay, batched polling might not be easy, but doable.)
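
For illustration, the "steal just one" idea could look something like this (a hypothetical sketch reusing the pollFirstIfTriggered shape from above, not code from this PR):

```scala
// steal and run at most one expired timer from another worker's list,
// then return to this worker's own work; returns whether a timer ran
def stealOneTimer(tsl: TimerSkipList, now: Long): Boolean = {
  val cb = tsl.pollFirstIfTriggered(now)
  if (cb ne null) {
    cb.run()
    true
  } else false
}
```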

}
}

private[this] val CancelSentinel: Runnable = () => ()
/**
* Chooses a random `TimerSkipList` from this pool, and inserts the `callback`.
Member

Is this actually better than execute(() => sleep(...))? Doing it randomly may put it on a worker which is overloaded, while doing it using the external work queue will bias towards threads which are making progress.

Contributor Author

I'm not sure. On the one hand, what you're saying makes sense. On the other hand, if it goes through the external queue, it will only be checked every 64 ticks, not on every tick. (Microbenchmark result changes seemed like just noise when I was making this change.)
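
The random-insertion strategy under discussion is roughly this (a sketch with assumed names, reusing the hypothetical TimerSkipList trait from earlier; `skipLists`, holding one list per worker, is an assumed field):

```scala
import java.util.concurrent.ThreadLocalRandom

// insert an externally submitted timer into a randomly chosen worker's
// skip list; any list works, since workers poll their own list every
// tick and can steal expired timers from each other
def sleepExternal(
    skipLists: Array[TimerSkipList],
    now: Long,
    delayNanos: Long,
    callback: Runnable
): Runnable = {
  val idx = ThreadLocalRandom.current().nextInt(skipLists.length)
  skipLists(idx).insert(now, delayNanos, callback)
}
```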

Member

it will be only checked every 64 ticks, not on every tick

Oh hang on, is the new timers implementation in this PR checking on every tick? I had thought this wasn't necessary, since 64 ticks was also the granularity for picking up tasks from the external scheduler anyway.

Contributor Author

Expired timers are checked on every tick since, it seems, #3432. I did not change that.

Member

Thanks, hmm, so it does. Sorry, I guess I forgot that. @djspiewak, had we talked about that change before?

In any case, the reason it caught my attention here is that now that timers are shared, doesn't that mean we have a read(/write) barrier on every iteration of the worker loop? I thought that was something we were trying to avoid; we'd even talked before about removing this volatile read.

Contributor Author

That is a good point. Without this PR, that timer check is just a few thread-local reads. With this PR, it's 1 or 2 volatile reads in the common case (i.e., no expired timer). When I wrote "the concurrent skip list obviously has some overhead", I was thinking about things like this (although not this specific case). The good thing is that those volatile reads should be "usually" "mostly" thread-local, but still... I wonder which benchmarks should be checked for this (I've only run the sleeping ones).

Checking expired timers less frequently could make sense, I think. But I'm not deeply familiar with #3432.

"Fun" fact: those reads don't need to be volatile (in the JSR-166 CSLM they aren't); they could be getAcquire...

Member

Yeah, so I don't think benchmarking against the old/current implementation is very useful, since it has the issue of not removing expired timers, and it seems like that was a significant performance drain.

So to measure the cost of this additional synchronization, I think we need an "intermediate" implementation that is thread-local but removes expired timers.

Also, the cost of those volatile reads might be much higher on ARM due to its weaker memory model. But I also don't know :)

@@ -1615,6 +1617,31 @@ class IOSpec extends BaseSpec with Discipline with IOPlatformSpecification {
affected must beTrue
}

"random racing sleeps" in real {
Member

Is this even a meaningful test on JavaScript/Native? Might it be easier just to push it into the platform specs, allowing us to remove iterations?

Contributor Author

No, you're right. Done in a72c13e.

@djspiewak djspiewak (Member)

I'm not sure this helps latency in the normal case (when workers can keep up with their timers)

@He-Pin @durban Actually, this should help significantly even in the normal case (relative to the baseline of 3.4.x) because we avoid a lot of context switching and thrashing in the kernel scheduler (from the extra timer thread). This PR basically represents all the benefits of the previous timers implementation while resolving all the problems it introduced around contention and accumulating garbage (as well as solving the stealing deficiency).
