
Shared timers #3499

Merged · 34 commits · Apr 19, 2023
Conversation

@durban durban (Contributor) commented Mar 15, 2023

What is this

This PR implements timer stealing. Instead of each WorkerThread having a thread-local SleepersQueue (as in current series/3.x), with this PR each worker has a thread-safe concurrent skip list of sleepers (TimerSkipList, based on the JSR-166 ConcurrentSkipListMap). By default each worker uses its own skip list, but workers can steal from each other. The skip lists also simplify external timer insertion (we can just choose a skip list and insert the callback). Finally, a cancelled timer is now actually removed from the skip list (unlike in the current SleepersQueue).

Why not reuse java.util.concurrent.ConcurrentSkipListMap from the JDK? As far as I can tell, it has no "remove first if its key is smaller than x" operation, which is essential for executing the timers.
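
For illustration, here is a minimal sketch of the API shape this implies, using the insert/pollFirstIfTriggered names that appear later in this PR (the signatures are approximated, not the actual code):

```scala
// a sketch of the skip-list API this PR needs; signatures approximated
trait TimerSkipList {

  /**
   * Inserts `callback` to fire `delayNanos` after `now`; returns a
   * Runnable that cancels the timer (actually removing it from the list).
   */
  def insert(now: Long, delayNanos: Long, callback: Runnable): Runnable

  /**
   * Atomically removes and returns the first (earliest) callback, but
   * only if its trigger time is <= `now`; returns null otherwise. This
   * is the "remove first if its key is smaller than x" operation that
   * ConcurrentSkipListMap lacks.
   */
  def pollFirstIfTriggered(now: Long): Runnable
}
```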

Additional changes

The skip list has some JCStress tests added. They're run in CI on Cirrus (GitHub Actions is probably too slow and has too few cores for that).

I've also added another sleep microbenchmark which has some cancelled timers. (In theory, this PR is "better" at cleaning up on cancellation, but this doesn't show up, at least in these microbenchmarks; see below.)

Benchmarks

Microbenchmark results are... worse. The concurrent skip list obviously has some overhead. I don't really understand the SleepDrift changes.

7bbe009489f156a84142f2bf76aeb721df68e41f (series/3.x):

[info] Benchmark                 (size)   Mode  Cnt    Score    Error    Units
[info] SleepBenchmark.sleep       10000  thrpt   25  471.704 ± 56.688  ops/min
[info] SleepBenchmark.sleepRace   10000  thrpt   25  109.338 ±  1.736  ops/min

[info] running cats.effect.benchmarks.SleepDrift 
Warming up...
Measuring unloaded...
Unloaded overhead: 0.3028073013000001
Measuring heavily loaded...
Loaded overhead: 0.39144714067499997
Measuring unloaded 100x...
Unloaded overhead 100x: 0.16530405503333334
Measuring heavily loaded 100x...
Loaded overhead: 0.38024248699166674


c327ed6e45c3e2db9b4605377b03c8f2f1267756 (this PR):

[info] Benchmark                 (size)   Mode  Cnt    Score    Error    Units
[info] SleepBenchmark.sleep       10000  thrpt   25  405.332 ± 91.998  ops/min
[info] SleepBenchmark.sleepRace   10000  thrpt   25   76.371 ± 15.325  ops/min

[info] running cats.effect.benchmarks.SleepDrift 
Warming up...
Measuring unloaded...
Unloaded overhead: 0.30705119793333324
Measuring heavily loaded...
Loaded overhead: 0.3892570766333334
Measuring unloaded 100x...
Unloaded overhead 100x: 0.268112889225
Measuring heavily loaded 100x...
Loaded overhead: 0.362987559075

Comment on lines +195 to +196
val seqNo = {
val sn = sequenceNumber.getAndIncrement()
Member

Now that each thread has its own TimerSkipList, does that mean that insert() is called only by a single thread? In which case I suppose sequenceNumber can become an ordinary var?

Contributor Author

I think that's a good idea. Although, I want to experiment with adding a shared TimerSkipList (analogous to the "external" task queue). But even in that case, there could be a separate insert method, which obtains a seqNo thread-safely.

Contributor Author

Except that won't work, because ancient JVMs have no API for that...

Member

Contributor Author

Huh, thanks. I was thinking about VarHandle (which is JVM 9+), but this could work instead.

Contributor Author

So yeah, it seems the updater wants a volatile field. (This is how VarHandle is better.) So it won't help in this case.

Doing a volatile get followed by a volatile set is almost certainly much worse than just doing getAndIncrement.

Two different fields: meh... that seems like a convoluted workaround for a problem which already has a perfectly good solution.

@armanbilge armanbilge (Member) commented Mar 19, 2023

Doing a volatile get followed by a volatile set is almost certainly much worse than just doing getAndIncrement.

Is it? I'm no expert, but the logic is certainly simpler (no loop) and it doesn't involve an atomic operation, which I believe ARM doesn't even have primitive instructions for.

Edit: oh, and on x86 a volatile read is the same as an ordinary read.

@durban durban (Contributor Author) commented Mar 19, 2023

getAndIncrement on x86 is one CPU instruction (+ maybe a memory fence?); there should be no loop.

On ARM, there is probably a loop, so it's less clear there. (But it won't actually loop in the single-threaded case.)

About volatile reads being the same as plain ones (on x86): I found this; it seems they might not exactly be the same.
Edit: although in this specific case, they could be the same

Contributor Author

A simple benchmark on x86 showed that getAndIncrement is slightly faster than get + set. So it seems that my "much worse" assumption was too pessimistic; but it's still slightly worse. I don't know about ARM; I don't have a machine I could run benchmarks on.
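
For reference, a sketch of the two variants being compared (illustrative code, not from this PR; the get + set variant is safe only with a single writer):

```scala
import java.util.concurrent.atomic.AtomicLong

final class SeqNo {
  private[this] val sequenceNumber = new AtomicLong

  // one atomic read-modify-write; on x86 this JITs to a single `lock xadd`
  def nextGetAndIncrement(): Long =
    sequenceNumber.getAndIncrement()

  // volatile get followed by volatile set; correct only if exactly one
  // thread ever writes, otherwise increments can be lost
  def nextGetSet(): Long = {
    val sn = sequenceNumber.get()
    sequenceNumber.set(sn + 1)
    sn
  }
}
```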

Member

Thanks for trying that! Ok, good to know. I actually didn't realize getAndIncrement is a single instruction on x86, my bad 😇

@durban durban changed the title WIP: Shared timers Shared timers Mar 27, 2023
@durban durban marked this pull request as ready for review March 27, 2023 23:44
@djspiewak djspiewak (Member)

Heyo! Poking in quickly just to ACK this. Impressive and monumental work, @durban. I'm going to try to chip away at reviewing this over the next week or so.

@He-Pin He-Pin commented Mar 30, 2023

@durban So you want to reduce the latency of timers being triggered, right? This seems like it will help http4s and fs2 when large timeouts are set.

@durban durban (Contributor Author) commented Apr 1, 2023

@He-Pin I'm not sure this helps latency in the normal case (when workers can keep up with their timers). However, it should help in pathological cases, e.g., if a worker has an unusually long task (which should be avoided, but I'm sure it happens in practice), others can steal (and trigger) its timers. Cases when there are a lot of timeouts could possibly also be helped (again, when a worker can't keep up, but others have a chance to steal), but it's hard to say without actually trying.

@wjoel wjoel (Contributor) commented Apr 15, 2023

Fantastic work, well done!

It does better on TechEmpower Framework Benchmarks, at least.
Baseline: https://www.techempower.com/benchmarks/#section=test&shareid=3bb6d870-31bf-4149-8abe-75021cf91489
This PR: https://www.techempower.com/benchmarks/#section=test&shareid=eac550c8-352f-4534-b8b2-ad653ae99725

This was just one run on each, but it's 5-14% faster on most tests, 2-6% slower on a couple (could be noise). Notably it does even better than the baseline at higher levels of concurrency, which is great.

I wanted to try it together with the JVM polling PR by @armanbilge, but the rebasing was too much of a challenge for me (so far, anyway).

@durban durban (Contributor Author) commented Apr 17, 2023

@wjoel Thanks for trying that, it's good to hear.

it does even better than the baseline at higher levels of concurrency

That sounds good, as that's exactly when it should do better (in theory) :-)

djspiewak previously approved these changes Apr 18, 2023

@djspiewak djspiewak (Member) left a comment

Okay, this is amazing work. Also thank you, @wjoel, for checking the performance numbers on your rig!

I reviewed this relatively carefully, but what I haven't looked at are the following two things:

  • The exact detailed flow of the data structure
  • The precise modifications to the WorkerThread state machine

The former is heavily tested by JCStress, and it's a port of a widely-used implementation, so I'm inclined to trust it. The latter is more subtle, but we have all the same tests in play and we'll be going through an RC cycle, so I think we're going to be okay.

I have a couple minor comments but they can be addressed in follow-ups. I'm inclined to merge this as-is, but I'll leave it open for 24 hours in case anyone else has any follow-up concerns with merging it.

/**
* Computes the trigger time in an overflow-safe manner. The trigger time is essentially `now
* + delay`. However, we must constrain all trigger times in the skip list to be within
* `Long.MaxValue` of each other (otherwise there will be overflow when comparing in `cpr`).
Member

If someone sets a sleep for more than 292 years, that's on them.
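
The comparison trick itself (the reason for the Long.MaxValue constraint) works as in JSR-166's ScheduledThreadPoolExecutor: as long as two trigger times are within Long.MaxValue of each other, their difference cannot wrap, so its sign orders them correctly even when the absolute nanoTime values overflow. A sketch:

```scala
// overflow-safe ordering of trigger times; valid whenever
// |x - y| <= Long.MaxValue, which is exactly what the constraint in
// the doc comment above guarantees
def compareTriggerTimes(x: Long, y: Long): Int =
  java.lang.Long.signum(x - y)
```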

Comment on lines +235 to +236
while (cont) {
val cb = tsl.pollFirstIfTriggered(now)
Member

I wonder if a longer-term optimization could involve batched polling. In theory, skip lists should support a relatively efficient batch poll, though it would require some work to implement.

Another possibility is that we might not actually want to steal all expired timers from other workers. It might be better just to steal a single one. Needs some thought.

Contributor Author

Makes sense (both of those possibilities). In any case, this could be tweaked easily later. (Okay, batched polling might not be easy, but doable.)
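
For illustration, the "steal just one" idea could look something like this (a hypothetical sketch reusing the pollFirstIfTriggered shape from above, not code from this PR):

```scala
// steal and run at most one expired timer from another worker's list,
// then return to this worker's own work; returns whether a timer ran
def stealOneTimer(tsl: TimerSkipList, now: Long): Boolean = {
  val cb = tsl.pollFirstIfTriggered(now)
  if (cb ne null) {
    cb.run()
    true
  } else false
}
```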

}
}

private[this] val CancelSentinel: Runnable = () => ()
/**
* Chooses a random `TimerSkipList` from this pool, and inserts the `callback`.
Member

Is this actually better than execute(() => sleep(...))? Doing it randomly may put it on a worker which is overloaded, while doing it using the external work queue will bias towards threads which are making progress.

Contributor Author

I'm not sure. On the one hand, what you're saying makes sense. On the other hand, if it goes through the external queue, it will only be checked every 64 ticks, not on every tick. (Microbenchmark result changes seemed like just noise when I was making this change.)
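
The random-insertion strategy under discussion is roughly this (a sketch with assumed names, reusing the hypothetical TimerSkipList trait from earlier; `skipLists`, holding one list per worker, is an assumed field):

```scala
import java.util.concurrent.ThreadLocalRandom

// insert an externally submitted timer into a randomly chosen worker's
// skip list; any list works, since workers poll their own list every
// tick and can steal expired timers from each other
def sleepExternal(
    skipLists: Array[TimerSkipList],
    now: Long,
    delayNanos: Long,
    callback: Runnable
): Runnable = {
  val idx = ThreadLocalRandom.current().nextInt(skipLists.length)
  skipLists(idx).insert(now, delayNanos, callback)
}
```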

Member

it will be only checked every 64 ticks, not on every tick

Oh hang on, is the new timers implementation in this PR checking on every tick? I had thought this wasn't necessary, since 64 ticks was also the granularity for picking up tasks from the external scheduler anyway.

Contributor Author

Expired timers are checked on every tick since, it seems, #3432. I did not change that.

Member

Thanks, hmm, so it does. Sorry, I guess I forgot that. @djspiewak, had we talked about that change before?

In any case, the reason it caught my attention here is that now that timers are shared, doesn't that mean we have a read(/write) barrier on every iteration of the worker loop? I thought that was something we were trying to avoid; we'd even talked before about removing this volatile read.

Contributor Author

That is a good point. Without this PR, that timer check is just a few thread-local reads. With this PR, it's 1 or 2 volatile reads in the common case (i.e., no expired timer). When I wrote "the concurrent skip list obviously has some overhead", I was thinking about things like this (although not this specific case). The good thing is that those volatile reads should be "usually" "mostly" thread-local, but still... I wonder which benchmarks should be checked for this (I've only run the sleeping ones).

Checking expired timers less frequently could make sense, I think. But I'm not deeply familiar with #3432.

"Fun" fact: those reads don't need to be volatile (in the JSR-166 CSLM they aren't); they could be getAcquire...

Member

Yeah, so I don't think benchmarking against the old/current implementation is very useful, since it has the issue of not removing expired timers, and it seems like that was a significant performance drain.

So to measure the cost of this additional synchronization, I think we need an "intermediate" implementation that is thread-local but removes expired timers.

Also, the cost of those volatile reads might be much higher on ARM due to its weaker memory model. But I also don't know :)

@@ -1615,6 +1617,31 @@ class IOSpec extends BaseSpec with Discipline with IOPlatformSpecification {
affected must beTrue
}

"random racing sleeps" in real {
Member

Is this even a meaningful test on JavaScript/Native? Might it be easier just to push it into the platform specs, allowing us to remove iterations?

Contributor Author

No, you're right. Done in a72c13e.

@djspiewak djspiewak (Member)

I'm not sure this helps latency in the normal case (when workers can keep up with their timers)

@He-Pin @durban Actually, this should help significantly even in the normal case (relative to the baseline of 3.4.x) because we avoid a lot of context switching and thrashing in the kernel scheduler (from the extra timer thread). This PR basically represents all the benefits of the previous timers implementation while resolving all the problems it introduced around contention and accumulating garbage (as well as solving the stealing deficiency).
