Shared timers #3499
Conversation
val seqNo = {
  val sn = sequenceNumber.getAndIncrement()
Now that each thread has its own `TimerSkipList`, does that mean that `insert()` is called only by a single thread? In which case I suppose `sequenceNumber` can become an ordinary `var`?
I think that's a good idea. Although, I want to experiment with adding a shared `TimerSkipList` (analogous to the "external" task queue). But even in that case, there could be a separate insert method, which obtains a `seqNo` thread-safely.
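The two insert paths being discussed could look roughly like this (a Java sketch with illustrative names, not the PR's actual API): the owning worker thread uses a plain counter, while a hypothetical thread-safe path for a shared list uses an `AtomicLong`.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: "insertOwner"/"insertShared" are made-up names for the two
// code paths discussed above, not methods from the PR.
final class TimerListSketch {
    // owner-thread-only counter: no synchronization needed if insert()
    // is only ever called by the owning worker thread
    private long seqNo = 0L;

    // shared counter for the hypothetical thread-safe insert path
    private final AtomicLong sharedSeqNo = new AtomicLong();

    long insertOwner() {    // called only by the owning worker thread
        return seqNo++;
    }

    long insertShared() {   // safe to call from any thread
        return sharedSeqNo.getAndIncrement();
    }
}
```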
Except that won't work, because ancient JVMs have no API for that...
You mean something like this?
https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/atomic/AtomicIntegerFieldUpdater.html
Huh, thanks. I was thinking about `VarHandle` (which is JVM 9+), but this could work instead.
So yeah, it seems the updater wants a volatile field. (This is how `VarHandle` is better.) So it won't help in this case.
Doing a volatile `get` followed by a volatile `set` is almost certainly much worse than just doing `getAndIncrement`.
Two different fields: meh... that seems like a convoluted workaround for a problem which already has a perfectly good solution.
> Doing a volatile `get` followed by a volatile `set` is almost certainly much worse than just doing `getAndIncrement`.
Is it? I'm no expert, but the logic is certainly simpler (no loop) and it doesn't involve an atomic operation, which I believe ARM doesn't even have primitive instructions for.
Edit: oh, and on x86 a volatile read is the same as an ordinary read.
`getAndIncrement` on x86 is one CPU instruction (+ maybe a memory fence?), there should be no loop. On ARM, there is probably a loop, so it's less clear there. (But it won't actually loop in the single-threaded case.)
About volatile reads being the same as plain ones (on x86): I found this; it seems they might not exactly be the same.
Edit: although in this specific case, they could be the same
A simple benchmark on x86 showed that `getAndIncrement` is slightly faster than `get` + `set`. So it seems that my "much worse" assumption was too pessimistic; but it's still slightly worse. I don't know about ARM, I don't have a machine I could run benchmarks on.
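For reference, the two variants that were benchmarked are sketched below (single-threaded, in Java). On x86 `getAndIncrement` compiles to a single `LOCK XADD`; the `get` + `set` pair is two volatile accesses and is not atomic, so it is only correct when a single thread writes.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of the two counter-increment strategies compared above.
final class CounterVariants {
    static long viaGetAndIncrement(AtomicLong c) {
        return c.getAndIncrement();   // atomic read-modify-write
    }

    static long viaGetThenSet(AtomicLong c) {
        long v = c.get();             // volatile read
        c.set(v + 1);                 // volatile write; racy with multiple writers
        return v;
    }
}
```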
Thanks for trying that! Ok, good to know. I actually didn't realize `getAndIncrement` is a single instruction on x86, my bad 😇
Heyo! Poking in quickly just to
@durban So you want to reduce the latency of timers being triggered, right? This seems like it will help http4s and fs2 with large timeouts set.
@He-Pin I'm not sure this helps latency in the normal case (when workers can keep up with their timers). However, it should help in pathological cases, e.g., if a worker has an unusually long task (which should be avoided, but I'm sure it happens in practice), others can steal (and trigger) its timers. Cases when there are a lot of timeouts could possibly also be helped (again, when a worker can't keep up, but others have a chance to steal), but it's hard to say without actually trying.
Fantastic work, well done! It does better on TechEmpower Framework Benchmarks, at least. This was just one run on each, but it's 5-14% faster on most tests, 2-6% slower on a couple (could be noise). Notably it does even better than the baseline at higher levels of concurrency, which is great. I wanted to try it together with the JVM polling PR by @armanbilge, but the rebasing was too much of a challenge for me (so far, anyway).
@wjoel Thanks for trying that, it's good to hear.
That sounds good, as that's exactly when it should do better (in theory) :-)
Okay, this is amazing work. Also thank you, @wjoel, for checking the performance numbers on your rig!
I reviewed this relatively carefully, but what I haven't looked at are the following two things:
- The exact detailed flow of the data structure
- The precise modifications to the `WorkerThread` state machine
The former is heavily tested by JCStress, and it's a port of a widely-used implementation, so I'm inclined to trust it. The latter is more subtle, but we have all the same tests in play and we'll be going through an RC cycle, so I think we're going to be okay.
I have a couple minor comments but they can be addressed in follow-ups. I'm inclined to merge this as-is, but I'll leave it open for 24 hours in case anyone else has any follow-up concerns with merging it.
/**
 * Computes the trigger time in an overflow-safe manner. The trigger time is essentially `now
 * + delay`. However, we must constrain all trigger times in the skip list to be within
 * `Long.MaxValue` of each other (otherwise there will be overflow when comparing in `cpr`).
If someone sets a `sleep` for more than 292 years, that's on them.
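The scheme the Javadoc above describes can be sketched as follows (a Java sketch under my reading of the comment; the actual PR code may differ): clamp the delay so that trigger times stay comparable by subtraction even across `Long` wraparound.

```java
// Illustrative sketch of overflow-safe trigger times. Clamping the delay to
// Long.MAX_VALUE / 2 ("292 years is on them") keeps trigger times close
// enough that the subtraction-based comparison tolerates wraparound.
final class TriggerTimeSketch {
    static long triggerTime(long now, long delay) {
        long d = Math.max(delay, 0L);          // negative delays mean "now"
        return now + Math.min(d, Long.MAX_VALUE >> 1);
    }

    // wraparound-tolerant comparison, in the spirit of the `cpr` mentioned
    // in the Javadoc above
    static int cpr(long a, long b) {
        return Long.compare(a - b, 0L);
    }
}
```

Note that `cpr` orders two trigger times correctly even when one of them has overflowed past `Long.MAX_VALUE`, as long as they are within `Long.MAX_VALUE` of each other.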
while (cont) {
  val cb = tsl.pollFirstIfTriggered(now)
I wonder if a longer-term optimization could involve batched polling. In theory, skip lists should support a relatively efficient batch poll, though it would require some work to implement.
Another possibility is that we might not actually want to steal all expired timers from other workers. It might be better just to steal a single one. Needs some thought.
Makes sense (both of those possibilities). In any case, this could be tweaked easily later. (Okay, batched polling might not be easy, but doable.)
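The "steal a single one" variant can be sketched on top of the single-element poll (Java sketch with an illustrative `TimerList` interface; a real skip-list batch poll would unlink a whole run of expired nodes in one traversal rather than looping like this):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: drains up to `max` expired timers via repeated single-element
// polls. `max = 1` corresponds to "steal just a single one".
final class BatchPollSketch {
    interface TimerList {
        Runnable pollFirstIfTriggered(long now); // null if nothing has expired
    }

    static List<Runnable> drainTriggered(TimerList tsl, long now, int max) {
        List<Runnable> batch = new ArrayList<>();
        Runnable cb;
        while (batch.size() < max && (cb = tsl.pollFirstIfTriggered(now)) != null) {
            batch.add(cb);
        }
        return batch;
    }
}
```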
}
}

private[this] val CancelSentinel: Runnable = () => ()

/**
 * Chooses a random `TimerSkipList` from this pool, and inserts the `callback`.
Is this actually better than `execute(() => sleep(...))`? Doing it randomly may put it on a worker which is overloaded, while doing it using the external work queue will bias towards threads which are making progress.
I'm not sure. On the one hand, what you're saying makes sense. On the other hand, if it goes through the external queue, it will only be checked every 64 ticks, not on every tick. (Microbenchmark result changes seemed like just noise when I was doing this change.)
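The strategy under discussion can be sketched like this (Java, illustrative names only — not the actual pool code): an external `sleep` picks a random worker's timer list and inserts directly, bypassing the external queue.

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of random-list external insertion. The caveat raised above: the
// random choice may land on an overloaded worker, whereas the external queue
// biases towards workers that are making progress.
final class ExternalSleepSketch {
    interface TimerList {
        void insert(long delayNanos, Runnable cb);
    }

    static int insertExternal(TimerList[] lists, long delayNanos, Runnable cb) {
        int idx = ThreadLocalRandom.current().nextInt(lists.length);
        lists[idx].insert(delayNanos, cb);
        return idx;
    }
}
```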
> it will be only checked every 64 ticks, not on every tick
Oh hang on, is the new timers implementation in this PR checking on every tick? I had thought this wasn't necessary, since 64 ticks was also the granularity for picking up tasks from the external scheduler anyway.
Expired timers are checked on every tick since, it seems, #3432. I did not change that.
Thanks, hmm, so it does. Sorry, I guess I forgot that. @djspiewak, had we talked about that change before?
In any case, the reason it caught my attention here is because now that timers are shared, doesn't that mean we have a read(/write) barrier on every iteration of the worker loop? I thought that was something we were trying to avoid; we'd even talked before about removing this volatile read.
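The cadence being discussed can be sketched as follows (Java, illustrative, not the actual `WorkerThread` loop): external work is polled every 64th tick, while expired timers are checked on every tick — which, with shared timer lists, is what implies a volatile read per iteration.

```java
// Sketch of the worker-loop tick schedule discussed above.
final class TickScheduleSketch {
    static boolean shouldCheckExternalQueue(long ticks) {
        return (ticks & 63) == 0;   // every 64th tick
    }

    // Timers are checked on every tick (since #3432), so over n ticks there
    // are n timer checks but only n / 64 external-queue checks.
    static long externalChecks(long n) {
        long count = 0;
        for (long t = 0; t < n; t++) {
            if (shouldCheckExternalQueue(t)) count++;
        }
        return count;
    }
}
```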
while (!done.get()) {
That is a good point. Without this PR, that timer check is just a few thread-local reads. With this PR, it's 1 or 2 volatile reads in the common case (i.e., no expired timer). When I wrote "the concurrent skip list obviously has some overhead", I was thinking about things like this (although not this specific case). The good thing is that those volatile reads should be "usually" "mostly" thread-local, but still... I wonder which benchmarks should be checked for this (I've only run the sleeping ones).
Checking expired timers less frequently could make sense, I think. But I'm not deeply familiar with #3432.
"Fun" fact: those reads don't need to be volatile (in the JSR-166 CSLM they aren't), they could be `getAcquire`...
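For illustration of the acquire-vs-volatile point: since Java 9 the `Atomic*` classes expose `getAcquire`/`setRelease`, which give weaker (and on weakly-ordered CPUs like ARM, potentially cheaper) ordering than a full volatile access. A minimal sketch (names are illustrative):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: an acquire-mode read instead of a full volatile read. Acquire
// ordering is sufficient for the reader to observe everything written
// before the matching release-mode write.
final class AcquireReadSketch {
    static long readHead(AtomicLong head) {
        return head.getAcquire();   // acquire semantics, not full volatile
    }

    static void writeHead(AtomicLong head, long v) {
        head.setRelease(v);         // release semantics
    }
}
```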
Yeah, so I don't think benchmarking against the old/current implementation is very useful, since it has the issue about not removing expired timers, and it seems like that was a significant performance drain.
So to measure the cost of this additional synchronization I think we need an "intermediate" implementation that is thread-local but removes expired timers.
Also, the cost of those volatile reads might be much higher on ARM due to its weaker memory model. But, I also don't know :)
@@ -1615,6 +1617,31 @@ class IOSpec extends BaseSpec with Discipline with IOPlatformSpecification {
  affected must beTrue
}

"random racing sleeps" in real {
Is this even a meaningful test on JavaScript/Native? Might it be easier just to push it into the platform specs, allowing us to remove `iterations`?
No, you're right. Done in a72c13e.
@He-Pin @durban Actually, this should help significantly even in the normal case (relative to the baseline of 3.4.x) because we avoid a lot of context switching and thrashing in the kernel scheduler (from the extra timer thread). This PR basically represents all the benefits of the previous timers implementation while resolving all the problems it introduced around contention and accumulating garbage (as well as solving the stealing deficiency).
What is this

This PR implements timer stealing. Instead of each `WorkerThread` having a thread-local `SleepersQueue` (as in current `series/3.x`), with this PR each worker has a thread-safe concurrent skip list of sleepers (`TimerSkipList`, based on the JSR-166 `ConcurrentSkipListMap`). By default each worker uses its own skip list, but they can steal from each other. The skip lists also enable simplifying external timer insertion (because we can just choose a skip list, and insert the callback). Finally, a cancelled timer is now really removed from the skip list (unlike currently from `SleepersQueue`).

Why not reuse `java.util.concurrent.ConcurrentSkipListMap` from the JDK?

As far as I can tell, it has no "remove first if its key is smaller than x" operation, which is essential for executing the timers.

Additional changes
The skip list has some JCStress tests added. They're run in the CI in Cirrus (GitHub Actions are probably too slow and have too few cores for that).
I've also added another sleep microbenchmark which has some cancelled timers. (In theory, this PR is "better" in cleaning up when cancelling, but this doesn't show, at least in these microbenchmarks; see below.)
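For reference, the "remove first if its key is smaller than x" operation that the description says the JDK map lacks can only be approximated on `ConcurrentSkipListMap` with a `firstEntry`/`remove` retry loop (Java sketch; a bespoke skip list can do this in one traversal):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Sketch: non-atomic emulation of "poll first if triggered" on the JDK map.
// The two-arg remove is needed because another thread may win the race and
// remove the entry first, in which case we retry.
final class PollFirstSketch {
    static Runnable pollFirstIfTriggered(ConcurrentSkipListMap<Long, Runnable> m, long now) {
        while (true) {
            Map.Entry<Long, Runnable> e = m.firstEntry();
            if (e == null || e.getKey() > now) {
                return null;                       // nothing has expired
            }
            if (m.remove(e.getKey(), e.getValue())) {
                return e.getValue();               // we won the race
            }
            // else: someone else removed it; retry
        }
    }
}
```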
Benchmarks
Microbenchmark results are... worse. The concurrent skip list obviously has some overhead. I don't really understand the `SleepDrift` changes.