EMA based statistically adaptive thread pool design #108
Conversation
I'm by no means qualified to review this code, but the results are looking very cool! Awesome work @vertexclique! 🎉
(force-pushed from 81837e7 to ecfce3a)
Thanks for continuing to boil this down! These are really nice results that address the main runaway-thread-growth issue in the naive original implementation. I'm working through some more correctness proofs on paper to try to find more ways to break the large number of Relaxed atomic accesses added in this change.
Thanks, Tyler. Very nice to hear that. If you are doing formal verification, please share it with me too.
Hehehe, the reason I'm using paper for a couple of things is that I'm a little uncertain about translating Relaxed stuff into TLA (maybe using sets of all reorderable code and permuting them?). Basically, anywhere a completely random value impacts runtime correctness is what I'm looking for and desk-checking backwards from.
More to improve (aka TO-DO):
If more synchronization needs to occur, it probably isn't a bad time to take a step back from the 100% atomics-based approach and either use ROWEX when an adaptation is required (spawning readers work lock-free; adapting writers need to claim an exclusive mutex to reduce the concurrent interleaving space) or just use a Mutex so that the implementation is obviously correct. Thinking about all of these atomic accesses is quite labor-intensive and error-prone. Mutexes are quite fast compared to anything that needs to do IO, and I doubt that using one would have any performance impact on the workloads this module is involved with.
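A minimal sketch of the ROWEX-style option described above, with invented names (`FREQUENCY`, `ADAPTING`, `record_spawn`, and `adapt_pool` are illustrative, not this PR's API): spawning readers stay lock-free on an atomic, while adapting writers must claim an exclusive mutex first.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

static FREQUENCY: AtomicU64 = AtomicU64::new(0);
static ADAPTING: Mutex<()> = Mutex::new(());

fn record_spawn() {
    // Reader path: lock-free, runs on every blocking-task spawn.
    FREQUENCY.fetch_add(1, Ordering::Relaxed);
}

fn adapt_pool() {
    // Writer path: exclusive, so at most one adaptation step runs at a
    // time, which shrinks the concurrent interleaving space to reason about.
    let _guard = ADAPTING.lock().unwrap();
    let sampled = FREQUENCY.swap(0, Ordering::Relaxed);
    // ... EMA math and spawn/retire decisions, now serialized, go here ...
    let _ = sampled;
}
```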
I think the todos can go into other PRs. As for Mutexes (if I understand you correctly, you are talking about the scaling-down part): they could be useful, but I don't think they should be used in this code, since there is no critical-section code here, and I don't like abusing lock exclusion in places where it isn't needed (though I am not fully against it). I am rooting for this PR to land as it is for the blocking pool right now; we can think about that later, during the extraction process (which will also apply to the async thread pool, I think).
I guess what I'm mostly getting at is that there are some potentially very subtle information flows that can happen when this math operates in an entirely concurrent way on global variables via Relaxed atomics (or even SeqCst atomics), and this is expensive overall to reason about as a human (it's also expensive/unreasonable for the formal verification tooling I know about to provide meaningful validations of).
There is almost never a real need for exclusion, but it makes PRs like this understandable without staring at them for hours :P
In general, please change all of the atomics to use barriers that prevent potentially risky reorderings. Note that in the original, relaxed atomics were used with a clear explanation of why randomness in their use path violated no correctness criteria. In this code, it's much, much more complex, and relaxed atomics make it quite difficult to reason about. Mutexes will make this very easy for me to review and merge, and will not impact performance, so I recommend that you replace all of the atomic stuff with them, but I totally get that it's not as fun!
Even after removing all Relaxed atomics, I have some concerns about the long-term maintenance costs of this code, due to the large amount of unserialized atomic access and math without clear documentation. Please document the math more clearly for people like me who are less familiar with it, and please document the atomic access patterns with explanations of the important correctness criteria at each dependency site, so that future contributors don't introduce bugs.

Since bugs have been found in review, it would also make me feel better if a concurrency test were added that exercises the interleavings more. You may think this is a bit draconian, but lock-free programming is not easy for humans to get right, and this code is not obviously correct to me yet. It needs a bit more work before it's safe from a long-term maintenance perspective.

The alternative is to wrap all atomic variables other than FREQUENCY in a Mutex so that their complex math is serialized. Please try this out if you're uninterested in adding concurrency tests. Mutex unlocks are less than 16ns, and as long as you're just doing this math on variables inside the Mutex, and not doing any IO or running a blocking function with the Mutex held, it will be really, really fast. So, please make this code obviously correct, either by using a simple Mutex around all of the variables other than FREQUENCY or by adding concurrent interleaving tests that will cause the assumptions we're making to pop out.
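A minimal sketch of that Mutex alternative; the `Stats` struct, the body of `scale_pool`, and the 0.5 smoothing factor are all invented here for illustration, not taken from the PR. Everything except FREQUENCY sits behind one lock, so the math is trivially serialized.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

// The hot spawn path only touches this atomic and never takes the lock.
static FREQUENCY: AtomicU64 = AtomicU64::new(0);

// Every other adaptation variable lives behind a single Mutex.
struct Stats {
    ema_frequency: f64,
    active_threads: usize,
}

static STATS: Mutex<Stats> = Mutex::new(Stats {
    ema_frequency: 0.0,
    active_threads: 0,
});

fn scale_pool() {
    // Only cheap arithmetic runs with the lock held (no IO, no blocking
    // calls), so the critical section stays in the tens of nanoseconds.
    let mut stats = STATS.lock().unwrap();
    let sample = FREQUENCY.swap(0, Ordering::SeqCst) as f64;
    stats.ema_frequency = 0.5 * sample + 0.5 * stats.ema_frequency;
    // ... decide from `stats` whether to spawn or retire threads ...
}
```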
r? @spacejam |
```rust
// on the request rate.
//
// It uses a frequency-based calculation to define work, utilizing the
// average processing rate.
fn scale_pool() {
```
In another context there was some discussion around adding tracing (using log's `trace!` macro) to async-std. This function seems like a good place for that, too.
`log::trace` or the new `tracing` library? Wondering about how much overhead each would add.
I just know that the idea was thrown around, and it was probably an opt-in feature, too. @stjepang or @yoshuawuyts would know more :)
`log` now has kv support, which we're already using in other places too. `tracing` can pick these up too, so using `log` is probably the best choice here overall (:
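A sketch of what that could look like in `scale_pool`, assuming `log`'s key-value support is enabled; the `key = value;` syntax shown here is gated behind log's unstable kv feature, and the message text is invented:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static FREQUENCY: AtomicU64 = AtomicU64::new(0);

fn scale_pool() {
    let frequency = FREQUENCY.load(Ordering::Relaxed);
    // Key-value pairs before the `;` require log's kv feature.
    log::trace!(frequency = frequency; "blocking pool scaling check");
    // ... adaptation logic ...
}
```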
I think we should get the `kv` support in `tracing-log` soon-ish? However, I don't believe that `log` supports the `span`-like macros, so async-std might need to have its own, internal `span`-like macro and dispatch conditionally to async-log or tracing. I haven't considered how the differences between async-log's `span` and `tracing`'s span can be bridged, but I think it's feasible if tracing's `span!().enter()` is used to mimic async-log's nested `span` block.
@davidbarsky we could use `async-log::span` for this, probably. Though there are still a few open issues on the repo, the interface wouldn't change. From convos with Eliza during RustConf there's definitely a desire to bridge the two approaches, and we mostly need to figure out how to go about it. async-rs/async-log#7 is what I've currently got (repo incoming soon), but I need to check how well that works (:
@yoshuawuyts Yep! I think we're on the same page about wanting to bridge the two libraries. I'm spitballing one mechanism for bridging that gap, which consists of using mutually exclusive feature flags in a `build.rs` to dispatch to either `async-log` or `tracing`. This would entail using an `async-std`-specific `span` macro until the gaps, if any exist, between `async-std`'s `span` and `tracing`'s `span` are closed. I'm personally excited to dig into rust-lang/log#353!
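A sketch of that dispatch idea. The feature names are invented, plain `#[cfg]` stands in for a `build.rs` probe, and the shape of `async_log::span!` (a message plus a block) is assumed from the discussion above:

```rust
#[cfg(feature = "spans-via-async-log")]
macro_rules! span {
    ($name:literal, $body:block) => {
        // Assumed shape of async-log's span macro, which wraps a block.
        async_log::span!($name, $body)
    };
}

#[cfg(feature = "spans-via-tracing")]
macro_rules! span {
    ($name:literal, $body:block) => {{
        // tracing's span!().enter() mimics the nested span block.
        let span = tracing::span!(tracing::Level::TRACE, $name);
        let _enter = span.enter();
        $body
    }};
}
```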
I'm sorry if what I communicated wasn't clear!
So, to sum up, all the changes that I pushed:
Thanks for writing these docs! I left some inline comments to clarify some of it. In general I think this is a really neat, concise (just 300 LOC!) implementation. With a bit more docs, this could be used as reference material to understand the paper and task scheduling, but I don't think we'll need to focus on that before landing this PR :)
```rust
// Should be less than 200_000 ns.
// The previous implementation will panic when this test is run.
assert!(elapsed < 200_000);
}
```
These tests are generally non-reproducible, and it's unclear whether they are valuable; they tend to fail on my fairly modern 2018 i7 MacBook Pro.
When I said concurrency tests before, I meant adding randomized delays to operations that involve cross-thread coordination via atomics (or higher-level primitives that invoke them internally). These might be OK as a benchmark, so you can compare results on your own machine before and after changes, but I don't think they are useful as tests.
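A minimal, self-contained example of that style of test. The invariant and the jitter source here are illustrative; a real test would target the pool's own coordination points and use a seeded RNG so that failing interleavings are reproducible:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Crude per-call jitter that widens the window for rare interleavings.
fn jitter() {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .subsec_nanos() as u64
        % 1_000;
    thread::sleep(Duration::from_nanos(nanos));
}

#[test]
fn frequency_updates_survive_interleaving() {
    let freq = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let freq = Arc::clone(&freq);
            thread::spawn(move || {
                for _ in 0..1_000 {
                    jitter();
                    freq.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // The invariant under test: no update is lost, whatever the interleaving.
    assert_eq!(freq.load(Ordering::Relaxed), 8 * 1_000);
}
```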
Please try running the tests with master's `blocking.rs`; they will panic. Also, please take a look at the referenced issue.
They panic on this branch on a fairly modern machine, so I'm not sure they are useful on machines other than yours. In general, timing-based performance tests are an antipattern, as they have big reproducibility issues. If you want to do this kind of testing, it's better to use something like cachegrind to count CPU instructions executed.
Because they don't actually assert on anything now (which was what caused them to fail on my laptop), I think they are better as benchmarks, right? I don't think it makes sense to have tests that can't fail.
> I don't think it makes sense to have tests that can't fail.

Please read my comments carefully. You can see what it tests in action: https://asciinema.org/a/R6b3UcFo0xfNo1sCPOPJrB04T
> When I said concurrency tests before, I meant adding randomized delays to operations that involve cross-thread coordination via atomics (or higher-level primitives that invoke them internally).

Internal, unexposed architecture shouldn't be tested in a whitebox fashion.
Where are the assertions that fail? The asciinema is too long to watch.

Possible concurrent interleavings are exposed architecture, and this kind of test has resulted in a huge number of bugs being discovered in various lock-free projects I've been involved in. Concurrency is really hard to get right sometimes, and the original PR, with more unserialized atomic accesses, would have had bugs jump out pretty quickly with this approach.
(force-pushed from 0377429 to e934a5e)
```rust
/// # Arguments
///
/// * `freq_queue` - Sliding window of frequency samples
#[inline]
```
In general, humans are bad at guessing whether something should be inlined. Usually in Rust it's better to leave these attributes out, unless some code is going to be executed in a super hot path and you've done the work to ensure the improvement is measurable.
This comment is not contributing to the overall discussion + review; I've found it quite a bit out of context, and also from the book. I've already made the benchmarks with and without, and decided to leave it inlined. You can see my latest run below, made again after your comment (a run shouldn't even be needed, because of how the compiler inlines and unrolls).

Mean of 10 consecutive runs of the `blocking` benchmark:

* 43057207.9 ns without inlining
* 41816626.1 ns with inlining

That is also indirectly better for users of this library.
I'm a bit skeptical that this change is related to the inline here, because this is a function that gets called 5 times per second, and that measurement skew is ~1ms. For a workload that lasts ~40ms, it's not clear to me how measuring things that happen on the order of once every 200ms is relevant.

I'm asking you to simplify your proposed code because we are going to have to keep it working over time, and it's important not to add bits that add complexity for humans without clear benefits.
Please track the progress of #181
This is an adaptive thread pool implementation based on exponential moving averages (EMA), which doesn't rely on idle time.
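For reference, the EMA update at the heart of such a design is a one-liner; the smoothing factor below is illustrative, not the value this PR uses:

```rust
// Illustrative smoothing factor; higher values weight recent samples more.
const ALPHA: f64 = 0.5;

/// Exponential moving average: the newest frequency sample dominates while
/// older history decays geometrically, so no idle-time tracking is needed.
fn ema(prev: f64, sample: f64) -> f64 {
    ALPHA * sample + (1.0 - ALPHA) * prev
}
```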
Here are the benchmarks:

* This new statistical EMA approach: (benchmark figure)
* Current approach: (benchmark figure)

This approach enables a 5.5x speedup.
On top of that, this PR solves #65 intrinsically, if only because it is not nice to manually include modules at the benchmark level.
Documentation