EMA based statistically adaptive thread pool design #108
Conversation
I'm by no means qualified to review this code, but the results are looking very cool! Awesome work @vertexclique! 🎉
(force-pushed from 81837e7 to ecfce3a)
Thanks for continuing to boil this down! These are really nice results that address the main runaway-thread-growth issue in the naive original implementation. I'm working through some more correctness proofs on paper to try to find more ways to break the large number of Relaxed atomic accesses added in this change.
Thanks, Tyler. Very nice to hear that. If you are doing formal verification, please share it with me too.
Hehehe, the reason I'm using paper for a couple of things is that I'm a little uncertain about translating Relaxed stuff into TLA (maybe using sets of all reorderable code and permuting them?). Basically, anywhere a completely random value impacts runtime correctness is what I'm looking for and desk-checking backwards from.
More to improve (aka TO-DO):
If more synchronization needs to occur, it probably isn't a bad time to take a step back from the 100% atomics-based approach and either use ROWEX when an adaptation is required (spawning readers work lock-free; adapting writers need to claim an exclusive mutex to reduce the concurrent interleaving space) or just use a Mutex so that the implementation is obviously correct. Thinking about all of these atomic accesses is quite labor-intensive and error-prone. Mutexes are quite fast compared to anything that needs to do IO, and I doubt that using one would have any performance impact on the workloads this module is involved with.
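A minimal sketch of the ROWEX-style option described above, with invented names (`FREQUENCY`, `ADAPTING`, `record_spawn`, and `adapt_pool` are illustrative, not this PR's API): spawning readers stay lock-free on an atomic, while adapting writers must claim an exclusive mutex first.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

static FREQUENCY: AtomicU64 = AtomicU64::new(0);
static ADAPTING: Mutex<()> = Mutex::new(());

fn record_spawn() {
    // Reader path: lock-free, runs on every blocking-task spawn.
    FREQUENCY.fetch_add(1, Ordering::Relaxed);
}

fn adapt_pool() {
    // Writer path: exclusive, so at most one adaptation step runs at a
    // time, which shrinks the concurrent interleaving space to reason about.
    let _guard = ADAPTING.lock().unwrap();
    let sampled = FREQUENCY.swap(0, Ordering::Relaxed);
    // ... EMA math and spawn/retire decisions, now serialized, go here ...
    let _ = sampled;
}
```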
I think the todos can go into other PRs. As for Mutexes (if I understand you correctly, you are talking about the scaling-down part): they could be useful, but I don't think they should be used in this code, since there is no critical-section code here, and I don't like abusing lock exclusion in places where it isn't needed (though I am not fully against it). I am rooting for this PR to land as it is for the blocking pool right now; we can think about that later, during the extraction process (which will also apply to the async thread pool, I think).
I guess what I'm mostly getting at is that there are some potentially very subtle information flows that can happen when this math operates in an entirely concurrent way on global variables via Relaxed atomics (or even SeqCst atomics), and this is expensive overall to reason about as a human (it's also expensive/unreasonable for the formal verification tooling I know about to provide meaningful validations of).
There is almost never a real need for exclusion, but it makes PRs like this understandable without staring at them for hours :P
In general, please change all of the atomics to use barriers that prevent potentially risky reorderings. Note that in the original, relaxed atomics were used with a clear explanation of why randomness in their use path violated no correctness criteria. In this code, it's much, much more complex, and relaxed atomics make it quite difficult to reason about. Mutexes will make this very easy for me to review and merge, and will not impact performance, so I recommend that you replace all of the atomic stuff with them, but I totally get that it's not as fun!
Even after removing all Relaxed atomics, I have some concerns about the long-term maintenance costs of this code, due to the large amount of unserialized atomic access and math without clear documentation. Please document the math more clearly for people like me who are less familiar with it, and please document the atomic access patterns with explanations of the important correctness criteria at each dependency site, so that future contributors don't introduce bugs.

Since bugs have been found in review, it would also make me feel better if a concurrency test were added that exercises the interleavings more. You may think this is a bit draconian, but lock-free programming is not easy for humans to get right, and this code is not obviously correct to me yet. It needs a bit more work before it's safe from a long-term maintenance perspective.

The alternative is to wrap all atomic variables other than FREQUENCY in a Mutex so that their complex math is serialized. Please try this out if you're uninterested in adding concurrency tests. Mutex unlocks are less than 16ns, and as long as you're just doing this math on variables inside the Mutex, and not doing any IO or running a blocking function with the Mutex held, it will be really, really fast. So, please make this code obviously correct, either by using a simple Mutex around all of the variables other than FREQUENCY or by adding concurrent interleaving tests that will cause the assumptions we're making to pop out.
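A minimal sketch of that Mutex alternative; the `Stats` struct, the body of `scale_pool`, and the 0.5 smoothing factor are all invented here for illustration, not taken from the PR. Everything except FREQUENCY sits behind one lock, so the math is trivially serialized.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

// The hot spawn path only touches this atomic and never takes the lock.
static FREQUENCY: AtomicU64 = AtomicU64::new(0);

// Every other adaptation variable lives behind a single Mutex.
struct Stats {
    ema_frequency: f64,
    active_threads: usize,
}

static STATS: Mutex<Stats> = Mutex::new(Stats {
    ema_frequency: 0.0,
    active_threads: 0,
});

fn scale_pool() {
    // Only cheap arithmetic runs with the lock held (no IO, no blocking
    // calls), so the critical section stays in the tens of nanoseconds.
    let mut stats = STATS.lock().unwrap();
    let sample = FREQUENCY.swap(0, Ordering::SeqCst) as f64;
    stats.ema_frequency = 0.5 * sample + 0.5 * stats.ema_frequency;
    // ... decide from `stats` whether to spawn or retire threads ...
}
```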
r? @spacejam |
```rust
// on the request rate.
//
// It uses a frequency-based calculation to define work, utilizing the
// average processing rate.
fn scale_pool() {
```
In another context there was some discussion around adding tracing (using log's `trace!` macro) to async-std. This function seems like a good place for that, too.
`log::trace` or the new `tracing` library? Wondering about how much overhead each would add.
I just know that the idea was thrown around, and it was probably an opt-in feature, too. @stjepang or @yoshuawuyts would know more :)
`log` now has kv support, which we're already using in other places too. `tracing` can pick these up too, so using `log` is probably the best choice here overall (:
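A sketch of what that could look like in `scale_pool`, assuming `log`'s key-value support is enabled; the `key = value;` syntax shown here is gated behind log's unstable kv feature, and the message text is invented:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static FREQUENCY: AtomicU64 = AtomicU64::new(0);

fn scale_pool() {
    let frequency = FREQUENCY.load(Ordering::Relaxed);
    // Key-value pairs before the `;` require log's kv feature.
    log::trace!(frequency = frequency; "blocking pool scaling check");
    // ... adaptation logic ...
}
```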
I think we should get the `kv` support in `tracing-log` soon-ish? However, I don't believe that `log` supports the `span`-like macros, so async-std might need to have its own, internal `span`-like macro and dispatch conditionally to async-log or tracing. I haven't considered how the differences between async-log's `span` and `tracing`'s span can be bridged, but I think it's feasible if tracing's `span!().enter()` is used to mimic async-log's nested `span` block.
@davidbarsky we could use `async-log::span` for this, probably. Though there are still a few open issues on the repo, the interface wouldn't change. From convos with Eliza during RustConf there's definitely a desire to bridge the two approaches, and we mostly need to figure out how to go about it. async-rs/async-log#7 is what I've currently got (repo incoming soon), but I need to check how well that works (:
@yoshuawuyts Yep! I think we're on the same page about wanting to bridge the two libraries. I'm spitballing one mechanism for bridging that gap, which consists of using mutually exclusive feature flags in a `build.rs` to dispatch to either `async-log` or `tracing`. This would entail using an `async-std`-specific `span` macro until the gaps, if any exist, between `async-std`'s `span` and `tracing`'s `span` are closed. I'm personally excited to dig into rust-lang/log#353!
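A sketch of that dispatch idea. The feature names are invented, plain `#[cfg]` stands in for a `build.rs` probe, and the shape of `async_log::span!` (a message plus a block) is assumed from the discussion above:

```rust
#[cfg(feature = "spans-via-async-log")]
macro_rules! span {
    ($name:literal, $body:block) => {
        // Assumed shape of async-log's span macro, which wraps a block.
        async_log::span!($name, $body)
    };
}

#[cfg(feature = "spans-via-tracing")]
macro_rules! span {
    ($name:literal, $body:block) => {{
        // tracing's span!().enter() mimics the nested span block.
        let span = tracing::span!(tracing::Level::TRACE, $name);
        let _enter = span.enter();
        $body
    }};
}
```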
I'm sorry if what I communicated wasn't clear!
So, to sum up, all the changes that I pushed:
Thanks for writing these docs! I left some inline comments to clarify some of it. In general I think this is a really neat, concise (just 300 LOC!) implementation. With a bit more docs, this could be used as reference material to understand the paper and task scheduling, but I don't think we'll need to focus on that before landing this PR :)
```rust
// Should be less than 200_000 ns.
// The previous implementation will panic when this test is run.
assert!(elapsed < 200_000);
}
```
These tests are generally non-reproducible, and it's unclear whether they are valuable; they tend to fail on my fairly modern 2018 i7 MacBook Pro.
When I said concurrency tests before, I meant adding randomized delays to operations that involve cross-thread coordination via atomics (or higher-level primitives that invoke them internally). These might be OK as a benchmark, so you can compare results on your own machine before and after changes, but I don't think they are useful as tests.
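A minimal, self-contained example of that style of test. The invariant and the jitter source here are illustrative; a real test would target the pool's own coordination points and use a seeded RNG so that failing interleavings are reproducible:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Crude per-call jitter that widens the window for rare interleavings.
fn jitter() {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .subsec_nanos() as u64
        % 1_000;
    thread::sleep(Duration::from_nanos(nanos));
}

#[test]
fn frequency_updates_survive_interleaving() {
    let freq = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let freq = Arc::clone(&freq);
            thread::spawn(move || {
                for _ in 0..1_000 {
                    jitter();
                    freq.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // The invariant under test: no update is lost, whatever the interleaving.
    assert_eq!(freq.load(Ordering::Relaxed), 8 * 1_000);
}
```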
Please try running the tests with master's `blocking.rs`; they will panic. Also, please take a look at the referenced issue.
They panic on this branch on a fairly modern machine, so I'm not sure they are useful on machines other than yours. In general, timing-based performance tests are an antipattern, as they have big reproducibility issues. If you want to do this kind of testing, it's better to use something like cachegrind to count CPU instructions executed.
Because they don't actually assert on anything now (which was what caused them to fail on my laptop), I think they are better as benchmarks, right? I don't think it makes sense to have tests that can't fail.
> I don't think it makes sense to have tests that can't fail.

Please read my comments carefully. You can see what it tests in action: https://asciinema.org/a/R6b3UcFo0xfNo1sCPOPJrB04T
> When I said concurrency tests before, I meant adding randomized delays to operations that involve cross-thread coordination via atomics (or higher-level primitives that invoke them internally).

Internal, unexposed architecture shouldn't be tested in a whitebox fashion.
Where are the assertions that fail? The asciinema is too long to watch.

Possible concurrent interleavings are exposed architecture, and this kind of test has resulted in a huge number of bugs being discovered in various lock-free projects I've been involved in. Concurrency is really hard to get right sometimes, and the original PR, with more unserialized atomic accesses, would have had bugs jump out pretty quickly with this approach.
(force-pushed from 0377429 to e934a5e)
```rust
/// # Arguments
///
/// * `freq_queue` - Sliding window of frequency samples
#[inline]
```
In general, humans are bad at guessing whether something should be inlined. Usually in Rust it's better to leave these attributes out, unless some code is going to be executed in a super hot path and you've done the work to ensure the improvement is measurable.
This comment is not contributing to the overall discussion + review; I've found it quite a bit out of context, and also from the book. I've already made the benchmarks with and without, and decided to leave it inlined. You can see my latest run below, made again after your comment (a run shouldn't even be needed, because of how the compiler inlines and unrolls).

Mean of 10 consecutive runs of the `blocking` benchmark:

* 43057207.9 ns without inlining
* 41816626.1 ns with inlining

That is also indirectly better for users of this library.
I'm a bit skeptical that this change is related to the inline here, because this is a function that gets called 5 times per second, and that measurement skew is ~1ms. For a workload that lasts ~40ms, it's not clear to me how measuring things that happen on the order of once every 200ms is relevant.

I'm asking you to simplify your proposed code because we are going to have to keep it working over time, and it's important not to add bits that add complexity for humans without clear benefits.
Please track the progress of #181
This is an adaptive thread pool implementation based on exponential moving averages (EMA), which doesn't rely on idle time.
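For reference, the EMA update at the heart of such a design is a one-liner; the smoothing factor below is illustrative, not the value this PR uses:

```rust
// Illustrative smoothing factor; higher values weight recent samples more.
const ALPHA: f64 = 0.5;

/// Exponential moving average: the newest frequency sample dominates while
/// older history decays geometrically, so no idle-time tracking is needed.
fn ema(prev: f64, sample: f64) -> f64 {
    ALPHA * sample + (1.0 - ALPHA) * prev
}
```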
Here are the benchmarks:

* This new statistical EMA approach: (benchmark figure)
* Current approach: (benchmark figure)

This approach enables a 5.5x speedup.
On top of that, this PR solves #65 intrinsically, if only because it is not nice to manually include modules at the benchmark level.
Documentation