Compare with kagome's impl #14
I ran kagome's benchmarks on top of the optimisation in #24, without taking advantage of our SIMD optimisation. num_validators: 100 (the number of chunks)
In a nutshell, according to this benchmark, the C++ implementation is:
As you can see, in general, regardless of the implementation, decoding time is much more significant than encoding time. Currently, Parity's Polkadot implementation only does decoding for data larger than 128 KiB. However, optimisations such as systematic recovery will almost completely remove the need for the full decoding procedure at scale. Decoding from systematic chunks is 4000% faster than regular reconstruction. Still, we need to measure how significant this difference is in a real-world scenario, where we test the entire availability-recovery process. I'll post results soon.
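For context, here is a hedged sketch of why systematic recovery avoids the decoder almost entirely. The chunk layout and function names below are hypothetical, not the actual polkadot-erasure-coding API: with a systematic code, the first `k` chunks are the original payload split into pieces, so when exactly those chunks arrive, recovery is plain concatenation and the full Reed-Solomon reconstruction is only needed when some of them are missing.

```rust
// Hedged sketch of systematic recovery; the chunk layout and names are
// illustrative, not the actual polkadot-erasure-coding API.

/// Try to rebuild the original data from the first `k` (systematic) chunks.
/// Returns None if any systematic chunk is missing, in which case the caller
/// would fall back to full Reed-Solomon reconstruction.
fn systematic_recovery(chunks: &[Option<Vec<u8>>], k: usize, data_len: usize) -> Option<Vec<u8>> {
    let mut data = Vec::with_capacity(data_len);
    for chunk in chunks.iter().take(k) {
        // No field arithmetic here: systematic chunks are plain slices of the
        // original payload, so recovery is just concatenation.
        data.extend_from_slice(chunk.as_ref()?);
    }
    data.truncate(data_len); // drop any padding added when the data was chunked
    Some(data)
}

fn main() {
    // Toy example: 4 systematic chunks out of a larger total.
    let original = b"hello systematic recovery".to_vec();
    let k = 4;
    let chunk_len = (original.len() + k - 1) / k;
    let chunks: Vec<Option<Vec<u8>>> = original
        .chunks(chunk_len)
        .map(|c| Some(c.to_vec()))
        .collect();
    let recovered = systematic_recovery(&chunks, k, original.len()).unwrap();
    assert_eq!(recovered, original);
}
```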
Also, how does this relate to validator count? The target is 1000. It does not really matter if one implementation is faster at 100 validators if it is slower at the number we are actually interested in.
As suggested, I reran the synthetic benchmark with 1000 validators, on an Apple M2 Pro with 32 GiB RAM.
The C++ implementation is:
I reach the same conclusion: in this synthetic benchmark, the C++ implementation is 40-60% faster.
I also have the results of running subsystem-bench with the following scenarios:
Same system, Apple M2 Pro with 32 GiB RAM, without our SIMD optimisations.
Regular chunk recovery in perfect network conditions
Config:
Note that this simulates perfect network conditions (0 errors, 0 latency and enough bandwidth).
C++ results:
Rust results:
The C++ implementation consumes 50% less CPU when doing regular chunk recovery.
Systematic chunk recovery in perfect network conditions
Same configuration as regular recovery.
C++ results:
Rust results:
The C++ implementation consumes 32% less CPU when doing systematic chunk recovery.
Systematic chunk recovery with 1-100ms network latency
The configuration difference is that per-request latency is added as a random number between 1ms and 100ms.
C++ results:
Rust results:
The C++ implementation consumes 23% less CPU when doing systematic chunk recovery with added network latency.
Conclusion
kagome's performance advantage of 40-70% diminishes significantly once we implement systematic recovery and benchmark in realistic scenarios where network latency and errors exist. It could still be worthwhile to find the key differences between the implementations and understand why one is faster than the other. However, given these results and the work on systematic recovery, it's not obvious to me that availability-recovery will remain a bottleneck for scaling Polkadot. I'd like to also benchmark with our latest AVX implementation and see how that looks.
I ran roughly the same benchmarks with our Rust impl with AVX. Machine: GCP c2-standard-8.
In kagome's synthetic benchmark:
In subsystem-bench:
My conclusion is that when comparing systematic recovery with our AVX implementation, the C++ implementation consumes 23% less CPU in perfect network conditions. I expect this percentage to drop further in real network conditions with network latency and errors.
I've been comparing the two impls (Rust and C++) in search of differences. The two implementations are pretty much identical in terms of code.

I've run the benchmarks under Valgrind's massif and there's an immense difference in the number of allocations/deallocations and in total allocated memory; peak memory usage, however, is identical. For encoding 2.5 MiB of data, C++ allocates about 23.64 MiB in total. For the same procedure, the Rust impl allocates roughly three times as much: 61.91 MiB. Note that this is not peak usage, this is total usage.

After doing some micro-optimisations (mainly switching from iterators to for loops), I managed to get our impl down to only 5x more allocations and only 2x more allocated data than the C++ one. However, this only resulted in a ~8% performance increase compared to the previous Rust impl. Therefore, this can't be the only place where the inefficiency stems from.

I also compared cache misses (with cachegrind) and there is no significant difference that could justify the performance gap. I am starting to think that the C++ compiler is simply much better. Another interesting fact is that the Rust benchmark binary is 36 times as large as the C++ one (with debug symbols stripped).
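For illustration, here is a hedged sketch of the kind of iterator-to-for-loop micro-optimisation described above. The function names and shapes are hypothetical, not the actual erasure-coding code: the idea is to replace a call that collects into a fresh Vec on every invocation with a caller-provided buffer that is reused, removing per-call allocations.

```rust
// Hypothetical example of the iterator-to-loop change described above;
// names and shapes are illustrative, not the actual erasure-coding code.

// Before: each call collects into a brand-new Vec, so the allocator is hit
// once per invocation of the inner routine.
fn xor_chunks_collect(a: &[u8], b: &[u8]) -> Vec<u8> {
    a.iter().zip(b.iter()).map(|(x, y)| x ^ y).collect()
}

// After: the caller owns one buffer and the loop writes into it in place,
// so repeated calls perform no allocations at all.
fn xor_chunks_in_place(a: &[u8], b: &[u8], out: &mut [u8]) {
    debug_assert!(a.len() == b.len() && a.len() == out.len());
    for i in 0..out.len() {
        out[i] = a[i] ^ b[i];
    }
}

fn main() {
    let a = vec![0xAAu8; 1024];
    let b = vec![0x55u8; 1024];
    let mut out = vec![0u8; 1024];
    // Reuse `out` across many iterations instead of collecting a new Vec each time.
    for _ in 0..10 {
        xor_chunks_in_place(&a, &b, &mut out);
    }
    assert_eq!(out, xor_chunks_collect(&a, &b));
}
```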
Thanks, it's quite interesting that an order of magnitude more allocations barely shows up in the perf numbers.
8% is still a win, please publish the PR :)
Not sure how cachegrind works, but we should try this with Intel perf counters on a bare-metal machine.
I can't believe it really is 30% better 😄, we might still be missing something. I suggest also checking which CPU we are targeting when building the binary, maybe
Yes. I'll do that.
Yes, I'll do that too if I can get one.
Already did that, no difference.
I opened a PR with the optimisation that brings the 8% improvement: #31. Initially, I thought it may also be thanks to a change I had locally that cut allocations by 30% (apparently, that makes no difference). I also explored whether it's a memory alignment issue, considering that we operate on vectors of u8 but the inner algorithm expects vectors of u16. I tried using
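As a hedged illustration of the alignment question mentioned above (not the approach actually used in the PR), the standard library's `slice::align_to` can be used to check whether a `&[u8]` buffer can be viewed as `&[u16]` without copying; the function names and usage here are an assumption for demonstration only.

```rust
// Illustrative sketch only: one way to inspect whether a &[u8] buffer can be
// viewed as &[u16] without copying. `align_to` is unsafe because the middle
// slice reinterprets the raw bytes (native-endian); u16 has no invalid bit
// patterns, so this particular reinterpretation is sound.
fn as_u16_view(bytes: &[u8]) -> Option<&[u16]> {
    let (prefix, middle, suffix) = unsafe { bytes.align_to::<u16>() };
    // Only usable as a zero-copy view if the whole buffer is 2-byte aligned
    // and has an even length; otherwise the caller would need to copy.
    if prefix.is_empty() && suffix.is_empty() {
        Some(middle)
    } else {
        None
    }
}

fn main() {
    let data: Vec<u8> = (0u8..8).collect();
    match as_u16_view(&data) {
        Some(words) => println!("zero-copy u16 view: {words:?}"),
        None => println!("buffer not aligned for u16; a copy would be needed"),
    }
}
```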
I've run the two encode-only implementations with perf counters. For a reported duration difference of about 18% (C++ being the faster one):
I have some great news! I successfully root-caused the big performance difference between the two implementations. Looking at the generated assembly for the inner loops, the Rust version was emitting bounds checks on every access. After replacing the regular panicking indexing, it turns out that the Rust safety guarantees were indeed getting in the way. I think I can come up with an acceptable PR that relies on one-time up-front checks instead. Here are the numbers on a c2-standard-8, no AVX, with the synthetic benchmark:
I'd say they are within the noise threshold of each other. I'll run with AVX soon and post the results. I'm confident we'll be even faster. Moreover, when running the subsystem-bench on availability-recovery with no AVX, we're already consuming around 10% less CPU with the Rust implementation.
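As a hedged illustration of the bounds-check technique described above (the general pattern, not the actual code from the PR), a single up-front assertion on slice lengths before a hot loop lets the compiler prove that per-iteration indexing is in range and elide the per-access panic branches:

```rust
// Illustrative sketch of hoisting bounds checks out of a hot loop; not the
// actual code from the PR, just the general technique.
fn accumulate(dst: &mut [u8], src: &[u8]) {
    // A single up-front check: after this assert the compiler can prove that
    // every dst[i] / src[i] below is in bounds, so the per-iteration panic
    // branches can be optimised away.
    assert_eq!(dst.len(), src.len());
    for i in 0..dst.len() {
        dst[i] ^= src[i];
    }
}

fn main() {
    let mut dst = vec![0b1010_1010u8; 64];
    let src = vec![0b0101_0101u8; 64];
    accumulate(&mut dst, &src);
    assert!(dst.iter().all(|&b| b == 0xFF));
}
```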
Posted a PR for optimising/removing bounds checks on the non-AVX code path (the one present in production): #34. I'll post another PR for the AVX feature.
I added similar bounds-check changes to the AVX code paths, and it's actually a bit slower than the regular non-AVX encoding for large data.
I think we can close this now, as we discovered the underlying difference. There are still a couple of bounds checks that I didn't manage to optimise without
Isn't 10 MB our theoretical parachain block size? I think 50s sounds way slower than desired. Is this single-threaded? We researchers should re-review this code for performance at a theory level, but almost surely this comes from cache pressure. Can you tell me the performance if the cache is prepopulated? I.e. run it a couple of times and then run the timed instances without doing anything to clear the cache.
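For illustration, a hedged sketch of the warm-cache measurement being asked for here; the `encode` function is a hypothetical stand-in, not the project's actual benchmark harness. The pattern is simply: run the routine a few times untimed so code and data are resident in the caches, then time subsequent runs.

```rust
use std::time::Instant;

// Hypothetical stand-in for the real encoding routine; the point is the
// measurement pattern, not the work being measured.
fn encode(data: &[u8]) -> u64 {
    data.iter()
        .fold(0u64, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u64))
}

fn main() {
    let data = vec![0xABu8; 5 * 1024 * 1024]; // ~5 MiB payload

    // Warm-up: a couple of untimed iterations so the cache is prepopulated.
    for _ in 0..3 {
        std::hint::black_box(encode(&data));
    }

    // Timed runs with the cache already warm.
    let iterations: u32 = 10;
    let start = Instant::now();
    for _ in 0..iterations {
        std::hint::black_box(encode(&data));
    }
    println!("warm-cache average: {:?}", start.elapsed() / iterations);
}
```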
I think the absolute numbers in the C++ benchmark are not per iteration (so divide by 100). On my machine, encoding runs at around 50 MiB/s for 5 MiB of data, so around 100ms.
Awesome, thanks. We should still tease out how much we lose to cache pressure, since that impacts our real performance in the node, so maybe another 10x is hiding somewhere. Already 50 MiB/s sounds competitive with software AES on 10-year-old hardware, so that's pretty solid. AES has a lookup table, but only a very tiny one. I'd have expected decode to be faster since its AFFT has about half the size. It needs a setup phase, but this should amortise away for larger block sizes. I've maybe forgotten some inefficiency here, but it really is doing an AFFT of half the size.