
Make use of state cache when building blocks for a parachain #9770

Closed
bkchr opened this issue Sep 13, 2021 · 20 comments · Fixed by #11407
Comments

bkchr (Member) commented Sep 13, 2021

When we build blocks for a parachain, we also need to collect a state proof. This means that building a parachain block triggers a different code path, one that currently doesn't support the state cache.

This code path is taken when building a parachain block:
https://github.com/paritytech/substrate/blob/master/client/service/src/client/call_executor.rs#L234-L260

And this particular call: https://github.com/paritytech/substrate/blob/master/client/service/src/client/call_executor.rs#L234 (as_trie_backend) prevents us from using the state cache.
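To illustrate the underlying constraint, here is a rough sketch (hypothetical types, not the actual Substrate API): while a proof is being recorded, every read has to walk the trie so the visited nodes can be collected, so a value-level state cache cannot simply answer the read.

```rust
// Hypothetical sketch of why proof recording bypasses a value-level state cache.
// None of these types are the real Substrate ones; they only illustrate the idea.
use std::collections::{HashMap, HashSet};

/// Stand-in for the database-backed trie: node hash -> encoded node.
struct TrieBackend {
    nodes: HashMap<[u8; 32], Vec<u8>>,
}

impl TrieBackend {
    /// Looks up a value while reporting every trie node that was touched.
    /// (The actual walk from the root node is elided to keep the sketch short.)
    fn get_recording(&self, _key: &[u8], recorder: &mut HashSet<[u8; 32]>) -> Option<Vec<u8>> {
        for hash in self.nodes.keys() {
            recorder.insert(*hash);
        }
        None
    }
}

/// Stand-in for the proving backend used while building a parachain block.
struct ProvingBackend<'a> {
    trie: &'a TrieBackend,
    recorded: HashSet<[u8; 32]>,
}

impl<'a> ProvingBackend<'a> {
    fn storage(&mut self, key: &[u8]) -> Option<Vec<u8>> {
        // Even if a value cache already knew `key -> value`, the proof still needs
        // every trie node on the path to that value, so the read must go through
        // the trie backend instead of being served from the cache.
        self.trie.get_recording(key, &mut self.recorded)
    }
}
```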

bkchr added the I9-optimisation label (an enhancement to provide better overall performance in terms of time-to-completion for a task) on Sep 13, 2021
cheme (Contributor) commented Sep 14, 2021

For caching accesses during block building, I think the easiest model would be a trie node cache.
The issue with the storage cache is that we need to access the trie nodes at least once anyway to build the proof.
A trie node cache is very close to a db cache; it could be a simple LRU put in front of the database state column.
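A minimal sketch of that idea, assuming a node cache keyed by node hash that sits in front of the database's state column (the lru crate and the Database trait below are illustrative stand-ins, not the actual Substrate types):

```rust
// Sketch: an LRU of encoded trie nodes placed in front of the state column.
use std::num::NonZeroUsize;

use lru::LruCache;

/// Illustrative handle to the backing database's state column.
trait Database {
    fn get_state_node(&self, hash: &[u8; 32]) -> Option<Vec<u8>>;
}

/// Trie node cache: node hash -> encoded node, with LRU eviction.
struct TrieNodeCache<D> {
    db: D,
    cache: LruCache<[u8; 32], Vec<u8>>,
}

impl<D: Database> TrieNodeCache<D> {
    fn new(db: D, capacity: usize) -> Self {
        Self {
            db,
            cache: LruCache::new(NonZeroUsize::new(capacity).expect("capacity must be non-zero")),
        }
    }

    /// Node lookups hit the LRU first and fall back to the database; the fetched
    /// node is inserted so later trie walks (e.g. while recording a proof) are served
    /// from memory.
    fn node(&mut self, hash: &[u8; 32]) -> Option<Vec<u8>> {
        if let Some(node) = self.cache.get(hash) {
            return Some(node.clone());
        }
        let node = self.db.get_state_node(hash)?;
        self.cache.put(*hash, node.clone());
        Some(node)
    }
}
```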

bkchr (Member, Author) commented Sep 14, 2021

Good idea! We should implement an early version and compare it against the state cache. Even if there is a small performance hit, I think we should switch to a trie-level cache. The point here is that parachains currently cannot use the state cache for block building, and the state cache is rather complicated.

Another bonus of a trie cache would be that it solves #9769 as well.

cheme (Contributor) commented Sep 14, 2021

Drafted such a cache in this branch: master...cheme:trie_state_cache

bkchr (Member, Author) commented Sep 16, 2021

@cheme do we have a benchmark for comparing state-cache vs trie-cache?

cheme (Contributor) commented Sep 16, 2021

In theory, on my branch, switching the values in https://github.com/cheme/substrate/blob/b23b1a11b17e480d8d4ade4c77f87ffddd1979a1/bin/node/testing/src/bench.rs#L381 should apply to the bench, but I am not familiar enough with the benches to say.
I also started to look at a Polkadot build, but just ended up fixing bugs and errors.
Just syncing shows nothing (the network is the limit), but maybe I should try to sync from a block export to get a rough idea.

bkchr (Member, Author) commented Sep 16, 2021

Can you try to run it and compare it against master?

cheme (Contributor) commented Sep 16, 2021

I will check syncing this afternoon, but it would be good if a bench specialist could point to the relevant benchmarks.

bkchr (Member, Author) commented Sep 16, 2021

@arkpar do you know which benchmark we could use for this?

arkpar (Member) commented Sep 16, 2021

There's no benchmark specifically for block building that I'm aware of. We could try replacing the storage cache with the trie cache and simply compare importing 100k blocks of an existing parachain with the import command.
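For example, assuming the standard Substrate export-blocks/import-blocks subcommands (exact flags can differ between node binaries), something along these lines, timed once per cache configuration:
parachain-node export-blocks --to 100000 blocks.bin
parachain-node import-blocks blocks.bin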

cheme (Contributor) commented Sep 16, 2021

I did run an import of the first 100k Polkadot blocks, but I did not really see a difference (on my PC it is about 3:45 min for every scenario, even with no cache at all). So I guess the first 100k blocks are a bit empty.
I also tried running
cargo run --release -p node-bench -- ::node::import::native::sr25519::transfer_keep_alive::paritydb::small --json
on a modified version that removes all access to the storage cache when its size is 0 (otherwise we keep accessing the local cache), and I obtain:

Trie only: [{"name":"Block import (RandomTransfersKeepAlive/small, Native, ParityDb backend)","raw_average":5474621,"average":5441007}]
Storage cache: [{"name":"Block import (RandomTransfersKeepAlive/small, Native, ParityDb backend)","raw_average":6145274,"average":6097683}]
No cache: [{"name":"Block import (RandomTransfersKeepAlive/small, Native, ParityDb backend)","raw_average":6502472,"average":6441670}]

But I ran the bench locally and there is a strong chance I may have interfered with it.
So it could make sense to run more of those node-bench runs for different settings:
9d5e011
6375499
b5e4111
bc98fa6
ab760ea
but it is quite time consuming.

bkchr (Member, Author) commented Oct 1, 2021

@cheme could you prepare a Polkadot branch? Then we can do multiple syncs: with the state cache, without any cache, and with the trie cache.

The state-cache issues are getting more and more pressing, so I would like to switch as fast as possible. However, if you are currently too overwhelmed by the trie migration stuff, I will find someone else.

cheme (Contributor) commented Oct 1, 2021

That's good, the trie stuff is mostly pending review, and I think it will mainly be updating the branch from my previous message.
Will be my next next task :)

bkchr (Member, Author) commented Oct 1, 2021

That's good, the trie stuff is mostly pending review, and I think it will mainly be updating the branch from my previous message. Will be my next next task :)

What is good? So should I ask someone else? This needs to be done ASAP.

cheme (Contributor) commented Oct 1, 2021

I mean I will prepare a branch (today).

cheme (Contributor) commented Oct 1, 2021

@bkchr this polkadot branch is usable: https://github.com/cheme/polkadot-1/tree/trie_state_cache
polkadot --trie-cache-size 0 --state-cache-size 67108864 to use the state cache.
polkadot --trie-cache-size 67108864 --state-cache-size 0 to use the trie cache (the one we want tested).

One possible cause of a performance regression could be the absence of the storage_hash cache (especially when hashing wasm). So testing with
polkadot --trie-cache-size 1 --state-cache-size 67108864 can also be interesting (the storage_hash cache is statically sized, and having a 1 byte trie cache will reactivate the state cache, so it will be active).

bkchr (Member, Author) commented Oct 31, 2021

I have written a benchmark:

Benchmarking State access/with 128MB state cache: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 81.8s.
State access/with 128MB state cache                                                                          
                        time:   [362.44 ms 393.31 ms 434.62 ms]

Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high severe
Benchmarking State access/with 128MB trie cache: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 57.4s.
State access/with 128MB trie cache                                                                          
                        time:   [5.4999 s 5.6672 s 5.8909 s]

Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Benchmarking State access/with cache disabled: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 67.7s.
State access/with cache disabled                                                                          
                        time:   [6.7507 s 6.8234 s 6.9544 s]

Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

I tried some optimizations in the trie code; however, I think we will never get to the same level as the state cache. There is just too much "happening" when using the trie. So, I will try to rewrite the state cache using overlays.
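A comparison like this can be expressed as a criterion benchmark group; the sketch below only shows the shape of such a harness with a hypothetical in-memory state (build_state and read_every_key are stand-ins, not the actual benchmark code):

```rust
// Sketch of a criterion benchmark group comparing state-access configurations.
use std::collections::HashMap;

use criterion::{criterion_group, criterion_main, Criterion};

/// Hypothetical stand-in for a state backend with a configurable cache size.
/// The real benchmark would open a database-backed trie state; a HashMap is
/// only used here to keep the sketch self-contained and runnable.
fn build_state(cache_size: usize) -> HashMap<Vec<u8>, Vec<u8>> {
    let _ = cache_size;
    (0u32..10_000)
        .map(|i| (i.to_le_bytes().to_vec(), vec![0u8; 64]))
        .collect()
}

/// Reads every key once, which is what the "State access" scenarios measure.
fn read_every_key(state: &HashMap<Vec<u8>, Vec<u8>>) {
    for i in 0u32..10_000 {
        let _ = state.get(i.to_le_bytes().as_slice());
    }
}

fn state_access(c: &mut Criterion) {
    let mut group = c.benchmark_group("State access");
    // Large states are slow to read, so keep the sample count small (as in the output above).
    group.sample_size(10);

    for (name, cache_size) in [
        ("with 128MB state cache", 128 * 1024 * 1024),
        ("with cache disabled", 0),
    ] {
        let state = build_state(cache_size);
        group.bench_function(name, |b| b.iter(|| read_every_key(&state)));
    }

    group.finish();
}

criterion_group!(benches, state_access);
criterion_main!(benches);
```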

bkchr (Member, Author) commented Nov 5, 2021

State access/with 128MB state cache                                                                          
                        time:   [319.20 ms 327.98 ms 344.54 ms]
                        change: [-11.082% -4.6915% +1.9686%] (p = 0.23 > 0.05)
                        No change in performance detected.
Found 3 outliers among 10 measurements (30.00%)
  1 (10.00%) low mild
  2 (20.00%) high severe
Benchmarking State access/with 128MB trie cache: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 44.6s.
State access/with 128MB trie cache                                                                          
                        time:   [805.19 ms 812.33 ms 822.12 ms]
                        change: [-86.385% -86.050% -85.707%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Benchmarking State access/with cache disabled: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 36.3s.
State access/with cache disabled                                                                          
                        time:   [815.38 ms 824.47 ms 834.81 ms]
                        change: [-91.668% -91.414% -91.220%] (p = 0.00 < 0.05)
                        Performance has improved.

That looks quite good! (Ignore that "no cache" is also faster; that configuration currently also uses this new cache.)

However, I think that is good enough to switch to it. We can maybe do some more optimizations in the future, but for now this should be good enough.

zqhxuyuan (Contributor) commented
From the timing data in the latest benchmark results, it seems the new state cache is more optimized than the trie cache, so is the final choice the new state cache?

bkchr (Member, Author) commented Nov 30, 2021

The state cache isn't new. This is the old/current implementation. I have some newer benchmarks and they look even better for the new implementation.

bkchr (Member, Author) commented Dec 11, 2021

Latest benchmark results:

Benchmarking State access multiple values/with state cache and reading each key once: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 8.8s, or reduce sample count to 10.
State access multiple values/with state cache and reading each key once                                                                          
                        time:   [404.71 ms 405.86 ms 407.10 ms]
                        change: [-2.5741% -1.8305% -1.1932%] (p = 0.00 < 0.05)
                        Performance has improved.
Benchmarking State access multiple values/with trie node cache and reading each key once: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 154.3s, or reduce sample count to 10.
State access multiple values/with trie node cache and reading each key once                                                                          
                        time:   [383.66 ms 393.08 ms 409.82 ms]
                        change: [-3.0754% +0.0165% +4.1128%] (p = 1.00 > 0.05)
                        No change in performance detected.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high severe
Benchmarking State access multiple values/with trie node cache (without fast cache) and reading each key once: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 136.0s, or reduce sample count to 10.
State access multiple values/with trie node cache (without fast cache) and reading each key once                                                                          
                        time:   [1.4344 s 1.4361 s 1.4380 s]
                        change: [-2.6102% -1.8309% -1.0828%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
Benchmarking State access multiple values/no cache and reading each key once: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 131.8s, or reduce sample count to 10.
State access multiple values/no cache and reading each key once                                                                          
                        time:   [6.5531 s 6.5810 s 6.6072 s]
                        change: [-6.2601% -5.3016% -4.3704%] (p = 0.00 < 0.05)
                        Performance has improved.
Benchmarking State access multiple values/with state cache and reading 4 times each key: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 32.2s, or reduce sample count to 10.
State access multiple values/with state cache and reading 4 times each key                                                                          
                        time:   [1.6155 s 1.6235 s 1.6306 s]
                        change: [-0.3501% +0.2337% +0.8051%] (p = 0.44 > 0.05)
                        No change in performance detected.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) low mild
Benchmarking State access multiple values/with trie node cache and reading 4 times each key: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 175.8s, or reduce sample count to 10.
State access multiple values/with trie node cache and reading 4 times each key                                                                          
                        time:   [1.5609 s 1.5674 s 1.5746 s]
                        change: [-26.240% -14.527% -2.5934%] (p = 0.04 < 0.05)
                        Performance has improved.
Benchmarking State access multiple values/with trie node cache (without fast cache) and reading 4 times each key: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 230.4s, or reduce sample count to 10.
State access multiple values/with trie node cache (without fast cache) and reading 4 times each key                                                                          
                        time:   [5.6453 s 5.6689 s 5.6943 s]
                        change: [-3.6856% -2.8017% -1.9841%] (p = 0.00 < 0.05)
                        Performance has improved.
Benchmarking State access multiple values/no cache and reading 4 times each key: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 531.6s, or reduce sample count to 10.
State access multiple values/no cache and reading 4 times each key                                                                          
                        time:   [26.184 s 26.290 s 26.391 s]
                        change: [-10.544% -6.9626% -4.6803%] (p = 0.00 < 0.05)
                        Performance has improved.

State access single value/with state cache and reading the key once                                                                             
                        time:   [459.22 ns 460.13 ns 460.91 ns]
                        change: [-4.2646% -3.5852% -2.8996%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) low severe
  3 (3.00%) low mild
State access single value/with trie node cache and reading the key once                                                                             
                        time:   [534.99 ns 537.11 ns 539.49 ns]
                        change: [-0.1012% +0.4546% +1.0906%] (p = 0.14 > 0.05)
                        No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) low severe
  1 (1.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe
State access single value/with trie node cache (without fast cache) and reading the key once                                                                             
                        time:   [999.81 ns 1.0024 us 1.0054 us]
                        change: [-1.9941% -1.5677% -1.1460%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high mild
  6 (6.00%) high severe
State access single value/no cache and reading the key once                                                                             
                        time:   [4.2824 us 4.3202 us 4.3615 us]
                        change: [-3.3785% -2.7279% -2.0472%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
State access single value/with state cache and reading 4 times the key                                                                             
                        time:   [825.88 ns 826.69 ns 827.37 ns]
                        change: [+1.8542% +2.2619% +2.6606%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  7 (7.00%) low mild
State access single value/with trie node cache and reading 4 times the key                                                                             
                        time:   [931.92 ns 934.56 ns 938.28 ns]
                        change: [+2.3531% +2.9439% +3.5274%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  7 (7.00%) low severe
  4 (4.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe
State access single value/with trie node cache (without fast cache) and reading 4 times the key                                                                             
                        time:   [2.7984 us 2.8077 us 2.8189 us]
                        change: [-5.2103% -4.9443% -4.6644%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
State access single value/no cache and reading 4 times the key                                                                             
                        time:   [15.856 us 15.962 us 16.075 us]
                        change: [-2.7102% -2.1355% -1.6064%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
