
Make use of state cache when building blocks for a parachain #9770

Closed
bkchr opened this issue Sep 13, 2021 · 20 comments · Fixed by #11407
Comments

bkchr (Member) commented Sep 13, 2021

When we build blocks for a parachain, we also need to collect a state proof. This means that building a parachain block triggers a different code path, one that currently doesn't support the state cache.

This code path is taken when building a parachain block:
https://github.com/paritytech/substrate/blob/master/client/service/src/client/call_executor.rs#L234-L260

And this particular call: https://github.com/paritytech/substrate/blob/master/client/service/src/client/call_executor.rs#L234 (as_trie_backend) prevents us from using the state cache.
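To illustrate the underlying constraint, here is a rough sketch (hypothetical types, not the actual Substrate API): while a proof is being recorded, every read has to walk the trie so the visited nodes can be collected, so a value-level state cache cannot simply answer the read.

```rust
// Hypothetical sketch of why proof recording bypasses a value-level state cache.
// None of these types are the real Substrate ones; they only illustrate the idea.
use std::collections::{HashMap, HashSet};

/// Stand-in for the database-backed trie: node hash -> encoded node.
struct TrieBackend {
    nodes: HashMap<[u8; 32], Vec<u8>>,
}

impl TrieBackend {
    /// Looks up a value while reporting every trie node that was touched.
    /// (The actual walk from the root node is elided to keep the sketch short.)
    fn get_recording(&self, _key: &[u8], recorder: &mut HashSet<[u8; 32]>) -> Option<Vec<u8>> {
        for hash in self.nodes.keys() {
            recorder.insert(*hash);
        }
        None
    }
}

/// Stand-in for the proving backend used while building a parachain block.
struct ProvingBackend<'a> {
    trie: &'a TrieBackend,
    recorded: HashSet<[u8; 32]>,
}

impl<'a> ProvingBackend<'a> {
    fn storage(&mut self, key: &[u8]) -> Option<Vec<u8>> {
        // Even if a value cache already knew `key -> value`, the proof still needs
        // every trie node on the path to that value, so the read must go through
        // the trie backend instead of being served from the cache.
        self.trie.get_recording(key, &mut self.recorded)
    }
}
```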

bkchr added the I9-optimisation label (an enhancement to provide better overall performance in terms of time-to-completion for a task) on Sep 13, 2021
cheme (Contributor) commented Sep 14, 2021

For caching accesses during block building, I think the easiest model would be a trie node cache.
The issue with the storage cache is that we need to access the trie nodes at least once anyway to build the proof.
A trie node cache is very close to a db cache; it could be a simple LRU put in front of the database state column.
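A minimal sketch of that idea, assuming a node cache keyed by node hash that sits in front of the database's state column (the lru crate and the Database trait below are illustrative stand-ins, not the actual Substrate types):

```rust
// Sketch: an LRU of encoded trie nodes placed in front of the state column.
use std::num::NonZeroUsize;

use lru::LruCache;

/// Illustrative handle to the backing database's state column.
trait Database {
    fn get_state_node(&self, hash: &[u8; 32]) -> Option<Vec<u8>>;
}

/// Trie node cache: node hash -> encoded node, with LRU eviction.
struct TrieNodeCache<D> {
    db: D,
    cache: LruCache<[u8; 32], Vec<u8>>,
}

impl<D: Database> TrieNodeCache<D> {
    fn new(db: D, capacity: usize) -> Self {
        Self {
            db,
            cache: LruCache::new(NonZeroUsize::new(capacity).expect("capacity must be non-zero")),
        }
    }

    /// Node lookups hit the LRU first and fall back to the database; the fetched
    /// node is inserted so later trie walks (e.g. while recording a proof) are served
    /// from memory.
    fn node(&mut self, hash: &[u8; 32]) -> Option<Vec<u8>> {
        if let Some(node) = self.cache.get(hash) {
            return Some(node.clone());
        }
        let node = self.db.get_state_node(hash)?;
        self.cache.put(*hash, node.clone());
        Some(node)
    }
}
```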

bkchr (Member, Author) commented Sep 14, 2021

Good idea! We should implement an early version and compare it against the state cache. Even if there is a small performance hit, I think we should switch to a trie-level cache. The point here is that parachains currently cannot use the state cache for block building, and the state cache is rather complicated.

Another bonus of a trie cache would be that it solves #9769 as well.

cheme (Contributor) commented Sep 14, 2021

Drafted such a cache in this branch: master...cheme:trie_state_cache

bkchr (Member, Author) commented Sep 16, 2021

@cheme do we have a benchmark for comparing state-cache vs trie-cache?

cheme (Contributor) commented Sep 16, 2021

In theory, on my branch, switching the values in https://github.com/cheme/substrate/blob/b23b1a11b17e480d8d4ade4c77f87ffddd1979a1/bin/node/testing/src/bench.rs#L381 should apply to the bench, but I am not familiar enough with the benches to say.
I also started to look at a Polkadot build, but just ended up fixing bugs and errors.
Just syncing shows nothing (the network is the limit), but maybe I should try to sync from a block export to get a rough idea.

bkchr (Member, Author) commented Sep 16, 2021

Can you try to run it and compare it against master?

cheme (Contributor) commented Sep 16, 2021

I will check syncing this afternoon, but it would be good if a bench specialist could point to the relevant benchmarks.

bkchr (Member, Author) commented Sep 16, 2021

@arkpar do you know which benchmark we could use for this?

arkpar (Member) commented Sep 16, 2021

There's no benchmark specifically for block building that I'm aware of. We could try replacing the storage cache with the trie cache and simply compare importing 100k blocks of an existing parachain with the import command.
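For example, assuming the standard Substrate export-blocks/import-blocks subcommands (exact flags can differ between node binaries), something along these lines, timed once per cache configuration:
parachain-node export-blocks --to 100000 blocks.bin
parachain-node import-blocks blocks.bin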

cheme (Contributor) commented Sep 16, 2021

I did run an import of the first 100k Polkadot blocks, but I did not really see a difference (on my PC it is about 3:45 min for every scenario, even with no cache at all). So I guess the first 100k blocks are a bit empty.
I also tried running
cargo run --release -p node-bench -- ::node::import::native::sr25519::transfer_keep_alive::paritydb::small --json
on a modified version that removes all access to the storage cache when its size is 0 (otherwise we keep accessing the local cache), and I obtain:

Trie only: [{"name":"Block import (RandomTransfersKeepAlive/small, Native, ParityDb backend)","raw_average":5474621,"average":5441007}]
Storage cache: [{"name":"Block import (RandomTransfersKeepAlive/small, Native, ParityDb backend)","raw_average":6145274,"average":6097683}]
No cache: [{"name":"Block import (RandomTransfersKeepAlive/small, Native, ParityDb backend)","raw_average":6502472,"average":6441670}]

But I ran the bench locally and there is a strong chance I may have interfered with it.
So it could make sense to run more of those node-bench runs for different settings:
9d5e011
6375499
b5e4111
bc98fa6
ab760ea
but it is quite time consuming.

bkchr (Member, Author) commented Oct 1, 2021

@cheme could you prepare a Polkadot branch? Then we can do multiple syncs: with the state cache, without any cache, and with the trie cache.

The state-cache issues are getting more and more pressing, so I would like to switch as fast as possible. However, if you are currently too overwhelmed by the trie migration stuff, I will find someone else.

cheme (Contributor) commented Oct 1, 2021

That's good, the trie stuff is mostly pending review, and I think it will mainly be updating the branch from my previous message.
Will be my next next task :)

bkchr (Member, Author) commented Oct 1, 2021

That's good, the trie stuff is mostly pending review, and I think it will mainly be updating the branch from my previous message. Will be my next next task :)

What is good? So should I ask someone else? This needs to be done ASAP.

cheme (Contributor) commented Oct 1, 2021

I mean I will prepare a branch (today).

cheme (Contributor) commented Oct 1, 2021

@bkchr this polkadot branch is usable: https://github.com/cheme/polkadot-1/tree/trie_state_cache
polkadot --trie-cache-size 0 --state-cache-size 67108864 to use the state cache.
polkadot --trie-cache-size 67108864 --state-cache-size 0 to use the trie cache (the one we want tested).

One possible cause of a performance regression could be the absence of the storage_hash cache (especially when hashing wasm). So testing with
polkadot --trie-cache-size 1 --state-cache-size 67108864 can also be interesting (the storage_hash cache is statically sized, and having a 1 byte trie cache will reactivate the state cache, so it will be active).

bkchr (Member, Author) commented Oct 31, 2021

I have written a benchmark:

Benchmarking State access/with 128MB state cache: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 81.8s.
State access/with 128MB state cache                                                                          
                        time:   [362.44 ms 393.31 ms 434.62 ms]

Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high severe
Benchmarking State access/with 128MB trie cache: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 57.4s.
State access/with 128MB trie cache                                                                          
                        time:   [5.4999 s 5.6672 s 5.8909 s]

Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Benchmarking State access/with cache disabled: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 67.7s.
State access/with cache disabled                                                                          
                        time:   [6.7507 s 6.8234 s 6.9544 s]

Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

I tried some optimizations in the trie code; however, I think we will never get to the same level as the state cache. There is just too much "happening" when using the trie. So, I will try to rewrite the state cache using overlays.
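A comparison like this can be expressed as a criterion benchmark group; the sketch below only shows the shape of such a harness with a hypothetical in-memory state (build_state and read_every_key are stand-ins, not the actual benchmark code):

```rust
// Sketch of a criterion benchmark group comparing state-access configurations.
use std::collections::HashMap;

use criterion::{criterion_group, criterion_main, Criterion};

/// Hypothetical stand-in for a state backend with a configurable cache size.
/// The real benchmark would open a database-backed trie state; a HashMap is
/// only used here to keep the sketch self-contained and runnable.
fn build_state(cache_size: usize) -> HashMap<Vec<u8>, Vec<u8>> {
    let _ = cache_size;
    (0u32..10_000)
        .map(|i| (i.to_le_bytes().to_vec(), vec![0u8; 64]))
        .collect()
}

/// Reads every key once, which is what the "State access" scenarios measure.
fn read_every_key(state: &HashMap<Vec<u8>, Vec<u8>>) {
    for i in 0u32..10_000 {
        let _ = state.get(i.to_le_bytes().as_slice());
    }
}

fn state_access(c: &mut Criterion) {
    let mut group = c.benchmark_group("State access");
    // Large states are slow to read, so keep the sample count small (as in the output above).
    group.sample_size(10);

    for (name, cache_size) in [
        ("with 128MB state cache", 128 * 1024 * 1024),
        ("with cache disabled", 0),
    ] {
        let state = build_state(cache_size);
        group.bench_function(name, |b| b.iter(|| read_every_key(&state)));
    }

    group.finish();
}

criterion_group!(benches, state_access);
criterion_main!(benches);
```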

bkchr (Member, Author) commented Nov 5, 2021

State access/with 128MB state cache                                                                          
                        time:   [319.20 ms 327.98 ms 344.54 ms]
                        change: [-11.082% -4.6915% +1.9686%] (p = 0.23 > 0.05)
                        No change in performance detected.
Found 3 outliers among 10 measurements (30.00%)
  1 (10.00%) low mild
  2 (20.00%) high severe
Benchmarking State access/with 128MB trie cache: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 44.6s.
State access/with 128MB trie cache                                                                          
                        time:   [805.19 ms 812.33 ms 822.12 ms]
                        change: [-86.385% -86.050% -85.707%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Benchmarking State access/with cache disabled: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 36.3s.
State access/with cache disabled                                                                          
                        time:   [815.38 ms 824.47 ms 834.81 ms]
                        change: [-91.668% -91.414% -91.220%] (p = 0.00 < 0.05)
                        Performance has improved.

That looks quite good! (Ignore that "no cache" is also faster; that configuration currently also uses this new cache.)

However, I think that is good enough to switch to it. We can maybe do some more optimizations in the future, but for now this should be good enough.

zqhxuyuan (Contributor) commented
From the timing data in the latest benchmark results, it seems the new state cache is more optimized than the trie cache, so is the final choice the new state cache?

bkchr (Member, Author) commented Nov 30, 2021

The state cache isn't new. This is the old/current implementation. I have some newer benchmarks and they look even better for the new implementation.

bkchr (Member, Author) commented Dec 11, 2021

Latest benchmark results:

Benchmarking State access multiple values/with state cache and reading each key once: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 8.8s, or reduce sample count to 10.
State access multiple values/with state cache and reading each key once                                                                          
                        time:   [404.71 ms 405.86 ms 407.10 ms]
                        change: [-2.5741% -1.8305% -1.1932%] (p = 0.00 < 0.05)
                        Performance has improved.
Benchmarking State access multiple values/with trie node cache and reading each key once: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 154.3s, or reduce sample count to 10.
State access multiple values/with trie node cache and reading each key once                                                                          
                        time:   [383.66 ms 393.08 ms 409.82 ms]
                        change: [-3.0754% +0.0165% +4.1128%] (p = 1.00 > 0.05)
                        No change in performance detected.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high severe
Benchmarking State access multiple values/with trie node cache (without fast cache) and reading each key once: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 136.0s, or reduce sample count to 10.
State access multiple values/with trie node cache (without fast cache) and reading each key once                                                                          
                        time:   [1.4344 s 1.4361 s 1.4380 s]
                        change: [-2.6102% -1.8309% -1.0828%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
Benchmarking State access multiple values/no cache and reading each key once: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 131.8s, or reduce sample count to 10.
State access multiple values/no cache and reading each key once                                                                          
                        time:   [6.5531 s 6.5810 s 6.6072 s]
                        change: [-6.2601% -5.3016% -4.3704%] (p = 0.00 < 0.05)
                        Performance has improved.
Benchmarking State access multiple values/with state cache and reading 4 times each key: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 32.2s, or reduce sample count to 10.
State access multiple values/with state cache and reading 4 times each key                                                                          
                        time:   [1.6155 s 1.6235 s 1.6306 s]
                        change: [-0.3501% +0.2337% +0.8051%] (p = 0.44 > 0.05)
                        No change in performance detected.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) low mild
Benchmarking State access multiple values/with trie node cache and reading 4 times each key: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 175.8s, or reduce sample count to 10.
State access multiple values/with trie node cache and reading 4 times each key                                                                          
                        time:   [1.5609 s 1.5674 s 1.5746 s]
                        change: [-26.240% -14.527% -2.5934%] (p = 0.04 < 0.05)
                        Performance has improved.
Benchmarking State access multiple values/with trie node cache (without fast cache) and reading 4 times each key: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 230.4s, or reduce sample count to 10.
State access multiple values/with trie node cache (without fast cache) and reading 4 times each key                                                                          
                        time:   [5.6453 s 5.6689 s 5.6943 s]
                        change: [-3.6856% -2.8017% -1.9841%] (p = 0.00 < 0.05)
                        Performance has improved.
Benchmarking State access multiple values/no cache and reading 4 times each key: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 531.6s, or reduce sample count to 10.
State access multiple values/no cache and reading 4 times each key                                                                          
                        time:   [26.184 s 26.290 s 26.391 s]
                        change: [-10.544% -6.9626% -4.6803%] (p = 0.00 < 0.05)
                        Performance has improved.

State access single value/with state cache and reading the key once                                                                             
                        time:   [459.22 ns 460.13 ns 460.91 ns]
                        change: [-4.2646% -3.5852% -2.8996%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) low severe
  3 (3.00%) low mild
State access single value/with trie node cache and reading the key once                                                                             
                        time:   [534.99 ns 537.11 ns 539.49 ns]
                        change: [-0.1012% +0.4546% +1.0906%] (p = 0.14 > 0.05)
                        No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) low severe
  1 (1.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe
State access single value/with trie node cache (without fast cache) and reading the key once                                                                             
                        time:   [999.81 ns 1.0024 us 1.0054 us]
                        change: [-1.9941% -1.5677% -1.1460%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high mild
  6 (6.00%) high severe
State access single value/no cache and reading the key once                                                                             
                        time:   [4.2824 us 4.3202 us 4.3615 us]
                        change: [-3.3785% -2.7279% -2.0472%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
State access single value/with state cache and reading 4 times the key                                                                             
                        time:   [825.88 ns 826.69 ns 827.37 ns]
                        change: [+1.8542% +2.2619% +2.6606%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  7 (7.00%) low mild
State access single value/with trie node cache and reading 4 times the key                                                                             
                        time:   [931.92 ns 934.56 ns 938.28 ns]
                        change: [+2.3531% +2.9439% +3.5274%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  7 (7.00%) low severe
  4 (4.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe
State access single value/with trie node cache (without fast cache) and reading 4 times the key                                                                             
                        time:   [2.7984 us 2.8077 us 2.8189 us]
                        change: [-5.2103% -4.9443% -4.6644%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
State access single value/no cache and reading 4 times the key                                                                             
                        time:   [15.856 us 15.962 us 16.075 us]
                        change: [-2.7102% -2.1355% -1.6064%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
