Enable in-memory trie when state sync #10820
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10820      +/-   ##
==========================================
- Coverage   71.65%   71.52%    -0.13%
==========================================
  Files         758      759        +1
  Lines      151950   151585      -365
  Branches   151950   151585      -365
==========================================
- Hits       108880   108428      -452
- Misses      38533    38665      +132
+ Partials     4537     4492       -45
Force-pushed from 60650ef to 60b6f2f.
@staffik It's marked as draft; please open it when it's ready for review.
Force-pushed from 128a3ea to c4f0908.
Force-pushed from c4f0908 to 5709629.
@VanBarbascu Can you have a look as well?
Force-pushed from d11728c to 3ab877d.
Nice! Left a couple of comments
Can you make sure the following cases are covered, and can you add some integration or nayduck tests for them?
- node needs to state sync after being offline for too long
- node needs to state sync to get the shard tracked next epoch
- node is restarted in the middle of state sync
- node is restarted during catchup
- node is restarted after state sync
Can you also add some debug logs and metrics for loading and unloading? This has the potential to come under scrutiny in the future :)
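Not from this PR, just a minimal sketch of what such instrumentation could look like, assuming the `tracing` and `prometheus` crates; the metric name, labels, and the wrapper functions are made up for illustration and are not nearcore's actual metrics:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_gauge_vec, IntGaugeVec};

// Hypothetical metric: whether an in-memory trie is currently loaded, per shard.
static MEM_TRIE_LOADED: Lazy<IntGaugeVec> = Lazy::new(|| {
    register_int_gauge_vec!(
        "near_mem_trie_loaded",
        "Whether an in-memory trie is loaded for a shard (1 = loaded, 0 = not loaded)",
        &["shard_uid"]
    )
    .unwrap()
});

// Illustrative hooks around the (assumed) load/unload entry points.
fn on_mem_trie_loaded(shard_uid: &str, load_duration_secs: f64) {
    tracing::debug!(target: "memtrie", %shard_uid, load_duration_secs, "loaded in-memory trie");
    MEM_TRIE_LOADED.with_label_values(&[shard_uid]).set(1);
}

fn on_mem_trie_unloaded(shard_uid: &str) {
    tracing::debug!(target: "memtrie", %shard_uid, "unloaded in-memory trie");
    MEM_TRIE_LOADED.with_label_values(&[shard_uid]).set(0);
}

fn main() {
    on_mem_trie_loaded("s3.v1", 42.0);
    on_mem_trie_unloaded("s3.v1");
}
```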
Force-pushed from 6eac8cd to c256de7.
Force-pushed from ebcf133 to 0fe2430.
@@ -2770,6 +2783,8 @@ impl Chain {
        );
        store_update.commit()?;
        flat_storage_manager.create_flat_storage_for_shard(shard_uid).unwrap();
        // Flat storage is ready, load memtrie if it is enabled.
Is this going to execute synchronously? If so, does this happen only during startup?
If it can happen outside of startup as well, is there a way to run this in a different thread? I don't think it is acceptable to pause the chain for 2 minutes.
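For illustration only, a minimal sketch of pushing a long-running load onto a background thread with `std::thread` and an mpsc channel so the main loop keeps running; `load_mem_trie_for_shard` here is a stand-in, not the actual nearcore API:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

// Stand-in for the expensive memtrie load; in the real code this would read
// flat storage / trie nodes for the shard and can take minutes.
fn load_mem_trie_for_shard(shard_uid: u64) {
    let _ = shard_uid;
}

fn main() {
    let (done_tx, done_rx) = mpsc::channel();
    let shard_uid = 3u64;

    // Spawn the load on a separate thread so the chain can keep processing blocks.
    thread::spawn(move || {
        let start = Instant::now();
        load_mem_trie_for_shard(shard_uid);
        // Notify the main loop that the memtrie is ready.
        let _ = done_tx.send((shard_uid, start.elapsed()));
    });

    // The main loop polls for completion instead of blocking on the load.
    loop {
        match done_rx.try_recv() {
            Ok((shard, took)) => {
                println!("memtrie for shard {shard} loaded in {took:?}");
                break;
            }
            Err(mpsc::TryRecvError::Empty) => {
                // ... keep processing blocks here ...
                thread::sleep(Duration::from_millis(100));
            }
            Err(mpsc::TryRecvError::Disconnected) => break,
        }
    }
}
```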
            .iter()
            .map(|id| self.epoch_manager.shard_id_to_uid(*id, &epoch_id).unwrap())
            .collect();
        self.runtime_adapter.retain_mem_tries(&shard_uids);
If for this epoch we track shard 1, and for the next epoch we track 2, we would be starting a state sync for shard 2, but here we would still retain [1, 2], right? So that wouldn't actually unload shard 2?
Yes
What I'm saying is that we should unload shard 2.
That can happen if shard 2 was tracked in the previous epoch, is not tracked in this epoch, and is tracked in the next epoch. Then indeed shard 2 will not be unloaded by retain_mem_tries. That's why we have unload_mem_trie, which unloads shard 2 before we apply state parts for shard 2 during state sync in this epoch.
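A toy sketch of the two-step policy described above, with made-up function names and plain `u64` shard ids standing in for `ShardUId`:

```rust
use std::collections::HashSet;

// At the epoch switch we keep memtries for shards tracked in this or the next
// epoch and drop everything else; this mirrors the retain_mem_tries call.
fn shards_to_retain(this_epoch: &HashSet<u64>, next_epoch: &HashSet<u64>) -> HashSet<u64> {
    this_epoch.union(next_epoch).copied().collect()
}

// Before applying state parts for a shard we unconditionally unload its memtrie,
// mirroring unload_mem_trie. This covers the corner case above: a shard tracked in
// epoch T-1 and T+1 but not in T survives the retain step, yet its loaded memtrie
// is stale once the shard's state is re-downloaded, so it must be dropped first.
fn shards_to_unload_before_state_sync(state_sync_shards: &HashSet<u64>) -> HashSet<u64> {
    state_sync_shards.clone()
}

fn main() {
    // Epoch T: shard 2 was tracked in T-1, is not tracked in T, and is tracked in T+1.
    let tracked_this_epoch: HashSet<u64> = [1].into_iter().collect();
    let tracked_next_epoch: HashSet<u64> = [2].into_iter().collect();
    let state_sync_shards: HashSet<u64> = [2].into_iter().collect();

    // retain keeps {1, 2}, so shard 2 stays loaded even though its state is stale...
    assert_eq!(
        shards_to_retain(&tracked_this_epoch, &tracked_next_epoch),
        [1, 2].into_iter().collect::<HashSet<u64>>()
    );
    // ...which is why it is explicitly unloaded right before state parts are applied.
    assert_eq!(
        shards_to_unload_before_state_sync(&state_sync_shards),
        [2].into_iter().collect::<HashSet<u64>>()
    );
}
```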
@@ -81,6 +81,9 @@ impl SyncJobsActions {
    }

    pub fn handle_apply_state_parts_request(&mut self, msg: ApplyStatePartsRequest) {
        // Unload mem-trie (in case it is still loaded) before we apply state parts.
        msg.runtime_adapter.unload_mem_trie(&msg.shard_uid);
How does this interact with the retain_mem_tries earlier? Do I understand this correctly:
- If in epoch T-1 we track 1, in epoch T we track 2, and in epoch T+1 we track 3, then in epoch T, we retain [2, 3] (unload 1), and here in state sync we unload 3 (in case it's still loaded)
- If in epoch T-1 we track 1, in epoch T we track 2, and in epoch T+1 we track 1, then in epoch T, we retain [2, 1] (unload nothing), and here in state sync we unload 1
If that understanding is correct, would it work instead to just call retain_mem_tries using only the "this epoch cares" shards, and then omit the unload_mem_trie call here? Or is there some issue with that?
Your understanding is correct. We need unload_mem_trie for the situation described in #10820 (comment).
**Context**

Issue: #10982
Follow up to: #10820.
Modifies the StateSync state machine so that the memtrie load happens asynchronously on catchup.

**Summary**

* Split `chain.set_state_finalize()` into:
  * `create_flat_storage_for_shard()`
  * `schedule_load_memtrie()`
  * the actual `set_state_finalize()`
  * ^ we need this because creating flat storage and finalizing the state update require `chain`, which cannot be passed in a message to the separate thread.
* Code to trigger the memtrie load in a separate thread, analogously to how applying state parts is done.
* Modify the shard sync stages:
  * `StateDownloadScheduling` --> `StateApplyScheduling`
    * Just a rename, since the old name was confusing. What happens there is scheduling the application of state parts.
  * `StateDownloadApplying` --> `StateApplyComplete`
    * What it actually did before was initializing flat storage and finalizing the state update after the state apply from the previous stage.
    * Now it only initializes flat storage and schedules the memtrie load.
  * `StateDownloadComplete` --> `StateApplyFinalizing`
    * Before, it only decided which stage to transition into next.
    * Now it also contains the state-update finalization logic that previously lived in the preceding stage.

Integration tests are to be done as part of #10844.
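A rough sketch of the renamed stages as an enum; the variant names come from the list above, but the enum name, doc comments, and traversal order shown here are illustrative rather than nearcore's actual definition:

```rust
/// Illustrative sketch of the shard sync stages after the rename. The real type
/// in nearcore has more variants and carries data; this only captures the three
/// stages discussed above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ShardSyncStage {
    /// Schedule the application of the downloaded state parts.
    StateApplyScheduling,
    /// State parts have been applied: create flat storage for the shard and
    /// schedule the (now asynchronous) memtrie load.
    StateApplyComplete,
    /// Finalize the state update and decide which stage to transition to next.
    StateApplyFinalizing,
}

fn main() {
    // Assumed traversal order during state sync.
    let order = [
        ShardSyncStage::StateApplyScheduling,
        ShardSyncStage::StateApplyComplete,
        ShardSyncStage::StateApplyFinalizing,
    ];
    for stage in order {
        println!("{stage:?}");
    }
}
```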
Issue: #10564
Summary
Adds logic to load / unload in-memory tries that works with state sync. Enables in-memory trie with single shard tracking.
Changes
* `state_root` parameter for the memtrie loading logic - it's needed when we cannot read the state root from chunk extra.
* `load_mem_tries_for_tracked_shards` config parameter (sketched below).
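A hedged sketch of how such a flag could be wired into a node config, assuming it is a boolean toggle and assuming `serde`/`serde_json`; the struct name, field placement, and default are illustrative, not nearcore's actual `StoreConfig`:

```rust
use serde::{Deserialize, Serialize};

/// Illustrative config fragment; the real nearcore store config has many more
/// fields and may place or type this flag differently.
#[derive(Serialize, Deserialize, Debug)]
struct StoreConfigSketch {
    /// If true, load an in-memory trie for every shard the node tracks
    /// (assumed semantics of `load_mem_tries_for_tracked_shards`).
    #[serde(default)]
    load_mem_tries_for_tracked_shards: bool,
}

fn main() {
    let json = r#"{ "load_mem_tries_for_tracked_shards": true }"#;
    let cfg: StoreConfigSketch = serde_json::from_str(json).unwrap();
    println!("{cfg:?}");
}
```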
Follow up tasks