Enable in-memory trie when state sync #10820
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10820      +/-   ##
==========================================
- Coverage   71.65%   71.52%    -0.13%
==========================================
  Files         758      759        +1
  Lines      151950   151585      -365
  Branches   151950   151585      -365
==========================================
- Hits       108880   108428      -452
- Misses      38533    38665      +132
+ Partials     4537     4492       -45
Force-pushed from 60650ef to 60b6f2f.
@staffik It's marked as draft; please open it when it's ready for review.
Force-pushed from 128a3ea to c4f0908.
Force-pushed from c4f0908 to 5709629.
@VanBarbascu Can you have a look as well?
Force-pushed from d11728c to 3ab877d.
Nice! Left a couple of comments
Can you make sure the following cases are covered, and can you add some integration or nayduck tests for them?
- node needs to state sync after being offline for too long
- node needs to state sync to get the shard tracked next epoch
- node is restarted in the middle of state sync
- node is restarted during catchup
- node is restarted after state sync
Can you also add some debug logs and metrics for loading and unloading? This has the potential to come under scrutiny in the future :)
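Not from this PR, just a minimal sketch of what such instrumentation could look like, assuming the `tracing` and `prometheus` crates; the metric name, labels, and the wrapper functions are made up for illustration and are not nearcore's actual metrics:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_gauge_vec, IntGaugeVec};

// Hypothetical metric: whether an in-memory trie is currently loaded, per shard.
static MEM_TRIE_LOADED: Lazy<IntGaugeVec> = Lazy::new(|| {
    register_int_gauge_vec!(
        "near_mem_trie_loaded",
        "Whether an in-memory trie is loaded for a shard (1 = loaded, 0 = not loaded)",
        &["shard_uid"]
    )
    .unwrap()
});

// Illustrative hooks around the (assumed) load/unload entry points.
fn on_mem_trie_loaded(shard_uid: &str, load_duration_secs: f64) {
    tracing::debug!(target: "memtrie", %shard_uid, load_duration_secs, "loaded in-memory trie");
    MEM_TRIE_LOADED.with_label_values(&[shard_uid]).set(1);
}

fn on_mem_trie_unloaded(shard_uid: &str) {
    tracing::debug!(target: "memtrie", %shard_uid, "unloaded in-memory trie");
    MEM_TRIE_LOADED.with_label_values(&[shard_uid]).set(0);
}

fn main() {
    on_mem_trie_loaded("s3.v1", 42.0);
    on_mem_trie_unloaded("s3.v1");
}
```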
Force-pushed from 6eac8cd to c256de7.
Force-pushed from ebcf133 to 0fe2430.
@@ -2770,6 +2783,8 @@ impl Chain {
        );
        store_update.commit()?;
        flat_storage_manager.create_flat_storage_for_shard(shard_uid).unwrap();
        // Flat storage is ready, load memtrie if it is enabled.
Is this going to execute synchronously? If so, does this happen only during startup?
If it can happen outside of startup as well, is there a way to run this in a different thread? I don't think it is acceptable to pause the chain for 2 minutes.
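For illustration only, a minimal sketch of pushing a long-running load onto a background thread with `std::thread` and an mpsc channel so the main loop keeps running; `load_mem_trie_for_shard` here is a stand-in, not the actual nearcore API:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

// Stand-in for the expensive memtrie load; in the real code this would read
// flat storage / trie nodes for the shard and can take minutes.
fn load_mem_trie_for_shard(shard_uid: u64) {
    let _ = shard_uid;
}

fn main() {
    let (done_tx, done_rx) = mpsc::channel();
    let shard_uid = 3u64;

    // Spawn the load on a separate thread so the chain can keep processing blocks.
    thread::spawn(move || {
        let start = Instant::now();
        load_mem_trie_for_shard(shard_uid);
        // Notify the main loop that the memtrie is ready.
        let _ = done_tx.send((shard_uid, start.elapsed()));
    });

    // The main loop polls for completion instead of blocking on the load.
    loop {
        match done_rx.try_recv() {
            Ok((shard, took)) => {
                println!("memtrie for shard {shard} loaded in {took:?}");
                break;
            }
            Err(mpsc::TryRecvError::Empty) => {
                // ... keep processing blocks here ...
                thread::sleep(Duration::from_millis(100));
            }
            Err(mpsc::TryRecvError::Disconnected) => break,
        }
    }
}
```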
            .iter()
            .map(|id| self.epoch_manager.shard_id_to_uid(*id, &epoch_id).unwrap())
            .collect();
        self.runtime_adapter.retain_mem_tries(&shard_uids);
If for this epoch we track shard 1, and for the next epoch we track 2, we would be starting a state sync for shard 2, but here we would still retain [1, 2], right? So that wouldn't actually unload shard 2?
Yes
What I'm saying is that we should unload shard 2.
That can happen if shard 2 was tracked in the previous epoch, is not tracked in this epoch, and is tracked in the next epoch. Then indeed shard 2 will not be unloaded by retain_mem_tries. That's why we have unload_mem_trie, which unloads shard 2 before we apply state parts for shard 2 during state sync in this epoch.
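A toy sketch of the two-step policy described above, with made-up function names and plain `u64` shard ids standing in for `ShardUId`:

```rust
use std::collections::HashSet;

// At the epoch switch we keep memtries for shards tracked in this or the next
// epoch and drop everything else; this mirrors the retain_mem_tries call.
fn shards_to_retain(this_epoch: &HashSet<u64>, next_epoch: &HashSet<u64>) -> HashSet<u64> {
    this_epoch.union(next_epoch).copied().collect()
}

// Before applying state parts for a shard we unconditionally unload its memtrie,
// mirroring unload_mem_trie. This covers the corner case above: a shard tracked in
// epoch T-1 and T+1 but not in T survives the retain step, yet its loaded memtrie
// is stale once the shard's state is re-downloaded, so it must be dropped first.
fn shards_to_unload_before_state_sync(state_sync_shards: &HashSet<u64>) -> HashSet<u64> {
    state_sync_shards.clone()
}

fn main() {
    // Epoch T: shard 2 was tracked in T-1, is not tracked in T, and is tracked in T+1.
    let tracked_this_epoch: HashSet<u64> = [1].into_iter().collect();
    let tracked_next_epoch: HashSet<u64> = [2].into_iter().collect();
    let state_sync_shards: HashSet<u64> = [2].into_iter().collect();

    // retain keeps {1, 2}, so shard 2 stays loaded even though its state is stale...
    assert_eq!(
        shards_to_retain(&tracked_this_epoch, &tracked_next_epoch),
        [1, 2].into_iter().collect::<HashSet<u64>>()
    );
    // ...which is why it is explicitly unloaded right before state parts are applied.
    assert_eq!(
        shards_to_unload_before_state_sync(&state_sync_shards),
        [2].into_iter().collect::<HashSet<u64>>()
    );
}
```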
@@ -81,6 +81,9 @@ impl SyncJobsActions {
    }

    pub fn handle_apply_state_parts_request(&mut self, msg: ApplyStatePartsRequest) {
        // Unload mem-trie (in case it is still loaded) before we apply state parts.
        msg.runtime_adapter.unload_mem_trie(&msg.shard_uid);
How does this interact with the retain_mem_tries earlier? Do I understand this correctly:
- If in epoch T-1 we track 1, in epoch T we track 2, and in epoch T+1 we track 3, then in epoch T, we retain [2, 3] (unload 1), and here in state sync we unload 3 (in case it's still loaded)
- If in epoch T-1 we track 1, in epoch T we track 2, and in epoch T+1 we track 1, then in epoch T, we retain [2, 1] (unload nothing), and here in state sync we unload 1
If that understanding is correct, would it work instead to just call retain_mem_tries using only the "this epoch cares" shards, and then omit the unload_mem_trie call here? Or is there some issue with that?
Your understanding is correct. We need unload_mem_trie for the situation described in #10820 (comment).
**Context**

Issue: #10982
Follow up to: #10820.
Modifies the StateSync state machine so that the memtrie load happens asynchronously on catchup.

**Summary**

* Split `chain.set_state_finalize()` into:
  * `create_flat_storage_for_shard()`
  * `schedule_load_memtrie()`
  * the actual `set_state_finalize()`
  * ^ we need this because creating flat storage and finalizing the state update require `chain`, which cannot be passed in a message to the separate thread.
* Code to trigger the memtrie load in a separate thread, analogously to how applying state parts is done.
* Modify the shard sync stages:
  * `StateDownloadScheduling` --> `StateApplyScheduling`
    * Just a rename, since the old name was confusing. What happens there is scheduling the application of state parts.
  * `StateDownloadApplying` --> `StateApplyComplete`
    * What it actually did before was initializing flat storage and finalizing the state update after the state apply from the previous stage.
    * Now it only initializes flat storage and schedules the memtrie load.
  * `StateDownloadComplete` --> `StateApplyFinalizing`
    * Before, it only decided which stage to transition into next.
    * Now it also contains the state-update finalization logic that previously lived in the preceding stage.

Integration tests are to be done as part of #10844.
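A rough sketch of the renamed stages as an enum; the variant names come from the list above, but the enum name, doc comments, and traversal order shown here are illustrative rather than nearcore's actual definition:

```rust
/// Illustrative sketch of the shard sync stages after the rename. The real type
/// in nearcore has more variants and carries data; this only captures the three
/// stages discussed above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ShardSyncStage {
    /// Schedule the application of the downloaded state parts.
    StateApplyScheduling,
    /// State parts have been applied: create flat storage for the shard and
    /// schedule the (now asynchronous) memtrie load.
    StateApplyComplete,
    /// Finalize the state update and decide which stage to transition to next.
    StateApplyFinalizing,
}

fn main() {
    // Assumed traversal order during state sync.
    let order = [
        ShardSyncStage::StateApplyScheduling,
        ShardSyncStage::StateApplyComplete,
        ShardSyncStage::StateApplyFinalizing,
    ];
    for stage in order {
        println!("{stage:?}");
    }
}
```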
Issue: #10564
Summary
Adds logic to load / unload in-memory tries that works with state sync. Enables in-memory trie with single shard tracking.
Changes
* `state_root` parameter for the memtrie loading logic - it's needed when we cannot read the state root from chunk extra.
* `load_mem_tries_for_tracked_shards` config parameter (sketched below).
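A hedged sketch of how such a flag could be wired into a node config, assuming it is a boolean toggle and assuming `serde`/`serde_json`; the struct name, field placement, and default are illustrative, not nearcore's actual `StoreConfig`:

```rust
use serde::{Deserialize, Serialize};

/// Illustrative config fragment; the real nearcore store config has many more
/// fields and may place or type this flag differently.
#[derive(Serialize, Deserialize, Debug)]
struct StoreConfigSketch {
    /// If true, load an in-memory trie for every shard the node tracks
    /// (assumed semantics of `load_mem_tries_for_tracked_shards`).
    #[serde(default)]
    load_mem_tries_for_tracked_shards: bool,
}

fn main() {
    let json = r#"{ "load_mem_tries_for_tracked_shards": true }"#;
    let cfg: StoreConfigSketch = serde_json::from_str(json).unwrap();
    println!("{cfg:?}");
}
```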
Follow up tasks