Add state sync support #7166
Codecov Report
@@ Coverage Diff @@
## master #7166 +/- ##
==========================================
- Coverage 54.84% 54.49% -0.35%
==========================================
Files 577 430 -147
Lines 39629 31375 -8254
==========================================
- Hits 21733 17098 -4635
+ Misses 16100 12870 -3230
+ Partials 1796 1407 -389
I have a few architecture questions (sorry, I'm still learning the details of the system):
Shouldn't we somehow let the user decide which snapshot to sync from? For example, Algorand's Fast Catchup requires a snapshot ID / hash when starting a node in fast catchup mode. https://developer.algorand.org/docs/run-a-node/setup/install/#sync-node-network-using-fast-catchup
Since snapshots are pre-generated on a fixed schedule, we always know ahead of time how many chunks a snapshot has. Pre-generating is an optimization to speed up restores: actually taking the snapshot is the time-consuming part, so fetching and restoring a pre-generated snapshot is much faster. In the initial protocol, each chunk had a boolean indicating whether it was the final chunk, but users complained that this was unnecessary and confusing (since we already give the number of chunks in the snapshot metadata), so it was dropped.
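To illustrate why a per-chunk "final" flag is redundant, here is a minimal sketch of a fetch loop driven only by the chunk count in the metadata. `SnapshotMeta`, `ChunkFetcher`, and `FetchAll` are hypothetical names made up for this example; they are not the actual ABCI or SDK types.

```go
package main

import (
	"errors"
	"fmt"
)

// SnapshotMeta is a hypothetical stand-in for advertised snapshot metadata.
// The field names are assumptions for illustration only.
type SnapshotMeta struct {
	Height uint64
	Format uint32
	Chunks uint32 // total number of chunks, known before any chunk is fetched
}

// ChunkFetcher is a hypothetical callback that returns one chunk by index.
type ChunkFetcher func(height uint64, format, index uint32) ([]byte, error)

// FetchAll requests chunks 0..Chunks-1. The loop bound comes from the
// metadata, so no chunk needs to say whether it is the last one.
func FetchAll(meta SnapshotMeta, fetch ChunkFetcher) ([][]byte, error) {
	if meta.Chunks == 0 {
		return nil, errors.New("snapshot has no chunks")
	}
	chunks := make([][]byte, 0, meta.Chunks)
	for i := uint32(0); i < meta.Chunks; i++ {
		chunk, err := fetch(meta.Height, meta.Format, i)
		if err != nil {
			return nil, fmt.Errorf("fetching chunk %d of %d: %w", i, meta.Chunks, err)
		}
		chunks = append(chunks, chunk)
	}
	return chunks, nil
}

func main() {
	meta := SnapshotMeta{Height: 100, Format: 1, Chunks: 3}
	// A fake fetcher that returns a one-byte chunk containing its own index.
	chunks, err := FetchAll(meta, func(_ uint64, _, index uint32) ([]byte, error) {
		return []byte{byte(index)}, nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("fetched chunks:", len(chunks))
}
```

The receiver knows it is done when it has seen `Chunks` chunks, which is all the information the dropped boolean carried.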
Tendermint does the light client verification after the snapshot has been restored, by checking the app hash against the on-chain app hash. You can read some more details on this from the Tendermint side here: https://docs.tendermint.com/master/spec/abci/apps.html#state-sync Note that we don't do any incremental verification of chunks against the app hash, as this was considered out of scope for an initial version. Since Cosmos Hub restores take about 30 seconds anyway, erroring early doesn't really save that much time.
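As a rough sketch of the post-restore check described above (not Tendermint's actual code), the comparison boils down to a byte-wise equality test between the app hash the restored application reports and the trusted on-chain app hash for the snapshot height. `VerifyRestoredAppHash` is a hypothetical helper name.

```go
package main

import (
	"bytes"
	"fmt"
)

// VerifyRestoredAppHash is a hypothetical helper: it compares the app hash
// reported by the application after the restore with the trusted app hash
// obtained for the snapshot height. Nothing is checked per chunk; a bad
// snapshot is only detected here, once the whole restore has finished.
func VerifyRestoredAppHash(restored, trusted []byte) error {
	if len(trusted) == 0 {
		return fmt.Errorf("no trusted app hash available")
	}
	if !bytes.Equal(restored, trusted) {
		return fmt.Errorf("app hash mismatch: restored %X, trusted %X", restored, trusted)
	}
	return nil
}

func main() {
	trusted := []byte{0xAB, 0xCD}
	fmt.Println(VerifyRestoredAppHash([]byte{0xAB, 0xCD}, trusted) == nil) // matching hashes
	fmt.Println(VerifyRestoredAppHash([]byte{0x00, 0x01}, trusted) == nil) // mismatching hashes
}
```

The trade-off is the one noted in the comment: a corrupt chunk costs a full restore before it is noticed, which is acceptable while restores are short.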
The light client is given a root of trust (a height and commit hash), and then finds the trusted on-chain app hash for the snapshot height based on that via untrusted RPC servers. Details on the light client protocol here:
In the current protocol, Tendermint does snapshot discovery and picks the snapshot that seems "best". We could have the user find and specify a specific snapshot instead, but that would need e.g. block explorer interfaces to expose those snapshots to the user, and make it harder to use. Open to discussing this though, feel free to open a Tendermint issue with a proposal.
Thanks for all the updates. Good job.
I've checked how the Cosmos SDK handles errors: there is a type for that, and we just need to add a code for logic errors and maybe a Tendermint codespace. Please look at the comment I left.
Thanks for reviewing! I think I've addressed all of your comments apart from the error handling (see separate response). Is there anything else outstanding?
@erikgrinaker Looks very good -- let's close the parts about errors (I've just responded to it) and hashing.
I believe @alexanderbez is reviewing this as well, so I'll wait for his approval.
utACK. Overall, looks solid! I left a few minor remarks, but otherwise it LGTM 👍
This seems so close to merging that I'm excited to have it.
Let's fix the merge conflicts and get this bad boy merged.
Awesome, thanks for reviewing, I know it was a fairly big chunk! Will resolve and merge first thing tomorrow!
* Add state sync support
* fix incorrect test tempdir
* proto: move and update Protobuf schemas
* proto: lint fixes
* comment tweaks
* don't use type aliasing
* don't call .Error() when logging errors
* use create terminology instead of take for snapshots
* reuse chunk hasher
* simplify key encoding code
* track chunk index in Manager
* add restoreDone message for Manager
* add a ready channel to Snapshotter.Restore()
* add comment on streaming IO API
* use sdkerrors for error handling
* fix incorrect error
* tweak changelog
* syntax fix
* update test code after merge
Fixes #5689, fixes #5690. Adds support for taking, restoring, and serving snapshots for state sync, as outlined in ADR-053 and the ABCI reference.
* `rootmulti.Store` now implements a new `snapshots.Snapshotter` interface, which can take and restore binary state snapshots at a given height. Snapshots are built as follows:
  * `rootmulti.Store` iterates over all `iavl.Store` stores, and for each one emits a Protobuf `SnapshotStoreItem` message with store metadata
  * For each `iavl.Store`, export `iavl.ExportNode` nodes via `Export()` at the given height and emit a Protobuf `SnapshotIAVLItem` message for each node
  * The length-prefix framed Protobuf message stream is zlib-compressed
  * The zlib-compressed stream is chunked naïvely into 10 MB chunks
* `snapshots.Store` stores snapshot metadata in a separate database and chunks in a filesystem directory
* `BaseApp` takes asynchronous snapshots after `Commit()` in regular height intervals given by the config option `state-sync.snapshot-interval` (default 0, i.e. disabled). It also prunes old snapshots, and keeps the most recent snapshots given by the option `state-sync.snapshot-keep-recent` (default 2). The snapshot interval must be a multiple of `pruning-keep-every`, to make sure IAVL versions aren't deleted while the height is being snapshotted.
* `BaseApp` implements the state sync ABCI interface, for fetching and applying state snapshots.
* `SimApp` has been updated to enable and initialize state sync snapshotting.

Before we can merge this PR, please make sure that all the following items have been
checked off. If any of the checklist items are not applicable, please leave them but
write a little note why.
* … `docs/`) or specification (`x/<module>/spec/`)
* … `godoc` comments.
* … `Unreleased` section in `CHANGELOG.md`
* … `Files changed` in the Github PR explorer
* … `Codecov Report` in the comment section below once CI passes
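The snapshot build steps in the PR description (length-prefixed messages, zlib compression, naive fixed-size chunking) can be sketched roughly as below. This is an illustration under stated assumptions, not the code in this PR: messages are opaque byte slices rather than Protobuf items, the length prefix is assumed to be a uvarint, everything is buffered in memory instead of streamed, and `BuildChunks` / `ReadChunks` are hypothetical names.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/zlib"
	"encoding/binary"
	"fmt"
	"io"
)

// BuildChunks frames each message with a uvarint length prefix (an assumed
// framing), zlib-compresses the framed stream, and cuts the compressed bytes
// into chunks of at most chunkSize bytes.
func BuildChunks(msgs [][]byte, chunkSize int) ([][]byte, error) {
	if chunkSize <= 0 {
		return nil, fmt.Errorf("chunk size must be positive, got %d", chunkSize)
	}
	var compressed bytes.Buffer
	zw := zlib.NewWriter(&compressed)
	var lenBuf [binary.MaxVarintLen64]byte
	for _, m := range msgs {
		n := binary.PutUvarint(lenBuf[:], uint64(len(m)))
		if _, err := zw.Write(lenBuf[:n]); err != nil {
			return nil, err
		}
		if _, err := zw.Write(m); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil { // flushes remaining data and the checksum
		return nil, err
	}
	data := compressed.Bytes()
	var chunks [][]byte
	for len(data) > 0 {
		n := chunkSize
		if len(data) < n {
			n = len(data)
		}
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks, nil
}

// ReadChunks reverses BuildChunks: it concatenates the chunks, decompresses
// the stream, and reads length-prefixed messages until the stream ends.
func ReadChunks(chunks [][]byte) ([][]byte, error) {
	zr, err := zlib.NewReader(bytes.NewReader(bytes.Join(chunks, nil)))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	br := bufio.NewReader(zr)
	var msgs [][]byte
	for {
		size, err := binary.ReadUvarint(br)
		if err == io.EOF {
			return msgs, nil // clean end of stream between messages
		}
		if err != nil {
			return nil, err
		}
		msg := make([]byte, size)
		if _, err := io.ReadFull(br, msg); err != nil {
			return nil, err
		}
		msgs = append(msgs, msg)
	}
}

func main() {
	msgs := [][]byte{[]byte("store:bank"), []byte("node-1"), []byte("node-2")}
	// A tiny chunk size stands in for the 10 MB chunks of the description.
	chunks, err := BuildChunks(msgs, 8)
	if err != nil {
		panic(err)
	}
	restored, err := ReadChunks(chunks)
	if err != nil {
		panic(err)
	}
	fmt.Println("chunks:", len(chunks), "messages restored:", len(restored))
}
```

Because chunk boundaries are chosen purely by byte count on the compressed stream, a chunk is not meaningful on its own; the restorer has to feed the chunks back in order.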