Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add state sync support #7166

Merged
merged 23 commits into from
Sep 8, 2020
Merged

Add state sync support #7166

merged 23 commits into from
Sep 8, 2020

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Aug 25, 2020

Fixes #5689, fixes #5690. Adds support for taking, restoring, and serving snapshots for state sync, as outlined in ADR-053 and the ABCI reference.

  • rootmulti.Store now implements a new snapshots.Snapshotter interface, which can take and restore binary state snapshots at a given height. Snapshots are built as follows:

    1. rootmulti.Store iterates over all iavl.Store stores, and for each one emits a Protobuf SnapshotStoreItem message with store metadata

    2. For each iavl.Store, export iavl.ExportNode nodes via Export() at the given height and emit a Protobuf SnapshotIAVLItem message for each node

    3. The length-prefix framed Protobuf message stream is zlib-compressed

    4. The zlib-compressed stream is chunked naïvely into 10 MB chunks

  • snapshots.Store stores snapshot metadata in a separate database and chunks in a filesystem directory

  • BaseApp takes asynchronous snapshots after Commit() in regular height intervals given by the config option state-sync.snapshot-interval (default 0, i.e. disabled). It also prunes old snapshots, and keeps the most recent snapshots given by the option state-sync.snapshot-keep-recent (default 2). The snapshot interval must be a multiple of pruning-keep-every, to make sure IAVL versions aren't deleted while the height is being snapshotted.

  • BaseApp implements the state sync ABCI interface, for fetching and applying state snapshots.

  • SimApp has been updated to enable and initialize state sync snapshotting.


Before we can merge this PR, please make sure that all the following items have been
checked off. If any of the checklist items are not applicable, please leave them but
write a little note why.

  • Targeted PR against correct branch (see CONTRIBUTING.md)
  • Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
  • Code follows the module structure standards.
  • Wrote unit and integration tests
  • Updated relevant documentation (docs/) or specification (x/<module>/spec/)
  • Added relevant godoc comments.
  • Added a relevant changelog entry to the Unreleased section in CHANGELOG.md
  • Re-reviewed Files changed in the Github PR explorer
  • Review Codecov Report in the comment section below once CI passes

@codecov
Copy link

codecov bot commented Aug 25, 2020

Codecov Report

Merging #7166 into master will decrease coverage by 0.34%.
The diff coverage is 68.80%.

@@            Coverage Diff             @@
##           master    #7166      +/-   ##
==========================================
- Coverage   54.84%   54.49%   -0.35%     
==========================================
  Files         577      430     -147     
  Lines       39629    31375    -8254     
==========================================
- Hits        21733    17098    -4635     
+ Misses      16100    12870    -3230     
+ Partials     1796     1407     -389     

@fedekunze fedekunze requested review from anilcse and clevinson August 26, 2020 07:18
store/types/types.proto Outdated Show resolved Hide resolved
snapshots/types/types.proto Outdated Show resolved Hide resolved
snapshots/types/snapshotter.go Outdated Show resolved Hide resolved
snapshots/types/snapshotter.go Show resolved Hide resolved
baseapp/abci.go Outdated Show resolved Hide resolved
baseapp/abci.go Outdated Show resolved Hide resolved
baseapp/abci.go Show resolved Hide resolved
snapshots/manager.go Outdated Show resolved Hide resolved
snapshots/util.go Show resolved Hide resolved
snapshots/manager.go Outdated Show resolved Hide resolved
snapshots/manager.go Outdated Show resolved Hide resolved
snapshots/manager.go Show resolved Hide resolved
@robert-zaremba
Copy link
Collaborator

I have few architecture questions (sorry, I'm still learning the details about the system):

  1. Are we always able to correctly determine the number of chunks? Would it be better to have it dynamic (eg, have an async message saying chunk X is the last one)?
  2. Where is the lightclient check in the code (is it in this PR or it's in separate issue?)
    • How we know that lightclient is up-to-date with the data from the latest snapshot? Does lightclient implement the full tendermint consensus algorithm?

@robert-zaremba
Copy link
Collaborator

Shouldn't we somehow let the user decide from which snapshot he wants to sync? For example Algorands' Fast Catchup requires a snapshot ID / hash when starting a node with a fast catchap mode.

https://developer.algorand.org/docs/run-a-node/setup/install/#sync-node-network-using-fast-catchup

@erikgrinaker
Copy link
Contributor Author

Are we always able to correctly determine the number of chunks? Would it be better to have it dynamic (eg, have an async message saying chunk X is the last one)?

Since snapshots are pre-generated on a fixed schedule, we always know ahead of time how many chunks a snapshot has. This is an optimization to speed up restores, since actually taking the snapshot is the time-consuming part, so fetching and restoring a pre-generated snapshot is much faster.

In the initial protocol, each chunk had a boolean indicating that it was the final chunk, but users complained that this was unnecessary and confusing (since we already give the number of chunks in the snapshot metadata), so it was dropped.

Where is the lightclient check in the code (is it in this PR or it's in separate issue?)

Tendermint does the light client verification after the snapshot has been restored, by checking the app hash against the on-chain app hash. You can read some more details on this from the Tendermint side here:

https://docs.tendermint.com/master/spec/abci/apps.html#state-sync

Note that we don't do any incremental verification of chunks against the app hash, as this was considered out of scope for an initial version. Since Cosmos Hub restores take about 30 seconds anyway, erroring early doesn't really save that much time.

How we know that lightclient is up-to-date with the data from the latest snapshot? Does lightclient implement the full tendermint consensus algorithm?

The light client is given a root of trust (a height and commit hash), and then finds the trusted on-chain app hash for the snapshot height based on that via untrusted RPC servers. Details on the light client protocol here:

https://medium.com/tendermint/everything-you-need-to-know-about-the-tendermint-light-client-f80d03856f98

Shouldn't we somehow let the user decide from which snapshot he wants to sync? For example Algorands' Fast Catchup requires a snapshot ID / hash when starting a node with a fast catchap mode.

In the current protocol, Tendermint does snapshot discovery and picks the snapshot that seems "best". We could have the user find and specify a specific snapshot instead, but that would need e.g. block explorer interfaces to expose those snapshots to the user, and make it harder to use. Open to discussing this though, feel free to open a Tendermint issue with a proposal.

Copy link
Collaborator

@robert-zaremba robert-zaremba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all updates. good job.

I've checked in Cosmos SDK about error handling - there is a type for that and we just need to add a code for logic errors and maybe tendermint codespace. Please look at the comment I left.

snapshots/manager.go Outdated Show resolved Hide resolved
@erikgrinaker
Copy link
Contributor Author

Thanks for reviewing! I think I've addressed all of your comments apart from the error handling (see separate response), is there anything else outstanding?

@robert-zaremba
Copy link
Collaborator

Thanks for reviewing! I think I've addressed all of your comments apart from the error handling (see separate response), is there anything else outstanding?

@erikgrinaker Looks very good -- let's close the parts about errors (I've just responded to it) and hashing.

@erikgrinaker
Copy link
Contributor Author

I believe @alexanderbez is reviewing this as well, so I'll wait for his approval.

@amaury1093 amaury1093 added this to the v0.40 [Stargate] milestone Sep 4, 2020
Copy link
Contributor

@alexanderbez alexanderbez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utACK. Overall, looks solid! I left a few minor remarks, but otherwise it LGTM 👍

baseapp/baseapp.go Show resolved Hide resolved
simapp/simd/cmd/root.go Show resolved Hide resolved
snapshots/manager.go Show resolved Hide resolved
types/errors/errors.go Show resolved Hide resolved
@zmanian
Copy link
Member

zmanian commented Sep 5, 2020

This seems so close to merging that I'm excited to have it stargate-2

@alexanderbez
Copy link
Contributor

Lets fix the merge conflicts and get this bad boy merged.

@erikgrinaker
Copy link
Contributor Author

Awesome, thanks for reviewing, I know it was a fairly big chunk! Will resolve and merge first thing tomorrow!

@erikgrinaker erikgrinaker added the A:automerge Automatically merge PR once all prerequisites pass. label Sep 8, 2020
@mergify mergify bot merged commit 4faeefe into master Sep 8, 2020
@mergify mergify bot deleted the erik/state-sync branch September 8, 2020 09:05
larry0x pushed a commit to larry0x/cosmos-sdk that referenced this pull request May 22, 2023
* Add state sync support

* fix incorrect test tempdir

* proto: move and update Protobuf schemas

* proto: lint fixes

* comment tweaks

* don't use type aliasing

* don't call .Error() when logging errors

* use create terminology instead of take for snapshots

* reuse chunk hasher

* simplify key encoding code

* track chunk index in Manager

* add restoreDone message for Manager

* add a ready channel to Snapshotter.Restore()

* add comment on streaming IO API

* use sdkerrors for error handling

* fix incorrect error

* tweak changelog

* syntax fix

* update test code after merge
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A:automerge Automatically merge PR once all prerequisites pass. C:ABCI C:baseapp
Projects
None yet
Development

Successfully merging this pull request may close these issues.

State sync ABCI support State sync snapshotting
7 participants