Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable local snapshot restore (backport: #733) #728

Closed
wants to merge 2 commits into from

Conversation

yihuang
Copy link
Contributor

@yihuang yihuang commented Apr 19, 2023

Closes: #29
Original PR: tendermint/tendermint#9541

Sync state in Handshake via RPC

CV Cleanup dead code


PR checklist

  • Tests written/updated
  • Changelog entry added in .changelog (we use unclog to manage our changelog)
  • Updated relevant documentation (docs/ or spec/) and code comments

Sync state in Handshake via RPC

CV Cleanup dead code
@yihuang yihuang changed the title Enable local snapshot restore Enable local snapshot restore (backport: #733) Apr 20, 2023
@yihuang yihuang marked this pull request as draft April 20, 2023 01:45
Copy link
Member

@adizere adizere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the initial pass at this solution, very grateful for the work here!

I left some questions / suggestions. I would be glad to fix myself some of the concerns I raised directly in the dev branch, if that's ok with you @yihuang and @chillyvee; but will of course give you priority to make commits on the branch.

if err := doHandshake(stateStore, state, blockStore, genDoc, eventBus, proxyApp, consensusLogger); err != nil {

// RPC recovery is required if blockStore Height is 0
if blockStore.Height() == 0 || !stateSync {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we also check that state.LastBlockHeight == 0 in addition to blockStore.Height() == 0?

Something like:

Suggested change
if blockStore.Height() == 0 || !stateSync {
if (blockStore.Height() == 0 && state.LastBlockHeight == 0) || !stateSync {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is worth discussing since there are gaps in knowledge that need to be filled. There are three cases where the comet height is zero.

  1. Actually genesis
  2. Statesync completely over the network as we use today
  3. LocalStatesync where the application.db is already restored, but comet is at 0

One of the cosmos-sdk to comet integration challenges for local state sync is communicating the height of appstate/application.db

In the initial implementation, an extra applicationd.db height parameter was passed from cosmos-sdk to comet. However a previous code review tried to eliminate that extra parameter and simply check if comet height was 0, and start RPC statesync.

However that actually causes some issues because a normal network statesync also has comet height == 0, but application.db is also 0.

We need an agreeable way to determine the height of application.db from cosmos-sdk. Since a previous code review determined a parameter was not desirable, is there a better way to query the application height? Is there an object we should look at, or make an ABCI call?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for the detailed notes and history here, I wasn't aware of the subtlety here!

Is there an object we should look at, or make an ABCI call?

An ABCI call (info) would indeed provide the app height I believe. But not sure that's the best way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally it was added as a variable to configure when cosmos-sdk created the Node{}, but it was asked to be removed during code-review, probably to avoid creating a new dependency between cosmos-sdk and comet

Now that we know the removal causes an issue, it's up to us to pick a new way to compare the heights of comet vs cosmos-sdk app.

Is there general guidance on how to communicate variables between comet and cosmos-sdk?

Copy link
Collaborator

@cason cason May 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, for context, a valid CometBFT state requires:

 blockStore.Height() - 1  <= state.LastBlockHeight <=  blockStore.Height()

So, if the commit of block at height blockStore.Height() has been completed, the state height should be the same. If it hasn't, as the node crashed in the while, the state height should be 1 unit below.

Having said so, if the state of CometBFT is valid and blockStore.Height() == 0 then state.LastBlockHeight == 0 is mandatory.

// Optimistically build new state, so we don't discover any light client failures at the end.
state, err = h.stateProvider.State(pctx, appBlockHeight)
if err != nil {
h.logger.Error("failed to fetch and verify tendermint state", "err", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Should we use "comet" instead of "tendermint" ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

if err == light.ErrNoWitnesses {
return sm.State{}, nil, err
}
return sm.State{}, nil, errors.New("snapshot was rejected")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we be more specific in this error message? It seems to me that what happened here was that Comet's light client verified the app hash present in the local snapshot of app.db, and this app hash was incorrect. Is this accurate?

Copy link
Contributor

@chillyvee chillyvee Apr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Referencing func (s *lightClientStateProvider) State(ctx context.Context, height uint64) (sm.State, error) {

The error conditions are failure to VerifyLightBlockAtHeight , RPC Client Creation Failure (Bad URL), and failure to fetch consensus parameters for the height.

What do you think of wrapping the error to add detail?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a good idea. Let's wrap the err to save the detail and return it to the caller.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is not snapshot involved here.

The error should say that the application state was rejected, instead.

if err == light.ErrNoWitnesses {
return sm.State{}, nil, err
}
return sm.State{}, nil, errors.New("snapshot was rejected")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above: Would it be good to have more explicit error message here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of wrapping the error to add detail?

@yihuang
Copy link
Contributor Author

yihuang commented Apr 21, 2023

Haven't looked into detail, but find an unexpected behavior in test: when the app height is 0, statesync enabled, it still try to init with genesis.json, which don't happen with normal version.

@adizere adizere mentioned this pull request Apr 26, 2023
3 tasks
@github-actions
Copy link

github-actions bot commented May 5, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale For use by stalebot label May 5, 2023
if err == light.ErrNoWitnesses {
return sm.State{}, nil, err
}
return sm.State{}, nil, errors.New("snapshot was rejected")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is not snapshot involved here.

The error should say that the application state was rejected, instead.

if err := h.stateStore.Save(state); err != nil {
return sm.State{}, nil, err
}
if err := h.store.(*store.BlockStore).SaveSeenCommit(int64(appBlockHeight), commit); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the original implementation, which is Node.startStateSync (I don't know the branch to refer, so no link, sorry), you should use state.LastBlockHeight here for compatibility. It should be equal to appBlockHeight, except for uint/int stuff, but for coherence it is better to follow the same approach.

h.logger.Info("localSync resulting state.Version.Consensus.Block", "state", state.Version.Consensus.Block)

// Assume Heights Restored, Update Heights for edge cases and constraints on the storeBlockHeight and storeBlockBase.
storeBlockHeight = appBlockHeight
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say that these values should be read from the stores, not blindly updated here.

var err error
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
stateProvider, err := statesync.NewLightClientStateProvider(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are not using this "local sync", this would error, right?

We have to still support the case in which the node is simply recovering, and will restore all state from disk.

@@ -989,8 +1007,9 @@ func (n *Node) OnStart() error {
return fmt.Errorf("could not dial peers from persistent_peers field: %w", err)
}

// Run state sync
// Statesync reqeusting all data from P2P
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Statesync reqeusting all data from P2P
// State sync fetching snapshots from peers

if n.stateSync {
// If state and blockStore.Height are both at the same height, skip the P2P Statesync and immediately enter consensus
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't understand this comment.

@github-actions github-actions bot removed the stale For use by stalebot label May 6, 2023
@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale For use by stalebot label May 16, 2023
@cason
Copy link
Collaborator

cason commented May 17, 2023

Are we proceeding with this approach/solution attempt? If so, we should add the wip label to the PR. Otherwise, GitHub will close it.

@yihuang
Copy link
Contributor Author

yihuang commented May 17, 2023

Are we proceeding with this approach/solution attempt? If so, we should add the wip label to the PR. Otherwise, GitHub will close it.

I think no need to pursue

@github-actions github-actions bot removed the stale For use by stalebot label May 18, 2023
@adizere
Copy link
Member

adizere commented May 22, 2023

Closing, a different solution was chosen.

@adizere adizere closed this May 22, 2023
@adizere adizere added this to the 2023-Q2 milestone May 22, 2023
@chillyvee
Copy link
Contributor

Which other solution was chosen? Just had a break in schedule to work on the Juno v14 local state sync demo today.

@yihuang
Copy link
Contributor Author

yihuang commented May 22, 2023

Which other solution was chosen? Just had a break in schedule to work on the Juno v14 local state sync demo today.

cosmos/cosmos-sdk#16061
cosmos/cosmos-sdk#16067

@chillyvee
Copy link
Contributor

Thank you, will review

@chillyvee
Copy link
Contributor

nice code @yihuang :)

@yihuang yihuang deleted the cv_local_statesync-squash branch May 23, 2023 01:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants