Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

speed up rollback command #13744

Closed
yihuang opened this issue Nov 3, 2022 · 11 comments
Closed

speed up rollback command #13744

yihuang opened this issue Nov 3, 2022 · 11 comments
Labels

Comments

@yihuang
Copy link
Collaborator

yihuang commented Nov 3, 2022

Summary

rollback command take hours to run.

Problem Definition

As a firefighting command, we hope it runs faster.
In practice, I find that the minimal thing we need to do a rollback is:

  • rollback tendermnt state
  • delete the iavl root nodes
  • set the value of s/latest

So my question is:

  1. Is there any side effects of this other than some wasted disk space (the orphan nodes)
  2. Should we provide this option to user in the rollback command?

Proposal

@tac0turtle
Copy link
Member

I think there are some edge cases to rollback that aren't documented or tested as well. I think the feature needs to be redone entirely

@alexanderbez
Copy link
Contributor

@yihuang do you know where the major bottlenecks are? Is it IAVL pruning?

@yihuang
Copy link
Collaborator Author

yihuang commented Nov 3, 2022

@yihuang do you know where the major bottlenecks are? Is it IAVL pruning?

I didn't do timing, but I guess the traversal deleteNodesFrom and traverseFastNodes are pretty heavy,
It's an archive node, so I guess traverseOrphans should be empty?
Is traverseFastNodes necessary here? when start up, the fast node index will be recreated anyway.
Is the only side-effect of skipping deleteNodesFrom some waste of disk space, don't affect app logic, it seems to work in my testing.

@yihuang
Copy link
Collaborator Author

yihuang commented Nov 11, 2022

It's an archive node, so I guess traverseOrphans should be empty?

I misundetstood, orphan records are there no matter archive node or not, so this one is heavy too, because it iterate all orphan records.

@yihuang
Copy link
Collaborator Author

yihuang commented Nov 11, 2022

  • traverseFastNodes can be removed directly, because fastnode index is rebuilt when restart anyway, if enabled.
  • deleteNodesFrom can be safely skipped, since the nodes are indexed by hash, the only side effect is leaving some dirty data behind
  • traverseOrphans can be optimized by using orphanKeyFormat.Key(version) as prefix to limit the iteration scope.
  • delete root nodes, keep it.

@robert-zaremba
Copy link
Collaborator

robert-zaremba commented Nov 17, 2022

Rollback must be well coordinated. At Umee we had a problem with rollback: the validators didn't cooperate well. Some joined later and couldn't sync with a network because they couldn't validate the blocks. Tendermint removed the last app state, hence couldn't get the validator set to validate a new committed block. It worked for signing, but not for validating.
So many validators had to do a resync with the network. Worst problem was with the archival nodes.

I was describing this problem in Discord, and was suppose to make an issue. However I forgot about it and created it right now: tendermint/tendermint#9715

TL;DR: we need to look back again at the design of the rollback command.

@robert-zaremba
Copy link
Collaborator

Also, for a related note, in state/v2 design version removal was handled way more efficient using DB native features, rather than iterating over the nodes.

@robert-zaremba
Copy link
Collaborator

@tac0turtle why thumbs down? I think it's a good observation to use DB native features if possible, and critical for performance, rather than doing it "naively". We were discussing this approach already 2.5 year ago with @erikgrinaker

@yihuang
Copy link
Collaborator Author

yihuang commented Nov 17, 2022

In our experiment with that snapshot design, the db size bloat pretty fast, because it's based on hard linking the sst files, there are lots of waste of disk space.

@yihuang
Copy link
Collaborator Author

yihuang commented Nov 17, 2022

Rollback must be well coordinated. At Umee we had a problem with rollback: the validators didn't cooperate well.

Yeah, it's pretty dangerous for majority of validators to do rollback together, might risk halting the network.
We only use it to recover full nodes from occasional app hash mismatch issues. Haven't met large scale app hash mismatch issues on validators yet.

@yihuang
Copy link
Collaborator Author

yihuang commented Feb 1, 2023

closing, merged in iavl: cosmos/iavl#636, it works best together with the lazy-iavl-loading.

@yihuang yihuang closed this as completed Feb 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants