
GRANDPA: Warp Sync #1208

Closed
rphmeier opened this issue Dec 4, 2018 · 20 comments · Fixed by #9227
Labels
I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task.

Comments

@rphmeier
Contributor

rphmeier commented Dec 4, 2018

Assuming a

trait AncestryApi<Block: BlockT> {
    // Return the most recent ancestor of `at` where `ancestor.number() % 2^n == 0`,
    // or as close to it as possible.
    // NOTE: only safe when traversing backwards from a FINALIZED block.
    fn log2ancestor(at: &BlockId<Block>, n: Block::Number) -> (Block::Number, Block::Hash);
}

The log2ancestor proofs would be useful for warp sync:

  1. Provide a commit message with a "big" ancestry proof for the most recent finalized block.
  2. Walk backwards to the last handoff, doing a log2-based walk. Prove as much of the handoff's commit-message ancestry as possible with this walk. For any remaining portions, prove it with a "big" ancestry proof.
  3. Repeat step 2 until the genesis or last checkpoint is reached.
@rphmeier rphmeier added the I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. label Dec 4, 2018
@rphmeier rphmeier added this to the As-and-when milestone Dec 4, 2018
@rphmeier
Contributor Author

rphmeier commented Dec 4, 2018

(We will probably want to put some kind of generalized warp-sync in place first)

@rphmeier
Contributor Author

rphmeier commented Aug 12, 2020

Rather than log2-ancestors, Merkle Mountain Ranges (MMRs) could be used. A pallet can keep the MMR peaks in the state and the MMR nodes in the offchain DB. Full nodes can then produce ancestry proofs for any block in the prior chain.

With that, the scheme would look like this:

We assume that there is some block that we refer to as our checkpoint where the validator-set and block information are hardcoded into the client. This may be the genesis block.

When we intend to warp-sync GRANDPA, we are presented with a series of validator-set hand-offs in reverse order, starting from the head of the chain and moving backwards to the checkpoint. Our goal is to show that every validator set since the checkpoint has signed off on the following validator set, and then to show that some block we'll call the Goal has been finalized by the most recent validator set.

Checkpoint <- Intermediate Change 1 <- Intermediate Change 2 <- .... <- Last Change <- Goal
                                                                             ^
                                                                             |
                                                                             |
                                         This is where the most recent validator set was activated

In practice, this is done by presenting a series of finality proofs of blocks that triggered validator set changes. Validator set changes in GRANDPA are handled with a signal-execution duality: first, there is a block that signals that a change will happen in N blocks, and then N blocks later (in the same chain), the change is applied. On most chains, N is always equal to zero, but it's best not to lean on that assumption for reasons of flexibility.

So proving any particular hand-off from a validator set V to a new set V' is done by proving that V finalized a block B' which triggered a change signaled by its ancestor B, where the change indicated that V' would be the new set.

(signal)
B <--------- B'
                (applied at B' = B + N)

Note that in most chains, N = 0 and thus the signal block B and the activation block B' are one and the same. However, the implementation should support chains with N != 0.

The substrate node already keeps finality proofs (Justifications) of all validator-set change blocks.

We don't need to ensure that all the validator-set changes are actually in the same chain because we lean on the property that each validator-set only finalizes 1 block at each height. The chain of validator-set hand-offs thus justifies the entire chain between Checkpoint and Goal.
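
Concretely, the verification loop over this chain of hand-offs might look something like the sketch below. The types and helpers here (HandoffFragment, verify_justification, scheduled_change) are assumptions for illustration, not the actual Substrate/GRANDPA APIs.

// Structural sketch only: hypothetical types and helpers, not the real
// sc-finality-grandpa API.

struct AuthoritySet;   // the public keys (and set id) of one validator set
struct Header;         // a block header, including its digest
struct Justification;  // a GRANDPA commit carrying validator signatures

// One validator-set hand-off: the block B' at which the change is applied,
// plus a justification for it signed by the outgoing set.
struct HandoffFragment {
    header: Header,
    justification: Justification,
}

// Walk the fragments from the checkpoint forwards, switching authority sets as
// we go. On success, returns the set that is active at the head of the list,
// i.e. the set that finalizes Goal.
fn verify_warp_proof(
    checkpoint_set: AuthoritySet,
    fragments: &[HandoffFragment],
) -> Result<AuthoritySet, ()> {
    let mut current_set = checkpoint_set;
    for fragment in fragments {
        // More than 2/3 of `current_set` must have signed a commit for this header.
        verify_justification(&current_set, &fragment.header, &fragment.justification)?;
        // The header (or its signalling ancestor, when N != 0) names the next set V'.
        current_set = scheduled_change(&fragment.header).ok_or(())?;
    }
    Ok(current_set)
}

// Assumed helpers, left unimplemented in this sketch.
fn verify_justification(_: &AuthoritySet, _: &Header, _: &Justification) -> Result<(), ()> {
    unimplemented!()
}
fn scheduled_change(_: &Header) -> Option<AuthoritySet> {
    unimplemented!()
}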

And the last piece of the puzzle is making sure that the light client can still fetch ancient blocks even though it has never synced them. The simplest way to do this is to provide the entire blockchain between Checkpoint and Goal, but that is exactly what we are trying to avoid with a light sync. So this is where MMRs come in. The state of the latest finalized block commits to the set of peaks of the MMR, which can be used to prove the header-hash of any block in the ancestry.

Initially, right after the warp sync, Goal will be the highest finalized block the light client is aware of, but as the client stays active on the network and the chain progresses, the highest finalized block will move forward. This technique can be used to safely fetch any block up to the last finalized block, which means that the light client can discard finalized blocks as it proceeds.

However, fetching data based on un-finalized MMR peaks is DANGEROUS and should not be done.
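
As a rough illustration of the ancestry check against finalized peaks (this is not the pallet-mmr API; the proof shape and hasher are assumptions), verifying that an ancient header hash is in the finalized ancestry amounts to hashing the leaf up its Merkle path and comparing the result against one of the peaks committed in the finalized block's state:

// Sketch only: `peaks` are the MMR peaks read from the finalized block's state,
// `leaf` is the hash of the ancient header being proven, and `proof` is the
// sibling-hash path from that leaf up to one of the peaks. `hash_pair` stands
// in for the chain's hasher (e.g. BLAKE2b-256).
fn verify_mmr_ancestry(
    peaks: &[[u8; 32]],
    leaf: [u8; 32],
    proof: &[([u8; 32], bool)], // (sibling hash, true if the sibling is the left child)
    hash_pair: impl Fn(&[u8; 32], &[u8; 32]) -> [u8; 32],
) -> bool {
    let mut node = leaf;
    for (sibling, sibling_is_left) in proof {
        node = if *sibling_is_left {
            hash_pair(sibling, &node)
        } else {
            hash_pair(&node, sibling)
        };
    }
    // The reconstructed node must be one of the peaks committed in state.
    peaks.iter().any(|peak| *peak == node)
}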

@tomaka
Contributor

tomaka commented Aug 27, 2020

When we intend to warp-sync GRANDPA, we are presented with a series of validator-set hand-offs in reverse order, starting from the head of the chain and moving backwards to the checkpoint.

I guess that this should be implemented with a networking protocol that is separate from the traditional sync.
A light client would ask "please show me the list of validator-set hand-offs to <checkpoint>", and the remote answers with that list.

Since the identifier of the chain is part of the protocol name that is negotiated between peers, the <checkpoint> in the network message could, in theory, consist of only a block height.

@tomaka
Contributor

tomaka commented Aug 27, 2020

Pragmatically speaking, if I'm not mistaken, the authorities list changes every 600 blocks on Kusama.
Supposing that the checkpoint is block #0 and that Kusama is at block 4 million, that'd be 6666 blocks to send over the network (and then to verify).
Empirical tests show that a light client in the browser syncs at around ~1000 blocks/sec, so the full sync would only take ~7 seconds (again, assuming the checkpoint is block #0, the worst-case scenario).

6666 block headers is low enough that their total size can reasonably be put in a single networking message, but I'm a bit worried about the future. We might need to ask light clients with a far-enough checkpoint to do the warp sync in multiple steps by sending multiple requests.
I would therefore suggest that, when a warp sync request is made, the full node answer with the list starting at <checkpoint> and going up in block numbers.
This way, the full node is free to send back only a partial list, and the light client can then resume the sync by sending a second request with a checkpoint corresponding to the last block of the first response.
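
To make that concrete, here is a rough sketch of what the request and response could look like; the type and field names are assumptions for illustration, not actual protocol definitions:

// Request sent by the light client on the dedicated warp-sync protocol. Since
// the chain is identified by the protocol name, this could in theory be just a
// block height, but a 32-byte hash is cheap and lets the server disambiguate.
struct WarpSyncRequest {
    // Block to start from (the client's current checkpoint).
    begin: [u8; 32],
}

// Response: hand-off fragments in ascending block-number order, possibly
// truncated so that it fits in one network message. If `is_finished` is false,
// the client resumes by sending a new request starting at the last block of
// this response.
struct WarpSyncResponse {
    // Each fragment is an encoded (header, justification) pair for one hand-off.
    fragments: Vec<Vec<u8>>,
    is_finished: bool,
}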

@rphmeier
Contributor Author

rphmeier commented Aug 27, 2020

I guess that this should be implemented with a networking protocol that is separate from the traditional sync.
A light client would ask "please show me the list of validator-set hand-offs to <checkpoint>", and the remote answers with that list.

Since the identifier of the chain is part of the protocol name that is negotiated between peers, the <checkpoint> in the network message could, in theory, consist of only a block height.

Yeah, that's right, in theory. The verifier on the other side would have to check that the checkpoint block they get back actually has that same hash. The block hash is only 32 bytes though so there wouldn't be much point in jumping through hoops to avoid it.

Empirical tests show that a light client in the browser syncs at around ~1000 blocks/sec, so the full sync would only take ~7 seconds (again, assuming the checkpoint is block #0, the worst-case scenario).

Right, although we don't actually import these blocks and check block signatures, which I imagine is the bottleneck for sync. We do need to verify GRANDPA signatures, so that is going to be our bottleneck after networking. Each finality proof will have several hundred signatures.

We might need to ask light clients with a far-enough checkpoint to do the warp-sync in multiple steps by sending multiple requests.

In the long term, we can roll all of these handoffs up into an inductive zero-knowledge proof that gives us all of this data in only a few hundred bytes. Until then, yes, we might need to do multiple steps. However, if we have a checkpoint that advances once every month or even once every quarter, then it won't actually be a problem. We want to update the checkpoints every so often anyway because of long-range attacks: very old validator sets might be compromised and no longer have anything at stake.

More near-term, we can avoid the space used by the multiple signatures by switching GRANDPA to BLS, where we can aggregate all signatures into a single piece of data the size of a single signature.

@cheme
Contributor

cheme commented Sep 4, 2020

I have been looking a bit at this issue. From what I understand, you first need to check finality of the 'Goal' block by rebuilding the 'Goal' authority set from all the authority-set-change digests (validator-set hand-offs) since the 'Checkpoint' block (I did some coding around this to get an idea of the sizes involved).
We are fine with this because only one block gets signed at any height, so I guess we rely on GRANDPA equivocation to assume that.

One thing I am still missing from this picture is:
how do you prove that the light client received all the validator-set hand-offs from the full node without checking all consecutive headers?
(Skipping a validator-set hand-off in the reply leads to '(set_id, round_number)' pairs that cannot be equivocated against because they never existed in the first place.)

@rphmeier
Contributor Author

rphmeier commented Sep 4, 2020

How do you prove that the light client receive all the validator-set hand-offs from the full node without checking all consecutive headers?

Why would you need consecutive headers to prove this? Yes, we are assuming that there are no safety violations and that GRANDPA finalizes only one block at each height. So you can just skip all the headers between any two blocks whose finality you actually need to prove.

@tomaka
Contributor

tomaka commented Sep 23, 2020

One little uncertain point: after we've reached the head of the finalized chain and can switch to regular syncing, we need information about the current and next Babe epochs in order to verify incoming blocks.

As far as I understand, the Babe epoch info should, most of the time but not always, be in the blocks sent during the warp sync. Is that accurate?

If a Babe epoch info is missing, we can always do an exhaustive block request for the blocks before the latest finalized one (network block requests support asking for blocks in decreasing order starting from a certain hash), so it's not an unsolvable problem.

@cheme
Contributor

cheme commented Sep 23, 2020

I had the same question, mainly: is the Babe state reconstructible from a storage proof at the warp block? And should we maybe add info to the Babe pallet's storage?
But I do not have an answer for it.

Just replying to the previous post with the information I collected in the meantime:

Why would you need consecutive headers to prove this?

Current GRANDPA equivocation reporting seems to only check for double signing within a given validator set and a given round. I think this is currently enough because we replay all of history, IIUC (which I am not sure about).
With warping, a majority of validators from any past set can sign some (set_id, round) that never actually happened (e.g. if set 1 ended at round 5, they can produce a set 1 signature at round 8, serve it through warp sync, change the validator set this way, and then build a different chain) to make a client sync onto a different finalized chain, and there will be no equivocation for it.
The light client will be unable to restore all the history from block Goal to Checkpoint (that would require an equivocable block), but it will still think that Goal is finalized.
So I suppose we also need to be able to slash validators for any signature on a round that is not valid for their validator set (a new kind of equivocation to handle; it may require writing the round bounds of the last 28 days of validator sets into the state).

@expenses
Contributor

I'm looking into implementing this on the Substrate side now. As far as I'm aware, the network request should include a checkpoint block hash, and the response should include a list of block headers that contain the validator-set hand-offs to that checkpoint. Is that correct? Could we reduce the size of the response by sending finality proofs instead?

@cheme
Contributor

cheme commented Nov 23, 2020

a list of block headers that contain the validator set hand-offs to that checkpoint. Is that correct?

Not sure, but that is what I started doing in master...cheme:light_grandpa_warp. I did test it a bit (I don't remember well, but I think I hardcoded the construction and checking of such a sync proof from block 0 to block n, executed when opening the db on an existing chain; see the method 'test_it'. I am not sure what the deletions in the linked diff are about). It was preliminary work to check that it works well.
(I stopped working on it for the reason mentioned in my previous post, but it is not much work up to this point, so it is probably best to start from a clean state.)

Could we reduce the size of the response by sending finality proofs instead?

Not sure, but I think that to trust a finality proof we would need to sync the underlying state first (from what I understand, we trust a justification because equivocation is possible on it).

@expenses
Contributor

expenses commented Dec 4, 2020

@cheme how close is this branch to being mergeable?

@cheme
Contributor

cheme commented Dec 4, 2020

It is not mergeable. IIRC I was just building and validating the proof at node start for some hardcoded range, but it is not integrated at all (it was more to assert that everything goes well).

@tomaka
Contributor

tomaka commented Dec 4, 2020

Quick networking implementation guide:

  • Duplicate this file (the link is from an un-merged PR, which will hopefully be merged soon) and adjust it to serve warp sync requests instead of block requests. The protocol name (line 57) should be adjusted. The request payload (line 87) should probably consist of just a block height to start from.

  • Add a field to sc_network::config::Params like here.

  • Also duplicate this block, and pass the thing you create to the Params a couple of lines below.
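
For illustration, here is a minimal sketch of what the request-handling task might look like once wired up; the types, channels, and fragment lookup are assumptions, not the actual sc_network interfaces:

use std::sync::mpsc;

// One inbound warp-sync request, as it might be delivered by the networking layer.
struct IncomingWarpRequest {
    // Block height the client wants to start from (per the request payload above).
    begin: u64,
    // Channel on which to send back the encoded response.
    reply: mpsc::Sender<Vec<u8>>,
}

// Answer warp-sync requests until the inbound channel is closed.
// `fetch_fragments` looks up the encoded hand-off fragments starting at `begin`,
// capped so that the response fits in a single network message.
fn run_warp_sync_handler(
    requests: mpsc::Receiver<IncomingWarpRequest>,
    fetch_fragments: impl Fn(u64) -> Vec<u8>,
) {
    for request in requests {
        let response = fetch_fragments(request.begin);
        // Ignore send errors: the peer may have disconnected in the meantime.
        let _ = request.reply.send(response);
    }
}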

@expenses
Contributor

expenses commented Dec 7, 2020

I've created a branch: https://github.com/paritytech/substrate/compare/ashley-grandpa-warp-sync that follows @tomaka's instructions (#1208 (comment)) and merges @cheme's branch (#1208 (comment)). I haven't tested it yet but I believe the basic functionality is there.

What's the best way to go about testing this?

@cheme
Contributor

cheme commented Dec 7, 2020

Not really sure about testing (the tests in client/finality-grandpa look interesting since there are some authority set changes, but I am not sure that is the best approach; the tricky part would be testing with an invalid authority change).
Also, I realized I mistakenly deleted "bin/node/inspect/src/lib.rs" in my branch.

@tomaka
Contributor

tomaka commented Apr 7, 2021

Update: the "server" side has been implemented. See here.

@stale

stale bot commented Jul 7, 2021

Hey, is anyone still working on this? Due to the inactivity this issue has been automatically marked as stale. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the A5-stale Pull request did not receive any updates in a long time. No review needed at this stage. Close it. label Jul 7, 2021
@rphmeier
Contributor Author

rphmeier commented Jul 7, 2021

Server-side implemented, client not.

@stale stale bot removed the A5-stale Pull request did not receive any updates in a long time. No review needed at this stage. Close it. label Jul 7, 2021
@ghost ghost closed this as completed in #9227 Aug 2, 2021