
Fast joiner catchup #63

Closed
achamayou opened this issue May 13, 2019 · 8 comments

Comments

@achamayou (Member)

There should be a more efficient late-join/recovery mechanism than replaying through history.

ashamis self-assigned this Jul 10, 2019
ashamis added this to the ePBFT in CCF milestone Jul 10, 2019
@achamayou (Member Author)

Depends on #236

@achamayou (Member Author)

Following discussion with @olgavrou, here's a sketch of how this could work for CFT.

A snapshot is a transaction made of a write-set containing all keys in all replicated tables at a globally committed version. Its version is the committed version for which it was produced. It contains a flag identifying it as a snapshot.
A node producing a snapshot signs a digest of that snapshot. That signature is not stored in the snapshot itself. The node records the snapshot and its evidence in a file on disk (the snapshot file: a series of snapshots, similar to but distinct from the ledger file).
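
As a rough illustration (the names and serialisation here are made up, not CCF's actual API), producing a snapshot and its detached evidence could look something like this:

```python
import hashlib
import json


def produce_snapshot(replicated_tables, committed_version, sign_fn):
    # The snapshot is a write-set over all keys in all replicated tables,
    # flagged as a snapshot and stamped with the globally committed version.
    snapshot = {
        "is_snapshot": True,
        "version": committed_version,
        "write_set": replicated_tables,
    }
    serialised = json.dumps(snapshot, sort_keys=True).encode()
    # The producing node signs a digest of the snapshot; the signature is
    # kept as separate evidence rather than stored in the snapshot itself.
    # sign_fn is assumed to return, e.g., a hex-encoded signature.
    digest = hashlib.sha256(serialised).hexdigest()
    evidence = {"digest": digest, "signature": sign_fn(digest)}
    return serialised, evidence


def append_to_snapshot_file(path, serialised, evidence):
    # The snapshot file is a series of snapshots, kept alongside but
    # separate from the ledger file.
    with open(path, "ab") as f:
        f.write(len(serialised).to_bytes(8, "little") + serialised)
        f.write(json.dumps(evidence).encode() + b"\n")
```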

A new node connecting to a network can request the most recent snapshot, along with its evidence, from the node it initially connects to. It writes the evidence for the snapshot to a derived table (to preserve offline verifiability), and starts operating from that commit onwards.
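
The join path could then be sketched along these lines (request_snapshot, verify_fn and the kv_store interface are placeholders for whatever the node-to-node and KV APIs end up being, not existing functions):

```python
import hashlib
import json


def join_from_snapshot(request_snapshot, verify_fn, kv_store):
    # Fetch the most recent snapshot and its evidence from the node the
    # joiner initially connects to.
    serialised, evidence = request_snapshot()
    digest = hashlib.sha256(serialised).hexdigest()
    if digest != evidence["digest"] or not verify_fn(evidence):
        raise ValueError("snapshot does not match its evidence")
    snapshot = json.loads(serialised)
    # Apply the snapshot write-set at its version, and keep the evidence in
    # a derived table so the join remains verifiable offline.
    kv_store.apply_write_set(snapshot["write_set"], snapshot["version"])
    kv_store.derived("snapshot_evidence")[snapshot["version"]] = evidence
    # The node operates from this commit onwards.
    return snapshot["version"]
```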

On catastrophic recovery, a list of snapshots with attached evidence can be collated. A recovery proposed from a snapshot would also expose the aggregated evidence from old nodes, which can be examined by the members when voting for/against the recovery. A straightforward case could be recovery from a snapshot with evidence from at least f+1 nodes as of the last known configuration.
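
For the governance side, the check members could apply when voting on such a recovery might amount to something like the following (purely illustrative, assuming a CFT configuration of n = 2f+1 nodes):

```python
def enough_recovery_evidence(snapshot_digest, evidence_list, last_known_config):
    # CFT fault threshold for a configuration of n = 2f+1 nodes.
    f = (len(last_known_config) - 1) // 2
    endorsers = {
        ev["node_id"]
        for ev in evidence_list
        if ev["digest"] == snapshot_digest and ev["node_id"] in last_known_config
    }
    # Straightforward acceptance rule: evidence from at least f+1 nodes of
    # the last known configuration.
    return len(endorsers) >= f + 1
```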

To allow for this, it is preferable that snapshots be produced at the same version on all nodes, for example every N globally committed versions.

During snapshot production, compaction needs to be momentarily turned off. The KV can record the most recent compaction event triggered during snapshot production and trigger it after the production ends.
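
Deferring compaction while a snapshot is produced, with snapshots taken every N globally committed versions, could be structured roughly like this (a sketch, not the actual KV implementation):

```python
class KVStoreSketch:
    def __init__(self, snapshot_interval):
        self.snapshot_interval = snapshot_interval  # N
        self.snapshotting = False
        self.pending_compaction = None  # most recent deferred compaction version

    def compact(self, version):
        if self.snapshotting:
            # Record the most recent compaction request and defer it.
            self.pending_compaction = max(self.pending_compaction or 0, version)
            return
        self._do_compact(version)

    def maybe_snapshot(self, committed_version):
        # Snapshots are produced at the same versions on all nodes,
        # every N globally committed versions.
        if committed_version % self.snapshot_interval != 0:
            return
        self.snapshotting = True
        try:
            self._produce_snapshot(committed_version)
        finally:
            self.snapshotting = False
            if self.pending_compaction is not None:
                self._do_compact(self.pending_compaction)
                self.pending_compaction = None

    def _do_compact(self, version):
        pass  # actual compaction elided in this sketch

    def _produce_snapshot(self, version):
        pass  # actual snapshot production elided in this sketch
```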

It may be useful for the primary to replicate the digest of the most recent snapshot. A late joiner would be able to remove the snapshot evidence once that digest commits (because it is then attributable to the primary).
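
The cleanup step could then hang off a commit hook, again with made-up names:

```python
def on_committed(tx, evidence_table):
    # Once the primary's replicated digest of the snapshot commits, the
    # snapshot is attributable to the primary and the locally stored
    # evidence can be dropped.
    if tx.get("type") == "snapshot_digest":
        evidence_table.pop(tx["snapshot_version"], None)
```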

@olgavrou (Member) commented Dec 17, 2019

I think the snapshot should be over all tables - not just the replicated ones. If you are using the snapshot to catch up, then you don't want to execute all the requests from the beginning of time to arrive at the same derived state.

@achamayou (Member Author)

This sketch is for CFT; for BFT the scheme needs some variation (for example, the snapshots must be signed by more than just the producing node).

If we also use the derived tables to store the snapshot evidence, I don't think they can be part of the snapshot itself.

@ashamis (Contributor) commented Jan 5, 2020

The BFT variant should not require too many changes from what is described here.

Every replica should generate a checkpoint at every X pre-prepares. X should be part of the configuration of CCF.

On the X + K pre-prepare, the primary would send out the digest of the checkpoint and the replicas would send out their digests in their prepare messages. Assuming 2f+1 agree with the primary (and other requirements hold), the checkpoint at X + K can be committed; otherwise a view-change is required and the agreement process at X + K will be restarted. Again, K is a CCF configuration parameter.

Once we have agreement on the checkpoint, each machine can persist its own checkpoint along with proof that its checkpoint is correct.

Here is a summary:

  • All the replicas would attempt to create a checkpoint at a well-known interval.
  • The checkpoint created by each machine must be byte-for-byte identical. This is in keeping with the principle that we use for creating the ledger.
  • A checkpoint stored on any machine needs to contain enough proof that it can be deemed correct.
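
To make the agreement step concrete, here is an illustrative check (not the actual implementation) of whether the checkpoint digest reported at pre-prepare X + K has gathered enough support, assuming n = 3f+1 replicas:

```python
def checkpoint_agreed(primary_digest, reported_digests, n):
    # reported_digests: the checkpoint digests seen at pre-prepare X + K,
    # including the primary's own; n = 3f+1 replicas.
    f = (n - 1) // 3
    matching = sum(1 for d in reported_digests if d == primary_digest)
    # 2f+1 matching digests let the checkpoint commit; otherwise the replica
    # should request a view-change and restart agreement at X + K.
    return matching >= 2 * f + 1
```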

jumaffre changed the title from "State checkpointing" to "Fast joiner catchup" on May 5, 2020
@jumaffre (Contributor) commented May 5, 2020

Points discussed with @achamayou to consider in implementing this:

  • Is the snapshot stored in the KV or passed between nodes in some other way than KV replication? Probably not stored in the KV, as it would mean keeping the entire snapshot in memory.
  • Is the snapshot stored in the ledger or in a different file? A different file seems better for now, especially as the ledger is likely to change slightly (see Split ledger into multiple files #1135).
  • Is it only the primary that produces snapshots or can backups do it as well?
  • If the snapshot is signed by the node producing it, how does a new joiner verify that the snapshot can be trusted?
  • How does this impact the Merkle tree?
  • Is the snapshot a generalisation of the genesis transaction?
  • How does this impact the recovery protocol? For now, it is OK if the recovery ignores snapshots and replays the transactions from the beginning of time, as recovery is an expensive procedure anyway. It also means that all the governance operations since genesis can be audited.

I will investigate this further in the next few days and aim to provide answers to all of these.

@achamayou (Member Author)

What can probably be replicated through the consensus is a digest of the snapshot, signed by the producing node in the case of Raft. That way it's at least possible to get consensus on that.

@achamayou (Member Author)

Done in #1500, #1532 and others, and now available as an experimental feature. Fixes and improvements to be tracked in further tickets.
