
Fast joiner catchup #63

Closed
achamayou opened this issue May 13, 2019 · 8 comments

Comments

@achamayou (Member)

There should be a more efficient late-join/recovery mechanism than replaying through history.

ashamis self-assigned this Jul 10, 2019
ashamis added this to the ePBFT in CCF milestone Jul 10, 2019
@achamayou (Member Author)

Depends on #236

@achamayou (Member Author)

Following discussion with @olgavrou, here's a sketch of how this could work for CFT.

A snapshot is a transaction made of a write-set containing all keys in all replicated tables at a globally committed version. Its version is the committed version for which it was produced. It contains a flag identifying it as a snapshot.
A node producing a snapshot signs a digest of that snapshot. That signature is not stored in the snapshot itself. The node records the snapshot and its evidence in a file on disk (the snapshot file: a series of snapshots, similar to but distinct from the ledger file).
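
As a rough illustration (the names and serialisation here are made up, not CCF's actual API), producing a snapshot and its detached evidence could look something like this:

```python
import hashlib
import json


def produce_snapshot(replicated_tables, committed_version, sign_fn):
    # The snapshot is a write-set over all keys in all replicated tables,
    # flagged as a snapshot and stamped with the globally committed version.
    snapshot = {
        "is_snapshot": True,
        "version": committed_version,
        "write_set": replicated_tables,
    }
    serialised = json.dumps(snapshot, sort_keys=True).encode()
    # The producing node signs a digest of the snapshot; the signature is
    # kept as separate evidence rather than stored in the snapshot itself.
    # sign_fn is assumed to return, e.g., a hex-encoded signature.
    digest = hashlib.sha256(serialised).hexdigest()
    evidence = {"digest": digest, "signature": sign_fn(digest)}
    return serialised, evidence


def append_to_snapshot_file(path, serialised, evidence):
    # The snapshot file is a series of snapshots, kept alongside but
    # separate from the ledger file.
    with open(path, "ab") as f:
        f.write(len(serialised).to_bytes(8, "little") + serialised)
        f.write(json.dumps(evidence).encode() + b"\n")
```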

A new node connecting to a network can request the most recent snapshot, along with its evidence, from the node it initially connects to. It writes the evidence for the snapshot to a derived table (to preserve offline verifiability), and starts operating from that commit onwards.
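
The join path could then be sketched along these lines (request_snapshot, verify_fn and the kv_store interface are placeholders for whatever the node-to-node and KV APIs end up being, not existing functions):

```python
import hashlib
import json


def join_from_snapshot(request_snapshot, verify_fn, kv_store):
    # Fetch the most recent snapshot and its evidence from the node the
    # joiner initially connects to.
    serialised, evidence = request_snapshot()
    digest = hashlib.sha256(serialised).hexdigest()
    if digest != evidence["digest"] or not verify_fn(evidence):
        raise ValueError("snapshot does not match its evidence")
    snapshot = json.loads(serialised)
    # Apply the snapshot write-set at its version, and keep the evidence in
    # a derived table so the join remains verifiable offline.
    kv_store.apply_write_set(snapshot["write_set"], snapshot["version"])
    kv_store.derived("snapshot_evidence")[snapshot["version"]] = evidence
    # The node operates from this commit onwards.
    return snapshot["version"]
```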

On catastrophic recovery, a list of snapshots with attached evidence can be collated. A recovery proposed from a snapshot would also expose the aggregated evidence from old nodes, which can be examined by the members when voting for/against the recovery. A straightforward case could be recovery from a snapshot with evidence from at least f+1 nodes as of the last known configuration.
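
For the governance side, the check members could apply when voting on such a recovery might amount to something like the following (purely illustrative, assuming a CFT configuration of n = 2f+1 nodes):

```python
def enough_recovery_evidence(snapshot_digest, evidence_list, last_known_config):
    # CFT fault threshold for a configuration of n = 2f+1 nodes.
    f = (len(last_known_config) - 1) // 2
    endorsers = {
        ev["node_id"]
        for ev in evidence_list
        if ev["digest"] == snapshot_digest and ev["node_id"] in last_known_config
    }
    # Straightforward acceptance rule: evidence from at least f+1 nodes of
    # the last known configuration.
    return len(endorsers) >= f + 1
```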

To allow for this, it is preferable that snapshots be produced at the same version on all nodes, for example every N globally committed versions.

During snapshot production, compaction needs to be momentarily turned off. The KV can record the most recent compaction event triggered during snapshot production and trigger it after the production ends.
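
Deferring compaction while a snapshot is produced, with snapshots taken every N globally committed versions, could be structured roughly like this (a sketch, not the actual KV implementation):

```python
class KVStoreSketch:
    def __init__(self, snapshot_interval):
        self.snapshot_interval = snapshot_interval  # N
        self.snapshotting = False
        self.pending_compaction = None  # most recent deferred compaction version

    def compact(self, version):
        if self.snapshotting:
            # Record the most recent compaction request and defer it.
            self.pending_compaction = max(self.pending_compaction or 0, version)
            return
        self._do_compact(version)

    def maybe_snapshot(self, committed_version):
        # Snapshots are produced at the same versions on all nodes,
        # every N globally committed versions.
        if committed_version % self.snapshot_interval != 0:
            return
        self.snapshotting = True
        try:
            self._produce_snapshot(committed_version)
        finally:
            self.snapshotting = False
            if self.pending_compaction is not None:
                self._do_compact(self.pending_compaction)
                self.pending_compaction = None

    def _do_compact(self, version):
        pass  # actual compaction elided in this sketch

    def _produce_snapshot(self, version):
        pass  # actual snapshot production elided in this sketch
```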

It may be useful for the primary to replicate the digest of the most recent snapshot. A late joiner would be able to remove the snapshot evidence once that digest commits (because it is then attributable to the primary).
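
The cleanup step could then hang off a commit hook, again with made-up names:

```python
def on_committed(tx, evidence_table):
    # Once the primary's replicated digest of the snapshot commits, the
    # snapshot is attributable to the primary and the locally stored
    # evidence can be dropped.
    if tx.get("type") == "snapshot_digest":
        evidence_table.pop(tx["snapshot_version"], None)
```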

@olgavrou (Member) commented Dec 17, 2019

I think the snapshot should be over all tables - not just the replicated ones. If you are using the snapshot to catch up, then you don't want to execute all the requests from the beginning of time to arrive at the same derived state.

@achamayou (Member Author)

This sketch is for CFT; for BFT the scheme needs some variation (for example, the snapshots must be signed by more than just the producing node).

If we also use the derived tables to store the snapshot evidence, I don't think they can be part of the snapshot itself.

@ashamis (Contributor) commented Jan 5, 2020

The BFT variant should not require too many changes from what is described here.

Every replica should generate a checkpoint at every X pre-prepares. X should be part of the configuration of CCF.

On the X + K pre-prepare, the primary would send out the digest of the checkpoint and the replicas would send out their digests in their prepare messages. Assuming 2f+1 agree with the primary (and other requirements hold), the checkpoint at X + K can be committed; otherwise a view-change is required and the agreement process at X + K will be restarted. Again, K is a CCF configuration parameter.

Once we have agreement on the checkpoint, each machine can persist its own checkpoint along with proof that its checkpoint is correct.

Here is a summary:

  • All the replicas would attempt to create a checkpoint at a well-known interval.
  • The checkpoint created by each machine must be byte-for-byte identical. This is in keeping with the principle that we use for creating the ledger.
  • A checkpoint stored on any machine needs to contain enough proof that it can be deemed correct.
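
To make the agreement step concrete, here is an illustrative check (not the actual implementation) of whether the checkpoint digest reported at pre-prepare X + K has gathered enough support, assuming n = 3f+1 replicas:

```python
def checkpoint_agreed(primary_digest, reported_digests, n):
    # reported_digests: the checkpoint digests seen at pre-prepare X + K,
    # including the primary's own; n = 3f+1 replicas.
    f = (n - 1) // 3
    matching = sum(1 for d in reported_digests if d == primary_digest)
    # 2f+1 matching digests let the checkpoint commit; otherwise the replica
    # should request a view-change and restart agreement at X + K.
    return matching >= 2 * f + 1
```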

jumaffre changed the title from "State checkpointing" to "Fast joiner catchup" on May 5, 2020
@jumaffre (Contributor) commented May 5, 2020

Points discussed with @achamayou to consider in implementing this:

  • Is the snapshot stored in the KV or passed between nodes in some other way than KV replication? Probably not stored in the KV, as it would mean keeping the entire snapshot in memory.
  • Is the snapshot stored in the ledger or in a different file? A different file seems better for now, especially as the ledger is likely to change slightly (see Split ledger into multiple files #1135).
  • Is it only the primary that produces snapshots or can backups do it as well?
  • If the snapshot is signed by the node producing it, how does a new joiner verify that the snapshot can be trusted?
  • How does this impact the Merkle tree?
  • Is the snapshot a generalisation of the genesis transaction?
  • How does this impact the recovery protocol? For now, it is OK if the recovery ignores snapshots and replays the transactions from the beginning of time, as recovery is an expensive procedure anyway. It also means that all the governance operations since genesis can be audited.

I will investigate this further in the next few days and aim to provide answers to all of these.

@achamayou (Member Author)

What can probably be replicated through the consensus is a digest of the snapshot, signed by the producing node in the case of Raft. That way it's at least possible to get consensus on that.

@achamayou (Member Author)

Done in #1500, #1532 and others, and now available as an experimental feature. Fixes and improvements to be tracked in further tickets.
