RFC: sync refactor #1073
-
Some more thoughts, noting them down before I keep re-writing them in different forms again and again. Let's split our design into two modes, sync and tracking, where the latter is when we are following the tip of the chain. The reason to split these is that there are many different options for a sync mode, and I don't want sync to constrain the design of tracking. Some potential sync possibilities:
These are all somewhat different, and should not be coupled to the design of when we are tracking -- which should probably be the most optimised part, since this is where we are going to be 99% of the time.
Tracking mode
There are several ways to approach this design. Let's first consider what data events we will need to deal with. Note that because we are in tracking mode, we are currently "in sync", i.e. we are within some reasonable number of blocks of the known tip of the chain, in terms of both L1 and L2. And ideally we are literally at the tip.
Data events
These are the possible data events as I see them. Maybe in the future there will be more. I am also assuming that the data coming in has been verified somehow, i.e. at this point it is canonical data.
Concurrency
The simplest solution here is to process all events sequentially, i.e. something like the sketch below.
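A minimal sketch of that, assuming tokio channels; every name here is a hypothetical stand-in, not a real pathfinder type:

```rust
use tokio::sync::mpsc;

// Hypothetical stand-ins for the real pathfinder types.
struct Block;
struct BlockNumber(u64);
struct State;

async fn apply_block(_state: &mut State, _block: Block) { /* write to storage */ }
async fn repair(_state: &mut State, _from: BlockNumber) { /* network IO + reprocessing */ }

enum Event {
    NewBlock(Block),
    RepairNeeded(BlockNumber),
}

// Sequential: each event is handled to completion before the next one is
// taken off the queue, so a slow repair stalls tracking entirely.
async fn track(mut events: mpsc::Receiver<Event>, mut state: State) {
    while let Some(event) = events.recv().await {
        match event {
            Event::NewBlock(block) => apply_block(&mut state, block).await,
            Event::RepairNeeded(from) => repair(&mut state, from).await,
        }
    }
}
```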
The issue with this is that the repair process includes network IO, and some substantial processing, especially if it involves state updates or many class definitions, for example. This means our tracking process will block while a repair is underway. Probably not good, and difficult to assess the impact, because hopefully we always hit the happy path. An alternative is to process all these events and processes concurrently. However, this leads to the problem of state consistency.
State consistency
It's essentially the same problem Rust solves, except in our case it also involves a database, which Rust cannot capture semantically. If we have concurrent tasks then we need some synchronisation over updating the pathfinder state. A task has some view of the local state which may become invalidated by another task mutating it. One can devise complicated schemes where each task gets to decide when it's invalidated, but I think it's probably better to have a central place which has sole dominion over mutating state. This central driver can also decide when a task has become invalidated and may be cancelled.
Task lifecycle
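A rough sketch of what such a lifecycle could look like, assuming a tokio mpsc channel for task results and a simple version counter for deciding invalidation (all names hypothetical):

```rust
use tokio::sync::mpsc;

// Hypothetical sketch, not pathfinder's real types.
struct StateVersion(u64);

struct TaskResult {
    based_on: StateVersion, // the state version the task started from
    // ...plus whatever data the task produced
}

struct Controller {
    version: StateVersion,
    results: mpsc::Receiver<TaskResult>,
}

impl Controller {
    async fn run(&mut self) {
        while let Some(result) = self.results.recv().await {
            if result.based_on.0 != self.version.0 {
                // The task's view of state was invalidated by a later
                // mutation: discard its result (and optionally respawn it).
                continue;
            }
            // Sole dominion: only the controller mutates state.
            self.apply(result);
            self.version.0 += 1;
        }
    }

    fn apply(&mut self, _result: TaskResult) {
        // verify the result against current state, then write it
    }
}
```

The key property is that tasks never touch state directly; they only produce results which the controller may accept or discard.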
Testability
I think we can neatly split a task and the controller into different parts. We can test the task by itself -- especially nice if it's pure. We can test the state diff verification and the application separately. And then the logic of the controller over when it spawns tasks, i.e. a small state machine of sorts.
-
Mostly implemented at this point.
-
This is a proposal for rewriting our sync module to be simpler and to allow for different data sources, in particular p2p.
Status Quo
The current sync module is event based. This was a response to keeping all sqlite IO in a single thread / location -- the top level sync loop. This had the desired effect, but unfortunately logic and responsibilities are now shared between the outer loop receiving the events and the L1 and L2 threads generating the events. This has gotten worse over time, as the abstraction became... difficult.
One particular pain point is that the L2 logic has to "query" the database via events with an embedded oneshot channel. Additionally, the knowledge of when it is safe to switch between pending state and other business logic lives in L2, but must also be acted on in the outer event loop.
Testing is particularly annoying, to the point where the main test has just been commented out / ignored.
There is also no distinction between syncing and not-syncing.
Proposal
Instead of an event based system, I propose just using async and a central logic unit driving it. Effectively, instead of having L1 & L2 be separate threads with an event channel connecting to the outer sync loop, just have L1 and L2 be async functions which sync controls.
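Roughly this shape, as a sketch (names are placeholders, assuming tokio's select!):

```rust
use tokio::select;

// All names here are placeholders, not pathfinder's real types.
struct State;
struct L1Update;
struct L2Update;

async fn poll_l1() -> L1Update { L1Update }
async fn poll_l2() -> L2Update { L2Update }

impl State {
    fn apply_l1(&mut self, _update: L1Update) { /* all sqlite IO stays here */ }
    fn apply_l2(&mut self, _update: L2Update) { /* all sqlite IO stays here */ }
}

// Sync is the single owner of state; L1 and L2 are just futures it
// drives, instead of threads pushing events over a channel.
async fn sync(mut state: State) {
    loop {
        select! {
            update = poll_l1() => state.apply_l1(update),
            update = poll_l2() => state.apply_l2(update),
        }
    }
}
```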
Data source abstraction
Create a p2p-focussed data source interface. The current gateway source can then be adapted to match this interface, until such a time as we enable p2p in general (and it can still be a fallback).
Something simple, like:
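As an illustration only -- the real method set, batching, and error type would be dictated by the p2p protocol, and every name below is a placeholder:

```rust
// Hypothetical stand-ins for the real types.
struct BlockNumber(u64);
struct BlockHeader;
struct BlockBody;
struct StateUpdate;
struct SourceError;

// async fn in traits is fine on Rust 1.75+; older code would use the
// async-trait crate instead.
trait DataSource {
    /// A contiguous run of headers starting at `start`.
    async fn block_headers(&self, start: BlockNumber, count: usize)
        -> Result<Vec<BlockHeader>, SourceError>;

    /// Transactions and receipts for the same run of blocks.
    async fn block_bodies(&self, start: BlockNumber, count: usize)
        -> Result<Vec<BlockBody>, SourceError>;

    /// State updates (and referenced class definitions) for the run.
    async fn state_updates(&self, start: BlockNumber, count: usize)
        -> Result<Vec<StateUpdate>, SourceError>;
}
```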
The exact semantics are coupled with the p2p protocol, but this should be a rough match.
The gateway implementation of this API could easily cache its results (since it gets block header + body in one go), and still return from cache first.
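A sketch of that idea (hypothetical names; a real cache would also bound its size):

```rust
use std::collections::HashMap;

// Hypothetical: one gateway request returns header + body together, so
// fetching one half can populate a cache for the other.
struct BlockHeader;
struct BlockBody;

struct GatewaySource {
    body_cache: HashMap<u64, BlockBody>,
}

impl GatewaySource {
    async fn block_header(&mut self, number: u64) -> BlockHeader {
        // The gateway returns the full block; keep the body around so a
        // later body request is served from cache first.
        let (header, body) = self.fetch_block(number).await;
        self.body_cache.insert(number, body);
        header
    }

    async fn block_body(&mut self, number: u64) -> BlockBody {
        if let Some(body) = self.body_cache.remove(&number) {
            return body; // cache hit from an earlier header fetch
        }
        let (_, body) = self.fetch_block(number).await;
        body
    }

    async fn fetch_block(&mut self, _number: u64) -> (BlockHeader, BlockBody) {
        (BlockHeader, BlockBody) // stand-in for the actual gateway call
    }
}
```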
Sync function
There are several potential approaches here, and I think just trying it out will lead to a better feeling of what is good. Most of them boil down to a state machine implementation of some kind.
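For example, one could make the states explicit as an enum with a transition function -- this is just one illustrative shape, not a committed design:

```rust
// Purely illustrative: sync as an explicit state machine, with tracking
// as the steady state.
enum SyncState {
    // Far behind: bulk-download historic blocks.
    Syncing { next: u64 },
    // At (or near) the tip: process new blocks as they arrive.
    Tracking { head: u64 },
}

fn step(state: SyncState, chain_tip: u64) -> SyncState {
    match state {
        SyncState::Syncing { next } if next >= chain_tip => {
            // Caught up: switch over to tracking mode.
            SyncState::Tracking { head: next }
        }
        // Remaining transitions (reorgs, falling behind, ...) elided.
        other => other,
    }
}
```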
Testing
Ideally we can just test each sub-component in isolation, using pure functions for handlers so we can trigger them with manual test input, instead of requiring a future to complete, an event to be sent, or an API to be mocked.
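For instance, a pure handler can be tested by calling it directly with values (hypothetical names):

```rust
// Illustrative: if a handler is a pure function of (state, input), a test
// just calls it -- no mocks, channels, or futures needed.
#[derive(Debug, PartialEq)]
struct TrackingState {
    head: u64,
}

fn on_new_block(state: TrackingState, block_number: u64) -> TrackingState {
    TrackingState { head: block_number.max(state.head) }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn new_block_advances_head() {
        let state = TrackingState { head: 5 };
        assert_eq!(on_new_block(state, 6), TrackingState { head: 6 });
    }
}
```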