vsr: sync uses correct view to go into recovering_head #1705

matklad · 2024-03-15T11:58:00Z

It might be the case that op_checkpoint:

* is prepared in view X
* is truncated in view X+1
* is committed in view X+2

When deciding whether to go into recovering_head state, the replica
currently considers the prepared view (checkpoint.header.view), and not the
committed view.

Use the correct view by:

* Adding checkpoint's view into VSR State (but _not_ checkpoint state:
  different replicas might commit prepares at different views)
* Populating that view from the message that informed us about the
  checkpoint target
  - this requires some intricate logic on pings, to make sure they
    indeed propagate correct view for checkpoint --- a replica accepts a
    checkpoint before it transitions to its view, and it should
    subsequently correctly propagate this higher view.

    This works because checkpoint view is also durable.
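
The prepared/committed mismatch described above can be illustrated with a minimal sketch. `SyncTarget`, `must_recover_head`, and the comparison rule itself are hypothetical simplifications for illustration, not the replica's actual code; the only assumption carried over from the description is that the decision compares some checkpoint view against the replica's view.

```zig
const std = @import("std");

/// Hypothetical, simplified stand-in for a checkpoint that a lagging replica syncs to.
const SyncTarget = struct {
    /// View in which the checkpoint's prepare was created (checkpoint.header.view).
    prepared_view: u32,
    /// View of the message that advertised the checkpoint as committed.
    committed_view: u32,
};

/// Assumed rule: the replica recovers its head when the checkpoint
/// belongs to a view newer than the replica's own view.
fn must_recover_head(replica_view: u32, checkpoint_view: u32) bool {
    return checkpoint_view > replica_view;
}

test "prepared view can underestimate the committed view" {
    // Prepared in view X = 5, truncated in X+1 = 6, committed in X+2 = 7.
    const target = SyncTarget{ .prepared_view = 5, .committed_view = 7 };
    const replica_view: u32 = 6;

    // Comparing against the prepared view misses the newer view.
    try std.testing.expect(!must_recover_head(replica_view, target.prepared_view));
    // Comparing against the committed view detects it.
    try std.testing.expect(must_recover_head(replica_view, target.committed_view));
}
```

With `zig test`, the single test passes: the same replica reaches opposite decisions depending on which of the two views it consults.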

Seed: 8593423301425288917
Closes: #1703

matklad · 2024-03-15T13:30:20Z

The last commit fixes #1702 -- a semi-related VOPR seed that also trips on main.

sentientwaffle

Should we update sync_view during superblock.checkpoint()? Then we would have the invariant that sync_view >= checkpoint.header.view.

sentientwaffle · 2024-03-15T14:01:40Z

src/vsr/replica.zig

-            assert(message.header.view <= self.view);
+            // Pings advertise checkpoints, and current checkpoint's view might be greater than
+            // the replica view.
+            if (message.header.view > self.view) {


Should this be > self.view_durable()? We might have just called transition_to_normal_from_recovering_head_status() so our view_durable is not yet updated, but our self.view is updated.

matklad

I don't think so. I believe that

    if (message.header.view > self.view_durable()) {
        assert(self.superblock.working.vsr_state.sync_view > self.superblock.working.vsr_state.view);
    }

would hold, but that's an almost tautological assertion.

In contrast,

    if (message.header.view > self.view_durable()) {
        assert(self.status == .recovering_head);
    }

would not hold -- transition_to_normal_from_recovering_head_status changes status, but leaves self.superblock.working.vsr_state intact until the durable update finishes.

And

    if (message.header.view > self.view) {
        assert(self.status == .recovering_head);
    }

is the interesting case -- it ensures that we can't get out of the .recovering_head state earlier than self.superblock.working.vsr_state.view.

sentientwaffle

Ahh I get it; thank you.
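
The difference between comparing against `self.view` and `self.view_durable()` can be sketched with a toy model. Everything here is hypothetical: `Replica` keeps only three fields, `view_durable` is a plain field standing in for the `view_durable()` method, and `message_is_ahead` is an illustrative name rather than a function from `replica.zig`.

```zig
const std = @import("std");
const assert = std.debug.assert;

const Status = enum { normal, view_change, recovering, recovering_head };

/// Hypothetical, heavily simplified replica with only the fields from the discussion.
const Replica = struct {
    status: Status,
    /// In-memory view.
    view: u32,
    /// View last persisted in the superblock (stand-in for view_durable()).
    view_durable: u32,

    /// Returns true when a message that advertises a checkpoint carries a view
    /// ahead of the in-memory view; asserts the invariant from the thread.
    fn message_is_ahead(self: Replica, message_view: u32) bool {
        if (message_view > self.view) {
            assert(self.status == .recovering_head);
            return true;
        }
        return false;
    }
};

test "ahead of the in-memory view implies recovering_head" {
    const replica = Replica{ .status = .recovering_head, .view = 5, .view_durable = 5 };
    try std.testing.expect(replica.message_is_ahead(7));
}

test "view_durable may lag right after leaving recovering_head" {
    // Status and view are already updated in memory; the durable update is in flight.
    const replica = Replica{ .status = .normal, .view = 7, .view_durable = 5 };
    const message_view: u32 = 7;

    // A check against view_durable would see the message as "ahead"
    // even though the status is no longer .recovering_head.
    try std.testing.expect(message_view > replica.view_durable);
    try std.testing.expect(replica.status != .recovering_head);
    // The check against the in-memory view does not.
    try std.testing.expect(!replica.message_is_ahead(message_view));
}
```

The second test mirrors the window described in the reply: the status has changed, but the persisted view has not caught up yet.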

matklad · 2024-03-15T15:14:45Z

> Should we update sync_view during superblock.checkpoint()? Then we would have the invariant that sync_view >= checkpoint.header.view.

Excellent point! We should at least reset it to zero once the sync is done. But I think it's best not to update it, and to keep it scoped strictly to state sync, rather than mix in both sync and non-sync code paths in a single field.
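
A minimal sketch of that scoping, under stated assumptions: `VsrState` here is a hypothetical three-field struct, the method names `sync_begin`, `sync_done`, and `checkpoint` are illustrative, and treating zero as "no sync in progress" is an assumed convention taken from the reply above.

```zig
const std = @import("std");

/// Hypothetical, simplified slice of the VSR state.
const VsrState = struct {
    checkpoint_op: u64 = 0,
    /// Zero means "no state sync in progress" (assumed convention).
    sync_view: u32 = 0,

    /// Sync starts: remember the view of the message that advertised the target.
    fn sync_begin(self: *VsrState, advertised_view: u32) void {
        self.sync_view = advertised_view;
    }

    /// Sync is done: clear the field so that it only describes an active sync.
    fn sync_done(self: *VsrState) void {
        self.sync_view = 0;
    }

    /// An ordinary checkpoint advances the checkpoint and leaves sync_view alone.
    fn checkpoint(self: *VsrState, op: u64) void {
        self.checkpoint_op = op;
    }
};

test "sync_view is touched only by the sync code path" {
    var state = VsrState{};
    state.sync_begin(7);
    try std.testing.expect(state.sync_view == 7);

    state.checkpoint(128);
    try std.testing.expect(state.sync_view == 7);

    state.sync_done();
    try std.testing.expect(state.sync_view == 0);
}
```

Keeping the field out of `checkpoint` means a reader only has to look at the sync path to reason about its value.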

sentientwaffle · 2024-03-15T15:23:50Z

src/vsr/superblock.zig

@@ -958,6 +970,7 @@ pub fn SuperBlockType(comptime Storage: type) type {
            vsr_state.commit_max = update.commit_max;


(A few lines above this) we can assert that update.sync_view >= update.checkpoint.header.view.

sentientwaffle · 2024-03-15T15:27:17Z

src/vsr/superblock.zig

@@ -863,6 +873,7 @@ pub fn SuperBlockType(comptime Storage: type) type {
            vsr_state.commit_max = update.commit_max;
            vsr_state.sync_op_min = update.sync_op_min;
            vsr_state.sync_op_max = update.sync_op_max;
+            vsr_state.sync_view = update.sync_view;


We could assert that update.sync_view == 0 or update.sync_view == superblock.staging.vsr_state.sync_view.
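
The two assertions suggested in these review comments could look roughly like the sketch below. `Update` is a hypothetical subset of the real update struct, `checkpoint_header_view` stands in for `update.checkpoint.header.view`, and the helper names are illustrative. Each assertion is meant for the update path its comment refers to, so the two are not expected to hold for the same update.

```zig
const std = @import("std");
const assert = std.debug.assert;

/// Hypothetical subset of the fields carried by a superblock update.
const Update = struct {
    sync_view: u32,
    /// Stands in for update.checkpoint.header.view.
    checkpoint_header_view: u32,
};

/// First suggestion: sync_view never lags the view in which the checkpoint was prepared.
fn assert_sync_view_covers_checkpoint(update: Update) void {
    assert(update.sync_view >= update.checkpoint_header_view);
}

/// Second suggestion: the update either clears sync_view or keeps the staged value.
fn assert_sync_view_cleared_or_unchanged(update: Update, staging_sync_view: u32) void {
    assert(update.sync_view == 0 or update.sync_view == staging_sync_view);
}

test "suggested assertions accept the expected updates" {
    // Checkpoint prepared in view 5, advertised as committed in view 7.
    assert_sync_view_covers_checkpoint(.{ .sync_view = 7, .checkpoint_header_view = 5 });

    // sync_view reset once sync is done.
    assert_sync_view_cleared_or_unchanged(.{ .sync_view = 0, .checkpoint_header_view = 5 }, 7);
    // sync_view carried over unchanged while sync is still in progress.
    assert_sync_view_cleared_or_unchanged(.{ .sync_view = 7, .checkpoint_header_view = 5 }, 7);
}
```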


matklad assigned sentientwaffle Mar 15, 2024

matklad force-pushed the matklad/sync-view branch from a2ffbb5 to 164e8a5 Compare March 15, 2024 12:03

sentientwaffle reviewed Mar 15, 2024

vsr: checkpoint header is eagerly checked to chain to the journal

e534dee

matklad force-pushed the matklad/sync-view branch 2 times, most recently from ba6e054 to 8c352c5 Compare March 15, 2024 15:13

matklad force-pushed the matklad/sync-view branch 2 times, most recently from 960f8c3 to 84bc115 Compare March 15, 2024 15:19

sentientwaffle reviewed Mar 15, 2024

matklad added 2 commits March 15, 2024 15:34

vopr: .recovering_head logic is synchronised between replica and simulator

25bdb6e

matklad force-pushed the matklad/sync-view branch from 84bc115 to 1323532 Compare March 15, 2024 15:34

vsr: reset sync_view when sync is done

10665e1

matklad force-pushed the matklad/sync-view branch from 1323532 to 10665e1 Compare March 15, 2024 15:45

sentientwaffle approved these changes Mar 15, 2024

matklad added this pull request to the merge queue Mar 15, 2024

Merged via the queue into main with commit 341af50 Mar 15, 2024
27 checks passed

matklad deleted the matklad/sync-view branch March 15, 2024 17:28
