Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vsr: sync uses correct view to go into recovering_head #1705

Merged
merged 4 commits into from
Mar 15, 2024
Merged

Conversation

matklad
Copy link
Member

@matklad matklad commented Mar 15, 2024

It might be the case that op_checkpoint:

  • is prepared in view X
  • is truncated in view X+1
  • is committed in view X+2

When deciding whether to go into recovering_head state, the replica
currently considers prepared view (checkpoint.header.view), and not the
committed view.

Use the correct view by:

  • Adding checkpoint's view into VSR State (but not checkpoint state:
    different replicas might commit prepares at different views)

  • Populating that view from the message that informed us about the
    checkpoint target

    • this requires some intricate logic on pings, to make sure they
      indeed propagate correct view for checkpoint --- a replica accepts a
      checkpoint before it transitions to its view, and it should
      subsequently correctly propagate this higher view.

      This works because checkpoint view is also durable.

Seed: 8593423301425288917
Closes: #1703

@matklad
Copy link
Member Author

matklad commented Mar 15, 2024

last commit fixes #1702 -- a semi related vopr seed that also trips on main

Copy link
Member

@sentientwaffle sentientwaffle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we update sync_view during superblock.checkpoint()? Then we would have the invariant that sync_view >= checkpoint.header.view.

assert(message.header.view <= self.view);
// Pings advertise checkpoints, and current checkpoint's view might be greater than
// the replica view.
if (message.header.view > self.view) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be > self.view_durable()? We might have just called transition_to_normal_from_recovering_head_status() so our view_durable is not yet updated, but our self.view is updated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. I believe that

            if (message.header.view > self.view_durable()) {
                assert(self.superblock.working.vsr_state.sync_view >
                    self.superblock.working.vsr_state.view);
            }

would hold, but that's an almost tautological assertion.

In contrast

            if (message.header.view > self.view_durable()) {
                assert(self.status == .recovering_head);
            }

would not hold -- transition_to_normal_from_recovering_head_status changes status, but leaves self.superblock.working.vsr_state intact until durable update finishes.

And

            if (message.header.view > self.view) {
                assert(self.status == .recovering_head);

Is the interesting case -- it ensures that we can't get out out .recovering_head state earlier than self.superblock.working.vsr_state.view

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh I get it; thank you.

@matklad matklad force-pushed the matklad/sync-view branch 2 times, most recently from ba6e054 to 8c352c5 Compare March 15, 2024 15:13
@matklad
Copy link
Member Author

matklad commented Mar 15, 2024

Should we update sync_view during superblock.checkpoint()? Then we would have the invariant that sync_view >= checkpoint.header.view.

Excellent point! We should at least reset it to zero once the sync is done. But I think it's best not to update it, and to keep it scoped strictly to state sync, rather than mix in both sync and non-sync code paths in a single field.

@matklad matklad force-pushed the matklad/sync-view branch 2 times, most recently from 960f8c3 to 84bc115 Compare March 15, 2024 15:19
@@ -958,6 +970,7 @@ pub fn SuperBlockType(comptime Storage: type) type {
vsr_state.commit_max = update.commit_max;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(A few lines above this) we can assert that update.sync_view >= update.checkpoint.header.view.

@@ -863,6 +873,7 @@ pub fn SuperBlockType(comptime Storage: type) type {
vsr_state.commit_max = update.commit_max;
vsr_state.sync_op_min = update.sync_op_min;
vsr_state.sync_op_max = update.sync_op_max;
vsr_state.sync_view = update.sync_view;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could assert that update.sync_view == 0 or update.sync_view == superblock.staging.vsr_state.sync_view.

matklad added 2 commits March 15, 2024 15:34
It might be the case that op_checkpoint:

* is prepared  in view X
* is truncated in view X+1
* is committed in view X+2

When deciding whether to go into recovering_head state, the replica
currently considers prepared view (checkpoint.header.view), and not the
committed view.

Use the correct view by:

* Adding checkpoint's view into VSR State (but _not_ checkpoint state:
  different replicas might commit prepares at different views)
* Populating that view from the message that informed us about the
  checkpoint target

  * this requires some intricate logic on pings, to make sure they
    indeed propagate correct view for checkpoint --- a replica accepts a
    checkpoint before it transitions to its view, and it should
    subsequently correctly propagate this higher view.

    This works because checkpoint view is also durable.

Seed: 8593423301425288917
Closes: #1703
@matklad matklad force-pushed the matklad/sync-view branch from 84bc115 to 1323532 Compare March 15, 2024 15:34
@matklad matklad force-pushed the matklad/sync-view branch from 1323532 to 10665e1 Compare March 15, 2024 15:45
@matklad matklad added this pull request to the merge queue Mar 15, 2024
Merged via the queue into main with commit 341af50 Mar 15, 2024
27 checks passed
@matklad matklad deleted the matklad/sync-view branch March 15, 2024 17:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Crash: 8593423301425288917
2 participants