Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vsr: don't repair potentially unavailable ops #1579

Merged
merged 2 commits into from
Feb 23, 2024
Merged

Conversation

matklad
Copy link
Member

@matklad matklad commented Feb 20, 2024

op_repair_min tries to be smart and optimistic and repair more, which
sometimes lead to unavailability, due to two issues:

  • primary tries to repair more than backups, but because backups repair
    less, those prepares might be unavailable! That is, we might suffer
    more than f failures for particular prepare over time, because it
    is not continuously repaired on the backups
  • after state sync, replicas simply do not have any prepares beyond
    checkpoint.

Simplify op_repair_min logic to always repair from the latest
"confirmed" (committed on top of) checkpoint.

That is, replicas right at the trigger repair the entire log wrap, in
addition to having the checkpoint. This ensures smooth hand off of
durability: prepares stop being repaired only when there's a quorum of
replicas at a checkpoint including the prepare.

Closes: #1378

@matklad matklad force-pushed the matklad/op-repair-min branch 4 times, most recently from 15c9d92 to b45c484 Compare February 23, 2024 13:15
@matklad matklad marked this pull request as ready for review February 23, 2024 13:18
@matklad matklad force-pushed the matklad/op-repair-min branch from b45c484 to f0ab527 Compare February 23, 2024 13:22
@matklad matklad added vopr and removed vopr labels Feb 23, 2024
@matklad matklad force-pushed the matklad/op-repair-min branch 2 times, most recently from 24177ca to b400adf Compare February 23, 2024 14:27
// And we don't need op_checkpoint's prepare/header in our log if we crash.
break :op self.op_checkpoint() + 1;
}
break :op (self.op_checkpoint() + 1) -| constants.vsr_checkpoint_interval;
}
};

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to add an extra assert here:

assert(self.prepare_max() - op < constants.journal_slot_count);

but that doesn't actually work for a funny reason:

  • it can be violated if your op_repair_min still uses the previous checkpoint, and prepare_max already picks the next one (that is, commit_max is right on the trigger).
  • but it actually isn't really violated --- when you receives that op_prepare_max, you advance your commit_max, which causes you to move op_repair_min by a checkpoint forward as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, but wait, I can assert that on_prepare instead!

op_repair_min tries to be smart and optimistic and repair _more_, which
sometimes lead to unavailability, due to two issues:

- primary tries to repair more than backups, but because backups repair
  less, those prepares might be unavailable! That is, we might suffer
  _more_ than `f` failures  for particular prepare over time, because it
  is not continuously repaired on the backups
- after state sync, replicas simply do not have any prepares beyond
  checkpoint.

Simplify op_repair_min logic to always repair from the latest
"confirmed" (committed on top of) checkpoint.

That is, replicas _right_ at the trigger repair the entire log wrap, in
addition to having the checkpoint. This ensures smooth hand off of
durability: prepares stop being repaired only when there's a quorum of
replicas at a checkpoint including the prepare.
@matklad matklad force-pushed the matklad/op-repair-min branch from b400adf to 47e9be8 Compare February 23, 2024 14:39
@matklad matklad added the vopr label Feb 23, 2024
Copy link
Member

@sentientwaffle sentientwaffle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code all LGTM!

// we are stuck waiting for a prepare or block which will never arrive.
// It is possible that we could in fact repair and progress anyway, but this is not
// guaranteed: the rest of the cluster no longer repairs the relevant prepares.
// We might be stuck waiting for a prepare or block which will never arrive.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// We might be stuck waiting for a prepare or block which will never arrive.
// We might be stuck waiting for a prepare or block which will never arrive.

(Extra space.)

/// - primaries do repair checkpointed ops so that they can help lagging backups catch up,
/// as long as the op is new enough to be present in the WAL of some other replica and
/// is from the previous checkpoint.
/// Safety condition: repairing an old op must not overwrite a newer op from the next wrap.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Safety condition: repairing an old op must not overwrite a newer op from the next wrap.
/// Safety condition: repairing an old op must not overwrite a newer op from the next wrap.

Extra space

/// it in a quorum of checkpoints.
///
/// If op=trigger+1 is committed, the corresponding checkpoint is confirmed to be present on
/// a quorum of replicas. Repairig all ops since the latest confirmed checkpoint satisfies
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// a quorum of replicas. Repairig all ops since the latest confirmed checkpoint satisfies
/// a quorum of replicas. Repairing all ops since the latest confirmed checkpoint satisfies

@matklad matklad removed the vopr label Feb 23, 2024
@matklad matklad force-pushed the matklad/op-repair-min branch from 129cd26 to 1f39a39 Compare February 23, 2024 15:26
@matklad matklad enabled auto-merge February 23, 2024 15:26
@matklad matklad added this pull request to the merge queue Feb 23, 2024
Merged via the queue into main with commit 4a893ec Feb 23, 2024
27 checks passed
@matklad matklad deleted the matklad/op-repair-min branch February 23, 2024 15:52
matklad added a commit that referenced this pull request Feb 24, 2024
If op=op_checkpoint, but we also know that the checkpoint is durable, we
don't need to repair the op, we can proceed directly to op+1. But
repairs can't advance the op, so we special case this and "pretend" that
we are rearing "op".

As soon as we hear from view's primary, we'll advance our `op` past
`op_checkpoint` and stop pretending that we want to repair `op`.

This pre-existing logic that was erronously removed in #1579

Seed: 15816179864814849519
Closes: #1588
/// Availability condition: each committed op must be present either in a quorum of WALs or
/// it in a quorum of checkpoints.
///
/// If op=trigger+1 is committed, the corresponding checkpoint is durably present on
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, but I think this is wrong after #1508!

The condition is now op=prepare_max + 1, not op=trigger + 1 I think!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow primaries with a dirty journal
2 participants