
vsr: increase prepare durability near checkpoint boundary #1346

Merged
merged 1 commit into main from matklad/checkpoint-interval on Dec 20, 2023

Conversation

matklad (Member) commented Dec 7, 2023

This fixes the following simulator failure:

$ git reset --hard 87cace1
$ ./zig/zig build simulator_run -Dsimulator-state-machine=testing -- 9854840972719737644

This is what is happening:

  • two replicas committed checkpoint trigger (op=59) and prepared the next op=60.
  • two replicas are far behind, want to state sync, and need a canonical checkpoint.
  • one replica has op=28 corrupted in the log and wants to repair it.

The key issue is that preparing op=60 already pushes op=28 out of the log, but doesn't yet make the checkpoint canonical (as op=60 isn't committed). As a result:

  • the two lagging replicas can't sync, because there's no canonical checkpoint.
  • the repairing replica can't catch up, because the corrupted op is missing from the cluster.
  • the two up-to-date replicas can't commit op=60, because there isn't a prepare quorum with a fresh checkpoint.

Taking a step back, the core issue here is that we don't have enough physical durability for op=28. Logically, it is present on three replicas: once in a WAL (where it got corrupted) and twice as data in the checkpoint. However, as these are two different representations, the cluster is not always capable of reconciling them.

It is useful to consider a different scenario of corruptions here: suppose that a single grid block is corrupted on both checkpoints, and some prepare is corrupted in the log of a catching-up replica. There are three corruptions, but as only two are correlated, the cluster should be live. However, in this situation, effectively any corruption in the grid "intersects" with any corruption in the WAL.

The solution here is to extend the durability of prepares, such that we only start overwriting prepares from the old checkpoint once the new one is confirmed canonical. Before, the WAL looked like this:

```
    |<-checkpoint->|<-lsm_bar->|
                               ^
                            trigger
```

We shorten the checkpoint part to make sure we can fit an extra pipeline's worth of messages in:

```
    |<-checkpoint->|<-lsm_bar->|<-pipeline->|
                               ^
                            trigger
```

This gives us the following invariant:

A prepare can only be evicted from the WAL once there's a quorum of checkpoints covering this prepare.
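
For concreteness, here is a minimal sketch of the layout arithmetic, assuming (per the diagram) that the checkpoint region, the lsm bar, and one pipeline exactly partition the WAL. The toy values below mirror the test-sized diagrams discussed in the review; `pipeline_prepare_queue_max` is an illustrative name for the pipeline bound, and the real constants live elsewhere in the codebase:

```zig
const std = @import("std");

// Toy values for illustration only (not the real constants):
const journal_slot_count = 9; // WAL capacity, in prepares.
const lsm_batch_multiple = 2; // ops between a checkpoint and its trigger (the "lsm bar").
const pipeline_prepare_queue_max = 1; // prepares that may exist beyond the trigger.

// The checkpoint part is shortened so that a checkpoint's worth of prepares, the lsm
// bar, *and* one pipeline of uncommitted prepares all fit in the WAL simultaneously.
// Preparing ops past the trigger therefore never overwrites prepares covered only by
// the previous (not yet canonical) checkpoint.
const vsr_checkpoint_interval =
    journal_slot_count - lsm_batch_multiple - pipeline_prepare_queue_max;

test "checkpoint + lsm_bar + pipeline partition the WAL" {
    try std.testing.expectEqual(6, vsr_checkpoint_interval);
    try std.testing.expectEqual(
        journal_slot_count,
        vsr_checkpoint_interval + lsm_batch_multiple + pipeline_prepare_queue_max,
    );
}
```

With these toy values, checkpoints land on ops 5, 11, 17, ... and triggers on 7, 13, 19, ..., matching the tables further down in this thread.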

```zig
if (normal) {
    // Case A) all replicas are normal, but commit_max is exactly checkpoint
    // trigger.
    return vsr.Checkpoint.trigger_for_checkpoint(checkpoint_max) == commit_max;
```
Member Author

I spent an embarrassing amount of time trying to find a bug in my changes that makes the following fail:

zig/zig build simulator_run -Dsimulator-state-machine=testing -Dsimulator-log=short -Doptimize=ReleaseSafe -- 15383775152417957975

only to realize that this is a pre-existing failure!

Before, we were only checking for a missing checkpoint in view change, but it could be the case that, in normal status, one replica is ahead and is at the trigger.

@matklad matklad force-pushed the matklad/checkpoint-interval branch from 28a764b to 9a735b1 on December 7, 2023 14:31
@matklad matklad marked this pull request as ready for review December 7, 2023 14:33
Comment on lines 640 to 641
// B) During view change, there is a canonical checkpoint, but replicas can't learn about
// because commit messages don't flow during view change.
Member

Suggested change
// B) During view change, there is a canonical checkpoint, but replicas can't learn about
// because commit messages don't flow during view change.
// B) During view change, there is a canonical checkpoint, but replicas can't learn about it
// because commit messages don't flow during view change.

Typo. But also -- is this correct? Checkpoint ids are on ping messages too.

Member Author

Replicas learn the checkpoint id itself, but they can't learn that it is canonical, because they don't know commit_max.

Member

If there are enough replicas on the new checkpoint though, then we don't need to know commit_max:

```zig
if (candidates_matching >= candidates_threshold) {
    // Fallback case:
    // In an idle cluster, where the last commit landed on a checkpoint boundary,
    // we still want to be able to sync a lagging replica.
    log.debug("{}: on_{s}: jump_sync_target: " ++
        "candidate checkpoint is canonical (quorum: {}/{})", .{
        self.replica,
        @tagName(header.command),
        candidates_matching,
        candidates_threshold,
    });
    break :canonical true;
}
```

Member Author

Adjusted comment:

  • there aren't enough replicas in the core at a given checkpoint, so they can't prove the checkpoint canonical from pings,
  • and they also can't learn about canonicity from commit messages, because there aren't any.

src/simulator.zig (outdated review thread, resolved)
src/vsr/replica.zig (outdated review thread, resolved)
Comment on lines +4018 to +4050
```zig
self.op_checkpoint_next_trigger() -| constants.vsr_checkpoint_interval,
self.op_checkpoint_next_trigger() -| constants.vsr_checkpoint_interval * 2,
```
Member

So much more readable! ❤️
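
For concreteness, with the toy checkpoint interval of 6 from the tables in this thread, these two expressions evaluate to the two previous triggers (the current checkpoint's and the one before it); `-|` is Zig's saturating subtraction, so they clamp at zero early in the log's life. A quick check with hypothetical local values standing in for the method call:

```zig
const std = @import("std");

test "previous triggers via saturating subtraction" {
    const vsr_checkpoint_interval: u64 = 6; // toy value, not the real constant
    const op_checkpoint_next_trigger: u64 = 19; // e.g. when op_checkpoint == 11, per the table below
    // The two expressions from the diff, with the method call replaced by a local:
    try std.testing.expectEqual(@as(u64, 13), op_checkpoint_next_trigger -| vsr_checkpoint_interval);
    try std.testing.expectEqual(@as(u64, 7), op_checkpoint_next_trigger -| vsr_checkpoint_interval * 2);
    // Near the start of the log, the subtraction saturates at zero:
    try std.testing.expectEqual(@as(u64, 0), @as(u64, 7) -| vsr_checkpoint_interval * 2);
}
```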

src/vsr.zig Outdated
```diff
  // First wrap: op_checkpoint_next = 8-2-1 = 5
- break :op constants.journal_slot_count - constants.lsm_batch_multiple - 1;
+ break :op constants.vsr_checkpoint_interval - 1;
  } else {
  // Second wrap: op_checkpoint_next = 5+8-2 = 11
  // Third wrap: op_checkpoint_next = 11+8-2 = 17
```
Member

Can you update these "First/Second/Third wrap" notes to include the pipeline?
Also the diagram above `pub const Checkpoint`.

```
@@ -1293,7 +1293,8 @@ const ViewChangeHeadersArray = struct {
    assert(headers.command == .start_view);
    // This function is only called by a replica that is lagging behind the primary's
    // checkpoint, so the start_view has a full suffix of headers.
    assert(headers.array.get(0).op >= constants.journal_slot_count);
```
Member Author

TBH, I don't fully understand this assert. Is it checking that, if the primary is on the "next" checkpoint, then we have at least checkpoint+trigger worth of messages?

But is this actually correct? The first checkpoint is a bit shorter, due to - 1, no?

Member

Thinking out loud:

`start_view_into_do_view_change` is called by a replica that is transitioning from normal to view_change, but is lagging behind the primary by at least a checkpoint, and the SV it has from the old view is from that newer checkpoint.

Using the checkpoint diagram as a mental model:

```
///     commit log (ops)           │ write-ahead log (slots)
///     0   4   8   2   6   0   4  │ 0---4---
///   0 ─────✓·%                   │ 01234✓6%   initial log fill
///   1 ───────────✓·%             │ 890✓2%45   first wrap of log
///   2 ─────────────────✓·%       │ 6✓8%0123   second wrap of log
///   3 ───────────────────────✓·% │ 4%67890✓   third wrap of log
```
  • In the first wrap, that means that our `self.op < 8`, so `headers.array.get(0).op >= 8`. 8 - 1 is our trigger, so it is the highest op that could possibly be in our log.
  • So I think that the old assert was correct.

Member Author

Got it! That -1 in the computation of the first checkpoint threw me for a loop. I was thinking we have some fancy logic where op=0 doesn't actually go into the log, because it's a root op or something (so, the first checkpoint is physically shorter than the rest). But that -1 is just a count <-> index conversion --- the first checkpoint has exactly checkpoint_interval ops (like all the other checkpoints), the first op number is 0, and the last op number is len - 1.
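
A minimal sketch of that count <-> index arithmetic, using the toy interval of 6 from the diagrams in this thread (`op_checkpoint_after` is a hypothetical helper, not the real function):

```zig
const std = @import("std");

const checkpoint_interval = 6; // toy value from the diagrams in this thread

// Hypothetical helper: every checkpoint spans exactly checkpoint_interval ops; the
// `- 1` for the first checkpoint is only the count <-> index conversion (ops 0..5).
fn op_checkpoint_after(op_checkpoint: u64) u64 {
    if (op_checkpoint == 0) return checkpoint_interval - 1;
    return op_checkpoint + checkpoint_interval;
}

test "every checkpoint covers checkpoint_interval ops" {
    try std.testing.expectEqual(@as(u64, 5), op_checkpoint_after(0));
    try std.testing.expectEqual(@as(u64, 11), op_checkpoint_after(5));
    try std.testing.expectEqual(@as(u64, 17), op_checkpoint_after(11));
}
```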

@matklad matklad force-pushed the matklad/checkpoint-interval branch 5 times, most recently from 232d936 to eda53b8 on December 18, 2023 16:40
```zig
}
return self.superblock.working.checkpoint_id();
```
Member Author

I think this wasn't entirely correct, actually! For ops far in the future, we'd use the current checkpoint id, rather than null.

@matklad matklad force-pushed the matklad/checkpoint-interval branch from eda53b8 to f0dfd19 on December 18, 2023 16:48
```
///
/// checkpoint() call           0   1   2   3
/// op_checkpoint               0   5  11  17
/// op_checkpoint_next          5  11  17  23
/// op_checkpoint_next_trigger  7  13  19  25
///
///     commit log (ops)           │ write-ahead log (slots)
///     0   4   8   2   6   0   4  │ 0---4---
///   0 ─────✓·%                   │ 01234✓6%   initial log fill
///   1 ───────────✓·%             │ 890✓2%45   first wrap of log
```
Member Author

It was a bit confusing how, on the right, we show a full wrap of the log (up to op=15), while on the left we only go up to 13. In the new version, I made the left and right show the same data.

```zig
} else {
    // Second wrap: op_checkpoint_next = 5+8-2 = 11
    // Third wrap: op_checkpoint_next = 11+8-2 = 17
```
Member Author

I figured it would be less confusing not to decompose the checkpoint interval. After all, it is morally a parameter, and can be set arbitrarily, as long as it allows at least a trigger plus a pipeline of slack.

@matklad matklad force-pushed the matklad/checkpoint-interval branch from f0dfd19 to 626d1fe on December 18, 2023 17:10
@matklad matklad force-pushed the matklad/checkpoint-interval branch from 626d1fe to 6cdc904 on December 18, 2023 17:32
src/vsr.zig Outdated
Comment on lines 1350 to 1354
```
///     0   4   8   2   6   0   4  │ 0 - - - 4 - - - 8
///   0 ─────✓·%                   │[0 1 2 3 4 ✓]6 % R
///   1 ───────────✓·%             │ 9 0 ✓]2 % 5[6 7 8
///   2 ─────────────────✓·%       │ 8 % 1[2 3 4 5 6 ✓]
///   3 ───────────────────────✓·% │[8 9 0 1 2 ✓]4 % 7
```
Member

Numbering + brackets are off, I think?:

Suggested change

```diff
-///     0   4   8   2   6   0   4  │ 0 - - - 4 - - - 8
-///   0 ─────✓·%                   │[0 1 2 3 4 ✓]6 % R
-///   1 ───────────✓·%             │ 9 0 ✓]2 % 5[6 7 8
-///   2 ─────────────────✓·%       │ 8 % 1[2 3 4 5 6 ✓]
-///   3 ───────────────────────✓·% │[8 9 0 1 2 ✓]4 % 7
+///     0   4   8   2   6   0   4  │ 0 - - - 4 - - - 8
+///   0 ─────✓·%                   │ 0 1 2 3 4 ✓]6 % R
+///   1 ───────────✓·%             │ 9 0 ✓]2 % 4 5 6 ✓]
+///   2 ─────────────────✓·%       │ 8 % 0 1 2 ✓]4 % 6
+///   3 ───────────────────────✓·% │ 7 8 ✓]0 % 2 3 4 ✓]
```

Member Author

I think it is correct, but there's a high chance we might be reading the chart differently, as I find the original version confusing.

E.g., in the second row/checkpoint, when I write `9 0 ✓]2 % 5[6 7 8`, that 5 is correct --- it is a 5 (not 15) from the first checkpoint, which we haven't yet overwritten. And the checkpoint is [6 7 8 9 10 11].

I think the original version of the chart on the right showed full log wraps, with all entries overwritten, but I think that is confusing --- we have full wraps on the right, but full checkpoints on the left.

Now that I think about it, there's no reason to compress 10 to just 0 on the right --- there's space to write full ops, ptal!

Member

Ahh I understand now, thank you! Yes, we were reading the chart differently, but your version is much more useful than mine 🎉

@matklad matklad force-pushed the matklad/checkpoint-interval branch from 6cdc904 to 5ad4bfa on December 19, 2023 13:22
@matklad matklad enabled auto-merge December 19, 2023 15:15
@matklad matklad force-pushed the matklad/checkpoint-interval branch from 5ad4bfa to a88f896 on December 19, 2023 16:51
@matklad matklad added this pull request to the merge queue Dec 20, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 20, 2023
@matklad matklad added this pull request to the merge queue Dec 20, 2023
Merged via the queue into main with commit b4caaa5 Dec 20, 2023
27 checks passed
@matklad matklad deleted the matklad/checkpoint-interval branch December 20, 2023 16:34
sentientwaffle added a commit that referenced this pull request Dec 22, 2023
Thanks to #1346, the first ops prepared immediately after a checkpoint no longer overwrite prepares from the previous checkpoint, which neatly solves this issue (without needing async checkpoints)!

sentientwaffle added a commit that referenced this pull request Jan 18, 2024
Thanks to #1346, there is no risk in a R=2 cluster of the leading replica overwriting a prepare required by the lagging replica. This workaround is not needed anymore!

(Removing the `"Cluster: sync: checkpoint diverges, sync (primary diverges)"` test in a separate commit to emphasize that the test still passes, it just isn't really needed anymore since it doesn't exercise anything interesting anymore.)

sentientwaffle added a commit that referenced this pull request Jan 18, 2024
Thanks to #1346, this test doesn't actually test anything interesting.