
vsr: increase prepare durability near checkpoint boundary #1346

Merged
merged 1 commit into main from matklad/checkpoint-interval on Dec 20, 2023

Conversation

matklad (Member) commented Dec 7, 2023

This fixes the following simulator failure:

$ git reset --hard 87cace1
$ ./zig/zig build simulator_run -Dsimulator-state-machine=testing -- 9854840972719737644

This is what is happening:

  • two replicas committed checkpoint trigger (op=59) and prepared the next op=60.
  • two replicas are far behind, want to state sync, and need a canonical checkpoint.
  • one replica has op=28 corrupted in the log and wants to repair it.

The key issue is that preparing op=60 already pushes op=28 out of the log, but doesn't yet make the checkpoint canonical (as op=60 isn't committed). As a result:

  • the two lagging replicas can't sync, because there's no canonical checkpoint.
  • the repairing replica can't catch up, because the corrupted op is missing from the cluster.
  • the two up-to-date replicas can't commit op=60, because there isn't a prepare quorum with a fresh checkpoint.

Taking a step back, the core issue here is that we don't have enough physical durability for op=28. Logically, it is present on three replicas: once in a WAL (where it got corrupted) and twice as data in the checkpoint. However, as these are two different representations, the cluster is not always capable of reconciling them.

It is useful to consider a different scenario of corruptions here: suppose that a single grid block is corrupted on both checkpoints, and some prepare is corrupted in the log of a catching-up replica. There are three corruptions, but as only two are correlated, the cluster should be live. However, in this situation, effectively any corruption in the grid "intersects" with any corruption in the WAL.

The solution here is to extend the durability of prepares, such that we only start overwriting prepares from the old checkpoint once the new one is confirmed canonical. Before, the WAL looked like this:

```
    |<-checkpoint->|<-lsm_bar->|
                               ^
                            trigger
```

We shorten the checkpoint part to make sure we can fit an extra pipeline's worth of messages in:

```
    |<-checkpoint->|<-lsm_bar->|<-pipeline->|
                               ^
                            trigger
```

This gives us the following invariant:

A prepare can only be evicted from the WAL once there's a quorum of checkpoints covering this prepare.
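
For concreteness, here is a minimal sketch of the layout arithmetic, assuming (per the diagram) that the checkpoint region, the lsm bar, and one pipeline exactly partition the WAL. The toy values below mirror the test-sized diagrams discussed in the review; `pipeline_prepare_queue_max` is an illustrative name for the pipeline bound, and the real constants live elsewhere in the codebase:

```zig
const std = @import("std");

// Toy values for illustration only (not the real constants):
const journal_slot_count = 9; // WAL capacity, in prepares.
const lsm_batch_multiple = 2; // ops between a checkpoint and its trigger (the "lsm bar").
const pipeline_prepare_queue_max = 1; // prepares that may exist beyond the trigger.

// The checkpoint part is shortened so that a checkpoint's worth of prepares, the lsm
// bar, *and* one pipeline of uncommitted prepares all fit in the WAL simultaneously.
// Preparing ops past the trigger therefore never overwrites prepares covered only by
// the previous (not yet canonical) checkpoint.
const vsr_checkpoint_interval =
    journal_slot_count - lsm_batch_multiple - pipeline_prepare_queue_max;

test "checkpoint + lsm_bar + pipeline partition the WAL" {
    try std.testing.expectEqual(6, vsr_checkpoint_interval);
    try std.testing.expectEqual(
        journal_slot_count,
        vsr_checkpoint_interval + lsm_batch_multiple + pipeline_prepare_queue_max,
    );
}
```

With these toy values, checkpoints land on ops 5, 11, 17, ... and triggers on 7, 13, 19, ..., matching the tables further down in this thread.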

```zig
if (normal) {
    // Case A) all replicas are normal, but commit_max is exactly checkpoint
    // trigger.
    return vsr.Checkpoint.trigger_for_checkpoint(checkpoint_max) == commit_max;
```
Member Author

I spent an embarrassing amount of time trying to find a bug in my changes that makes the following fail:

zig/zig build simulator_run -Dsimulator-state-machine=testing -Dsimulator-log=short -Doptimize=ReleaseSafe -- 15383775152417957975

only to realize that this is a pre-existing failure!

Before, we were only checking for a missing checkpoint in view change, but it could be the case that, in normal status, one replica is ahead and is at the trigger.

@matklad matklad force-pushed the matklad/checkpoint-interval branch from 28a764b to 9a735b1 on December 7, 2023 14:31
@matklad matklad marked this pull request as ready for review December 7, 2023 14:33
Comment on lines 640 to 641
// B) During view change, there is a canonical checkpoint, but replicas can't learn about
// because commit messages don't flow during view change.
Member

Suggested change
// B) During view change, there is a canonical checkpoint, but replicas can't learn about
// because commit messages don't flow during view change.
// B) During view change, there is a canonical checkpoint, but replicas can't learn about it
// because commit messages don't flow during view change.

Typo. But also -- is this correct? Checkpoint ids are on ping messages too.

Member Author

Replicas learn the checkpoint id itself, but they can't learn that it is canonical, because they don't know commit_max.

Member

If there are enough replicas on the new checkpoint though, then we don't need to know commit_max:

```zig
if (candidates_matching >= candidates_threshold) {
    // Fallback case:
    // In an idle cluster, where the last commit landed on a checkpoint boundary,
    // we still want to be able to sync a lagging replica.
    log.debug("{}: on_{s}: jump_sync_target: " ++
        "candidate checkpoint is canonical (quorum: {}/{})", .{
        self.replica,
        @tagName(header.command),
        candidates_matching,
        candidates_threshold,
    });
    break :canonical true;
}
```

Member Author

Adjusted comment:

  • there aren't enough replicas in the core at a given checkpoint, so they can't prove the checkpoint canonical from pings,
  • and they also can't learn about canonicity from commit messages, because there aren't any.

src/simulator.zig (outdated review thread, resolved)
src/vsr/replica.zig (outdated review thread, resolved)
Comment on lines +4018 to +4050
```zig
self.op_checkpoint_next_trigger() -| constants.vsr_checkpoint_interval,
self.op_checkpoint_next_trigger() -| constants.vsr_checkpoint_interval * 2,
```
Member

So much more readable! ❤️
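
For concreteness, with the toy checkpoint interval of 6 from the tables in this thread, these two expressions evaluate to the two previous triggers (the current checkpoint's and the one before it); `-|` is Zig's saturating subtraction, so they clamp at zero early in the log's life. A quick check with hypothetical local values standing in for the method call:

```zig
const std = @import("std");

test "previous triggers via saturating subtraction" {
    const vsr_checkpoint_interval: u64 = 6; // toy value, not the real constant
    const op_checkpoint_next_trigger: u64 = 19; // e.g. when op_checkpoint == 11, per the table below
    // The two expressions from the diff, with the method call replaced by a local:
    try std.testing.expectEqual(@as(u64, 13), op_checkpoint_next_trigger -| vsr_checkpoint_interval);
    try std.testing.expectEqual(@as(u64, 7), op_checkpoint_next_trigger -| vsr_checkpoint_interval * 2);
    // Near the start of the log, the subtraction saturates at zero:
    try std.testing.expectEqual(@as(u64, 0), @as(u64, 7) -| vsr_checkpoint_interval * 2);
}
```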

src/vsr.zig Outdated
```diff
  // First wrap: op_checkpoint_next = 8-2-1 = 5
- break :op constants.journal_slot_count - constants.lsm_batch_multiple - 1;
+ break :op constants.vsr_checkpoint_interval - 1;
  } else {
  // Second wrap: op_checkpoint_next = 5+8-2 = 11
  // Third wrap: op_checkpoint_next = 11+8-2 = 17
```
Member

Can you update these "First/Second/Third wrap" notes to include the pipeline?
Also the diagram above `pub const Checkpoint`.

```
@@ -1293,7 +1293,8 @@ const ViewChangeHeadersArray = struct {
    assert(headers.command == .start_view);
    // This function is only called by a replica that is lagging behind the primary's
    // checkpoint, so the start_view has a full suffix of headers.
    assert(headers.array.get(0).op >= constants.journal_slot_count);
```
Member Author

TBH, I don't fully understand this assert. Is it checking that, if the primary is on the "next" checkpoint, then we have at least checkpoint+trigger worth of messages?

But is this actually correct? The first checkpoint is a bit shorter, due to - 1, no?

Member

Thinking out loud:

`start_view_into_do_view_change` is called by a replica that is transitioning from normal to view_change, but is lagging behind the primary by at least a checkpoint, and the SV it has from the old view is from that newer checkpoint.

Using the checkpoint diagram as a mental model:

```
///     commit log (ops)           │ write-ahead log (slots)
///     0   4   8   2   6   0   4  │ 0---4---
///   0 ─────✓·%                   │ 01234✓6%   initial log fill
///   1 ───────────✓·%             │ 890✓2%45   first wrap of log
///   2 ─────────────────✓·%       │ 6✓8%0123   second wrap of log
///   3 ───────────────────────✓·% │ 4%67890✓   third wrap of log
```
  • In the first wrap, that means that our `self.op < 8`, so `headers.array.get(0).op >= 8`. 8 - 1 is our trigger, so it is the highest op that could possibly be in our log.
  • So I think that the old assert was correct.

Member Author

Got it! That -1 in the computation of the first checkpoint threw me for a loop. I was thinking we have some fancy logic where op=0 doesn't actually go into the log, because it's a root op or something (so, the first checkpoint is physically shorter than the rest). But that -1 is just a count <-> index conversion --- the first checkpoint has exactly checkpoint_interval ops (like all the other checkpoints), the first op number is 0, and the last op number is len - 1.
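
A minimal sketch of that count <-> index arithmetic, using the toy interval of 6 from the diagrams in this thread (`op_checkpoint_after` is a hypothetical helper, not the real function):

```zig
const std = @import("std");

const checkpoint_interval = 6; // toy value from the diagrams in this thread

// Hypothetical helper: every checkpoint spans exactly checkpoint_interval ops; the
// `- 1` for the first checkpoint is only the count <-> index conversion (ops 0..5).
fn op_checkpoint_after(op_checkpoint: u64) u64 {
    if (op_checkpoint == 0) return checkpoint_interval - 1;
    return op_checkpoint + checkpoint_interval;
}

test "every checkpoint covers checkpoint_interval ops" {
    try std.testing.expectEqual(@as(u64, 5), op_checkpoint_after(0));
    try std.testing.expectEqual(@as(u64, 11), op_checkpoint_after(5));
    try std.testing.expectEqual(@as(u64, 17), op_checkpoint_after(11));
}
```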

@matklad matklad force-pushed the matklad/checkpoint-interval branch 5 times, most recently from 232d936 to eda53b8 on December 18, 2023 16:40
```zig
}
return self.superblock.working.checkpoint_id();
```
Member Author

I think this wasn't entirely correct, actually! For ops far in the future, we'd use the current checkpoint id, rather than null.

@matklad matklad force-pushed the matklad/checkpoint-interval branch from eda53b8 to f0dfd19 on December 18, 2023 16:48
```
///
/// checkpoint() call           0   1   2   3
/// op_checkpoint               0   5  11  17
/// op_checkpoint_next          5  11  17  23
/// op_checkpoint_next_trigger  7  13  19  25
///
///     commit log (ops)           │ write-ahead log (slots)
///     0   4   8   2   6   0   4  │ 0---4---
///   0 ─────✓·%                   │ 01234✓6%   initial log fill
///   1 ───────────✓·%             │ 890✓2%45   first wrap of log
```
Member Author

It was a bit confusing how, on the right, we show a full wrap of the log (up to op=15), while on the left we only go up to 13. In the new version, I made the left and right show the same data.

```zig
} else {
    // Second wrap: op_checkpoint_next = 5+8-2 = 11
    // Third wrap: op_checkpoint_next = 11+8-2 = 17
```
Member Author

I figured it would be less confusing not to decompose the checkpoint interval. After all, it is morally a parameter, and can be set arbitrarily, as long as it allows at least a trigger plus a pipeline of slack.

@matklad matklad force-pushed the matklad/checkpoint-interval branch from f0dfd19 to 626d1fe on December 18, 2023 17:10
@matklad matklad force-pushed the matklad/checkpoint-interval branch from 626d1fe to 6cdc904 on December 18, 2023 17:32
src/vsr.zig Outdated
Comment on lines 1350 to 1354
```
///     0   4   8   2   6   0   4  │ 0 - - - 4 - - - 8
///   0 ─────✓·%                   │[0 1 2 3 4 ✓]6 % R
///   1 ───────────✓·%             │ 9 0 ✓]2 % 5[6 7 8
///   2 ─────────────────✓·%       │ 8 % 1[2 3 4 5 6 ✓]
///   3 ───────────────────────✓·% │[8 9 0 1 2 ✓]4 % 7
```
Member

Numbering + brackets are off, I think?:

Suggested change

```diff
-///     0   4   8   2   6   0   4  │ 0 - - - 4 - - - 8
-///   0 ─────✓·%                   │[0 1 2 3 4 ✓]6 % R
-///   1 ───────────✓·%             │ 9 0 ✓]2 % 5[6 7 8
-///   2 ─────────────────✓·%       │ 8 % 1[2 3 4 5 6 ✓]
-///   3 ───────────────────────✓·% │[8 9 0 1 2 ✓]4 % 7
+///     0   4   8   2   6   0   4  │ 0 - - - 4 - - - 8
+///   0 ─────✓·%                   │ 0 1 2 3 4 ✓]6 % R
+///   1 ───────────✓·%             │ 9 0 ✓]2 % 4 5 6 ✓]
+///   2 ─────────────────✓·%       │ 8 % 0 1 2 ✓]4 % 6
+///   3 ───────────────────────✓·% │ 7 8 ✓]0 % 2 3 4 ✓]
```

Member Author

I think it is correct, but there's a high chance we might be reading the chart differently, as I find the original version confusing.

E.g., in the second row/checkpoint, when I write `9 0 ✓]2 % 5[6 7 8`, that 5 is correct --- it is a 5 (not 15) from the first checkpoint, which we haven't yet overwritten. And the checkpoint is [6 7 8 9 10 11].

I think the original version of the chart on the right showed full log wraps, with all entries overwritten, but I think that is confusing --- we have full wraps on the right, but full checkpoints on the left.

Now that I think about it, there's no reason to compress 10 to just 0 on the right --- there's space to write full ops, ptal!

Member

Ahh I understand now, thank you! Yes, we were reading the chart differently, but your version is much more useful than mine 🎉

@matklad matklad force-pushed the matklad/checkpoint-interval branch from 6cdc904 to 5ad4bfa on December 19, 2023 13:22
@matklad matklad enabled auto-merge December 19, 2023 15:15
@matklad matklad force-pushed the matklad/checkpoint-interval branch from 5ad4bfa to a88f896 on December 19, 2023 16:51
@matklad matklad added this pull request to the merge queue Dec 20, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 20, 2023
@matklad matklad added this pull request to the merge queue Dec 20, 2023
Merged via the queue into main with commit b4caaa5 Dec 20, 2023
27 checks passed
@matklad matklad deleted the matklad/checkpoint-interval branch December 20, 2023 16:34
sentientwaffle added a commit that referenced this pull request Dec 22, 2023
Thanks to #1346, the first ops prepared immediately after a checkpoint no longer overwrite prepares from the previous checkpoint, which neatly solves this issue (without needing async checkpoints)!

sentientwaffle added a commit that referenced this pull request Jan 18, 2024
Thanks to #1346, there is no risk in a R=2 cluster of the leading replica overwriting a prepare required by the lagging replica. This workaround is not needed anymore!

(Removing the `"Cluster: sync: checkpoint diverges, sync (primary diverges)"` test in a separate commit to emphasize that the test still passes, it just isn't really needed anymore since it doesn't exercise anything interesting anymore.)

sentientwaffle added a commit that referenced this pull request Jan 18, 2024
Thanks to #1346, this test doesn't actually test anything interesting.