vsr: increase prepare durability near checkpoint boundary #1346
Conversation
if (normal) {
    // Case A) all replicas are normal, but commit_max is exactly checkpoint
    // trigger.
    return vsr.Checkpoint.trigger_for_checkpoint(checkpoint_max) == commit_max;
I spent an embarrassing amount of time trying to find a bug in my changes that fails
zig/zig build simulator_run -Dsimulator-state-machine=testing -Dsimulator-log=short -Doptimize=ReleaseSafe -- 15383775152417957975
only to realize that this is a pre-existing failure!
Before, we only checked for a missing checkpoint during view change, but it can also happen in normal status that one replica is ahead and sits exactly at the trigger.
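For reference, a minimal standalone sketch of case A, mirroring the quoted check. The `lsm_batch_multiple` value and the helper names here are illustrative (the real constants live in src/constants.zig, and the trigger is assumed to be checkpoint + lsm_batch_multiple, consistent with the checkpoint table later in this thread):

```zig
const std = @import("std");

// Illustrative value; the real constant lives in src/constants.zig.
const lsm_batch_multiple = 2;

// Assumption: the trigger is one lsm bar past the checkpoint op.
fn trigger_for_checkpoint(checkpoint: u64) u64 {
    return checkpoint + lsm_batch_multiple;
}

// Case A from the quoted snippet: every replica is in normal status, but
// commit_max sits exactly at the trigger of the highest checkpoint.
fn case_a(checkpoint_max: u64, commit_max: u64) bool {
    return trigger_for_checkpoint(checkpoint_max) == commit_max;
}

test "case A fires exactly at the trigger" {
    try std.testing.expect(case_a(5, 7));
    try std.testing.expect(!case_a(5, 6));
    try std.testing.expect(!case_a(5, 8));
}
```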
Force-pushed from 28a764b to 9a735b1.
src/simulator.zig (Outdated)
// B) During view change, there is a canonical checkpoint, but replicas can't learn about
// because commit messages don't flow during view change.
Suggested change:
// B) During view change, there is a canonical checkpoint, but replicas can't learn about it
// because commit messages don't flow during view change.
Typo. But also -- is this correct? Checkpoint ids are on ping messages too.
Replicas learn the checkpoint id itself, but they can't learn that it is canonical, because they don't know commit_max.
If there are enough replicas on the new checkpoint though, then we don't need to know commit_max:
if (candidates_matching >= candidates_threshold) {
    // Fallback case:
    // In an idle cluster, where the last commit landed on a checkpoint boundary,
    // we still want to be able to sync a lagging replica.
    log.debug("{}: on_{s}: jump_sync_target: " ++
        "candidate checkpoint is canonical (quorum: {}/{})", .{
        self.replica,
        @tagName(header.command),
        candidates_matching,
        candidates_threshold,
    });
    break :canonical true;
}
Adjusted comment:
- there aren't enough replicas in core at a given checkpoint, so they can't prove the checkpoint canonical from pings
- and they also can't learn about canonicity from commit messages, because there aren't any (see the sketch below)
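To make the two paths concrete, here is a self-contained sketch, not the real replica code: the quorum fallback mirrors the `candidates_matching >= candidates_threshold` branch quoted above, while the `commit_max >= trigger` condition is an assumption about how canonicity is proven when commit messages do flow:

```zig
const std = @import("std");

// Illustrative constant; the real value lives in src/constants.zig.
const lsm_batch_multiple = 2;

fn trigger_for_checkpoint(checkpoint: u64) u64 {
    return checkpoint + lsm_batch_multiple;
}

// Sketch only: `commit_max` is null when commit messages aren't flowing
// (view change, or an idle cluster right after the trigger).
fn checkpoint_is_canonical(args: struct {
    checkpoint: u64,
    commit_max: ?u64,
    candidates_matching: u32,
    candidates_threshold: u32,
}) bool {
    if (args.commit_max) |commit_max| {
        // Assumption: committing past the trigger proves the checkpoint canonical.
        if (commit_max >= trigger_for_checkpoint(args.checkpoint)) return true;
    }
    // Fallback: enough replicas already report this checkpoint.
    return args.candidates_matching >= args.candidates_threshold;
}

test "canonical via commit_max or via quorum" {
    try std.testing.expect(checkpoint_is_canonical(.{
        .checkpoint = 5,
        .commit_max = 7,
        .candidates_matching = 0,
        .candidates_threshold = 2,
    }));
    try std.testing.expect(!checkpoint_is_canonical(.{
        .checkpoint = 5,
        .commit_max = null,
        .candidates_matching = 1,
        .candidates_threshold = 2,
    }));
    try std.testing.expect(checkpoint_is_canonical(.{
        .checkpoint = 5,
        .commit_max = null,
        .candidates_matching = 2,
        .candidates_threshold = 2,
    }));
}
```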
self.op_checkpoint_next_trigger() -| constants.vsr_checkpoint_interval,
self.op_checkpoint_next_trigger() -| constants.vsr_checkpoint_interval * 2,
So much more readable! ❤️
src/vsr.zig (Outdated)
    // First wrap: op_checkpoint_next = 8-2-1 = 5
-   break :op constants.journal_slot_count - constants.lsm_batch_multiple - 1;
+   break :op constants.vsr_checkpoint_interval - 1;
} else {
    // Second wrap: op_checkpoint_next = 5+8-2 = 11
    // Third wrap: op_checkpoint_next = 11+8-2 = 17
Can you update these "First/Second/Third wrap" notes to include a pipeline? Also the diagram above `pub const Checkpoint`.
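For reference, the wrap arithmetic these notes describe, as a self-contained sketch. The constants are illustrative, chosen so the interval comes out to 6 and matches the 0, 5, 11, 17 example; the first-wrap condition (`op_checkpoint == 0`) and the else branch are assumptions reconstructed from the quoted diff, and the real definitions live in src/constants.zig and vsr.Checkpoint:

```zig
const std = @import("std");

// Illustrative constants (not the real ones): the checkpoint interval is
// 9 - 2 - 1 = 6, matching the 0, 5, 11, 17, ... example.
const journal_slot_count = 9;
const lsm_batch_multiple = 2;
const pipeline_prepare_queue_max = 1;
const vsr_checkpoint_interval =
    journal_slot_count - lsm_batch_multiple - pipeline_prepare_queue_max;

fn checkpoint_next(op_checkpoint: u64) u64 {
    if (op_checkpoint == 0) {
        // First wrap: ops are zero-indexed, so the first checkpoint op is
        // vsr_checkpoint_interval - 1 (the count <-> index "-1" discussed below).
        return vsr_checkpoint_interval - 1;
    } else {
        // Second wrap onwards: advance by a full checkpoint interval.
        return op_checkpoint + vsr_checkpoint_interval;
    }
}

fn checkpoint_next_trigger(op_checkpoint: u64) u64 {
    return checkpoint_next(op_checkpoint) + lsm_batch_multiple;
}

test "first/second/third wrap" {
    try std.testing.expectEqual(@as(u64, 5), checkpoint_next(0));
    try std.testing.expectEqual(@as(u64, 7), checkpoint_next_trigger(0));
    try std.testing.expectEqual(@as(u64, 11), checkpoint_next(5));
    try std.testing.expectEqual(@as(u64, 17), checkpoint_next(11));
    try std.testing.expectEqual(@as(u64, 23), checkpoint_next(17));
}
```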
@@ -1293,7 +1293,8 @@ const ViewChangeHeadersArray = struct {
    assert(headers.command == .start_view);
    // This function is only called by a replica that is lagging behind the primary's
    // checkpoint, so the start_view has a full suffix of headers.
    assert(headers.array.get(0).op >= constants.journal_slot_count);
TBH, I don't fully understand this assert. Is it checking that, if the primary is on the "next" checkpoint, then we have at least checkpoint+trigger worth of messages?
But is this actually correct? The first checkpoint is a bit shorter, due to the `- 1`, no?
Thinking out loud: `start_view_into_do_view_change` is called by a replica that is transitioning from normal -> view_change, but is lagging behind the primary by at least a checkpoint, and the SV it has from the old view is from that newer checkpoint.
Using the checkpoint diagram as a mental model:
/// commit log (ops) │ write-ahead log (slots)
/// 0 4 8 2 6 0 4 │ 0---4---
/// 0 ─────✓·% │ 01234✓6% initial log fill
/// 1 ───────────✓·% │ 890✓2%45 first wrap of log
/// 2 ─────────────────✓·% │ 6✓8%0123 second wrap of log
/// 3 ───────────────────────✓·% │ 4%67890✓ third wrap of log
- In the first wrap, that means that our `self.op < 8`, so `headers.array.get(0).op >= 8`. `8 - 1` is our trigger, so it is the highest op that could possibly be in our log.
- So I think that the old assert was correct (a quick check below).
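A tiny self-contained check of that first-wrap bound, using the same illustrative numbers as the old "8 - 2 - 1 = 5" decomposition above:

```zig
const std = @import("std");

// Illustrative constants matching the old "8 - 2 - 1 = 5" note above.
const journal_slot_count = 8;
const lsm_batch_multiple = 2;

test "a first-wrap replica's highest possible op is below journal_slot_count" {
    const op_checkpoint_first = journal_slot_count - lsm_batch_multiple - 1; // 5
    const op_trigger_first = op_checkpoint_first + lsm_batch_multiple; // 7
    // The trigger is the highest op a first-wrap replica can have in its log,
    // so `headers.array.get(0).op >= journal_slot_count` implies the SV suffix
    // is from a later checkpoint than ours.
    try std.testing.expect(op_trigger_first == journal_slot_count - 1);
    try std.testing.expect(op_trigger_first < journal_slot_count);
}
```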
Got it! That `-1` in the computation of the first checkpoint threw me for a loop. I was thinking we have some fancy logic where op=0 doesn't actually go into the log, because it's a root op or something (so, the first checkpoint is physically shorter than the rest). But that `-1` is just count <-> index conversion --- the first checkpoint has exactly `checkpoint_interval` ops (like all the other checkpoints), the first op number is 0, the last op number is `len - 1`.
Force-pushed from 232d936 to eda53b8.
}
return self.superblock.working.checkpoint_id();
I think this wasn't entirely correct actually! For ops far in the future, we'd use the current checkpoint ID, rather than `null`.
Force-pushed from eda53b8 to f0dfd19.
///
/// checkpoint() call           0   1   2   3
/// op_checkpoint               0   5  11  17
/// op_checkpoint_next          5  11  17  23
/// op_checkpoint_next_trigger  7  13  19  25
///
/// commit log (ops) │ write-ahead log (slots)
/// 0 4 8 2 6 0 4 │ 0---4---
/// 0 ─────✓·% │ 01234✓6% initial log fill
/// 1 ───────────✓·% │ 890✓2%45 first wrap of log
It was a bit confusing how, on the right, we show a full wrap of the log (up to op=15), while on the left we only go up to 13. In the new version, I made the left and right show the same data.
} else {
    // Second wrap: op_checkpoint_next = 5+8-2 = 11
    // Third wrap: op_checkpoint_next = 11+8-2 = 17
I figured it would be less confusing to not decompose the checkpoint interval. After all, it is morally a parameter, and can be set arbitrarily, as long as it allows at least a trigger plus a pipeline of slack.
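As a sketch of that constraint, treating the interval as a free parameter (all values here are illustrative, not the real constants from src/constants.zig):

```zig
const assert = @import("std").debug.assert;

// Illustrative values, not the real constants.
const journal_slot_count = 9;
const lsm_batch_multiple = 2;
const pipeline_prepare_queue_max = 1;

// "Morally a parameter": any value works as long as the WAL still has room
// for one lsm bar (up to the trigger) plus a pipeline of slack after it.
const vsr_checkpoint_interval = 6;

comptime {
    assert(vsr_checkpoint_interval + lsm_batch_multiple + pipeline_prepare_queue_max <=
        journal_slot_count);
    assert(vsr_checkpoint_interval > 0);
}
```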
Force-pushed from f0dfd19 to 626d1fe.
Force-pushed from 626d1fe to 6cdc904.
src/vsr.zig (Outdated)
/// 0 4 8 2 6 0 4 │ 0 - - - 4 - - - 8
/// 0 ─────✓·% │[0 1 2 3 4 ✓]6 % R
/// 1 ───────────✓·% │ 9 0 ✓]2 % 5[6 7 8
/// 2 ─────────────────✓·% │ 8 % 1[2 3 4 5 6 ✓]
/// 3 ───────────────────────✓·% │[8 9 0 1 2 ✓]4 % 7
Numbering + brackets are off, I think?

Suggested change:
/// 0 4 8 2 6 0 4 │ 0 - - - 4 - - - 8
/// 0 ─────✓·% │ 0 1 2 3 4 ✓]6 % R
/// 1 ───────────✓·% │ 9 0 ✓]2 % 4 5 6 ✓]
/// 2 ─────────────────✓·% │ 8 % 0 1 2 ✓]4 % 6
/// 3 ───────────────────────✓·% │ 7 8 ✓]0 % 2 3 4 ✓]
I think it is correct, but there's a high chance we might be reading the chart differently, as I find the original version confusing.
E.g., in the second row/checkpoint, when I write `9 0 ✓]2 % 5[6 7 8`, that `5` is correct --- it is a 5 (not 15) from the first checkpoint which we haven't yet overwritten. And the checkpoint is `[6 7 8 9 10 11]`.
I think the original version of the chart on the right showed full log wraps, with all entries overwritten, but that I think is confusing --- we have full wraps on the right, but full checkpoints on the left.
Now that I think about it, there's no reason to compress `10` to just `0` on the right --- there's space to write full ops, ptal!
Ahh I understand now, thank you! Yes, we were reading the chart differently, but your version is much more useful than mine 🎉
Force-pushed from 6cdc904 to 5ad4bfa.
This fixes the following simulator failure:

$ git reset --hard 87cace1
$ ./zig/zig build simulator_run -Dsimulator-state-machine=testing -- 9854840972719737644

This is what is happening:

- two replicas committed the checkpoint trigger (op=59) and prepared the next op=60.
- two replicas are far behind, want to state sync, and need a canonical checkpoint.
- one replica has op=28 corrupted in the log and wants to repair it.

The key issue is that preparing op=60 already pushes op=28 out of the log, but doesn't yet make the checkpoint canonical (as op=60 isn't committed). As a result:

- the two lagging replicas can't sync, because there's no canonical checkpoint.
- the repairing replica can't catch up, because the corrupted op is missing from the cluster.
- the two up-to-date replicas can't commit op=60, because there isn't a prepare quorum with a fresh checkpoint.

Taking a step back, the core issue here is that we don't have enough _physical_ durability for op=28. _Logically_, it is present on three replicas: once in a WAL (where it got corrupted) and twice as data in the checkpoint. However, as these are two different representations, the cluster is not always capable of reconciling them.

It is useful to consider a different scenario of corruptions here: suppose that a single grid block is corrupted on both checkpoints, and some prepare is corrupted in the log of a catching-up replica. There are three corruptions, but as only two are correlated, the cluster should be live. However, in this situation effectively any corruption in the grid "intersects" with any corruption in the WAL.

The solution here is to _extend_ the durability of prepares, such that we only start overwriting prepares from the old checkpoint once the new one is confirmed canonical. Before, the WAL looked like this:

```
|<-checkpoint->|<-lsm_bar->|
                          ^ trigger
```

We shorten the checkpoint part to make sure we can fit an extra pipeline worth of messages in:

```
|<-checkpoint->|<-lsm_bar->|<-pipeline->|
                          ^ trigger
```

This gives us the following invariant: a prepare can only be evicted from the WAL once there's a _quorum_ of checkpoints covering this prepare.
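A small worked example of this layout, with illustrative constants chosen so the interval is 6 as in the diagrams above; the `slot` helper is a plain modulo sketch of WAL slot assignment, not the real Journal code:

```zig
const std = @import("std");

// Illustrative constants: WAL = checkpoint interval + lsm bar + pipeline.
const journal_slot_count = 9;
const lsm_batch_multiple = 2;
const pipeline_prepare_queue_max = 1;
const vsr_checkpoint_interval =
    journal_slot_count - lsm_batch_multiple - pipeline_prepare_queue_max;

fn slot(op: u64) u64 {
    return op % journal_slot_count;
}

test "first wrap: pipeline slack after the trigger" {
    // First wrap: the checkpoint covers ops 0..5 and the trigger is op 7.
    const op_checkpoint: u64 = vsr_checkpoint_interval - 1; // 5
    const op_trigger: u64 = op_checkpoint + lsm_batch_multiple; // 7

    // The pipeline of ops prepared right after the trigger lands in slots past
    // the trigger, so no prepare from ops 0..op_trigger is overwritten until
    // the new checkpoint has had a chance to become canonical.
    var op: u64 = op_trigger + 1;
    while (op <= op_trigger + pipeline_prepare_queue_max) : (op += 1) {
        try std.testing.expect(slot(op) > op_trigger);
    }
}
```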
Force-pushed from 5ad4bfa to a88f896.
Thanks to #1346, the first ops prepared immediately after a checkpoint no longer overwrite prepares from the previous checkpoint, which neatly solves this issue (without needing async checkpoints)!
Thanks to #1346, there is no risk in an R=2 cluster of the leading replica overwriting a prepare required by the lagging replica. This workaround is not needed anymore! (Removing the `"Cluster: sync: checkpoint diverges, sync (primary diverges)"` test in a separate commit to emphasize that the test still passes; it just isn't really needed anymore since it doesn't exercise anything interesting.)
Thanks to #1346, this test doesn't actually test anything interesting.