Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: test *Replica like *raft.RawNode #105177

Open
tbg opened this issue Jun 20, 2023 · 1 comment
Open

kvserver: test *Replica like *raft.RawNode #105177

tbg opened this issue Jun 20, 2023 · 1 comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@tbg
Copy link
Member

tbg commented Jun 20, 2023

There are lots of "situations of interest" that we attempt to exercise by setting
up a TestCluster and an often complicated web of interceptors, all to
enact a desired "interesting" state of the system. This often involves a
delicate dance between wanting to get into this state quickly (not adding a
slow test) but also not having it flake due to excessively low timeout (or
because some unexpected moving part interferes with the test). There is a time
and place for such a test, but we are currently forced into them because we
often lack a direct way to test at more intermediate levels. This is a problem.

We've seen the benefits of architecting for testability in etcd-io/raft
(applied to *RawNode). A *RawNode has no concurrency, i.e. all access to it
is serial. A set of *RawNodes forms a raft group. The test harness controls
in which order messages get delivered and processed. Verbose logs that result
from handling messages become part of the datadriven output. Everything is
deterministic. The test harness (introduced by us!) combined with the
concurrency-free, deterministic architecture has significantly improved the
ease and quality of testing12 in etcd/raft.

It stands to reason that the same approach could yield similar benefits if
applied to at least the lower slices in the Replica.Send stack.

Squinting a little (lot), a *Replica that is only accessed in a
single-threaded way is just a *RawNode: it can receive various kinds of
messages (e.g. BatchRequest, raftpb.Message, etc), which it handles
primarily via the raft handling loop; a set of Replicas (for the same range) is
like a Raft group.

Not all of the surface of Replica lends itself to this approach (after all,
some bugs will only surface with concurrency on the Replica, some requests today trigger work on a goroutine, etc) but my intuition and experience
is that with each slice of the stack that we make accessible to the above
strategy will confer a significant improvement of quality.

This broad suggestion leaves much to the imagination and perhaps doesn't make
it clear that in its entirety, this is a significant undertaking. However, small
slices of the stack can be tackled in a bottom-up order and in fact, this is
already happening and has been well-received by all participants:

If we could unit test log application, conceivably it would be much easier to
do projects such as #97779 because
we'd be able to directly explore the edge cases in this code, whereas currently
we have to rely on either contrived, hard-to-maintain (and non-exhaustive)
tests, plus randomized tests (which come with their own challenges).

A first "slice" to tackle could be improving the test in
#104401: it creates an
"interesting" raft log programmatically (without even a *Replica), but
currently jumps through some hoops to do so. We could fill these gaps by
successive small refactors, until we have the ability to create arbitrary Raft
logs.

If we then tackled #75729,
i.e. gain the ability to also "apply" these logs in a setting with few moving
parts, we are well underway to being able to comprehensively test the apply
pipeline, which is just not possible today.

We could then work on our ability to instantiate a *Replica "without moving
parts" (i.e. no queues, Raft scheduler, etc) and start writing deterministic
tests.

Jira issue: CRDB-28912

Epic CRDB-39898

Footnotes

  1. https://github.com/etcd-io/raft/blob/6bf4f7fe3122b064e0a0d76314298dca6f379fc7/interaction_test.go#L25-L35

  2. https://github.com/etcd-io/raft/blob/7302ee6f8351076b10a13aa00fb6200de8693e58/testdata/confchange_v2_add_double_auto.txt

@tbg tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv-replication labels Jun 20, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jun 20, 2023

cc @cockroachdb/replication

tbg added a commit to tbg/cockroach that referenced this issue Jul 13, 2023
…WithNewLeaseIndexError

These tests don't work with v2 and aren't maintainable or super useful anyway.

We have good randomized coverage of these code paths now via kvnemesis and
TestProposalsWithInjectedLeaseIndexAndReproposalError, and adding better unit
test coverage is tracked in
cockroachdb#106504. In addition to this, it
would be desirable to extend this unit test coverage upwards through the stack
via cockroachdb#105177.
tbg added a commit to tbg/cockroach that referenced this issue Jul 14, 2023
This addresses the following race:

- n1 runs a ConfChange that adds n2 as a learner.
- n1 sends MsgApp to the learner.
- n1 starts the INITIAL snapshot, say at index 100.
- n2 receives n1's MsgApp. Since it's an uninitialized
  Replica and its log is empty, it rejects this MsgApp.
- n2 receives and applies the INITIAL snapshot, which prompts it to send an
  affirmative MsgAppResp to n1.
- n1's RawNode now tracks n2 as StateProbe (via call to ReportSnapshot(success))
- n1 receives the MsgApp rejection; n2 regresses to StateSnapshot because the
  rejection comes with a RejectHint (suggested next index to try) of zero,
  which is not in n1's log. In particular, the SnapshotIndex will likely be
  higher than the index of the snapshot actually sent, say 101.
- n1 receives the affirmative MsgAppResp (for index 100). However, 100 < 101
  so this is ignored and the follower remains in StateSnapshot.

With this commit, the last two steps cannot happen: n2 transitions straight to
StateReplicate because we step a copy of the affirmative MsgAppResp in. The
later rejection will be dropped, since it is stale (you can't hint at index zero
when you already have a positive index confirmed).

I will add that there is no great testing for the above other than stressing
the test with additional logging, noting the symptoms, and noting that they
disappear with this commit. Scripted testing of this code is within reach[^1]
but is outside of the scope of this PR.

[^1]: cockroachdb#105177

There is an additional bit of brittleness that is silently suppressed by this
commit, but which deserves to be fixed independently because how the problem
gets avoided seems accidental and incomplete. When raft requests a snapshot, it notes its
current LastIndex and uses it as the PendingSnapshot for the follower's
Progress.

At the time of writing, MsgAppResp that reconnect the follower to the log but
which are not greater than or equal to PendingSnapshot are ignored. In effect,
this means that perfectly good snapshots are thrown away if they happen to be a
little bit stale. In the example above, the snapshot is stale: PendingSnapshot
is 101, but the snapshot is at index 100. Then how does this commit (mostly)
fix the problem, i.e. why isn't the snapshot discarded? The key is that when we
synchronously step the MsgAppResp(100) into the leader's RawNode, the rejection
hasn't arrived yet and so the follower transitions into StateReplicate with a
Match of 100. This is then enough so that raft recognizes the rejected MsgApp
as stale (since it would regress on durably stored entries). However, there is
an alternative example where the rejection arrives earlier: after the snapshot
index has been picked, but before the follower has been transitioned into
StateReplicate. For this to have a negative effect, an entry has to be appended
to the leader's log between generating the snapshot and handling the rejection.
Without the combination of delegated snapshots and sustained write activity on
the leader, this window is small, and this combination is usually not present
in tests but it may well be relevant in "real" clusters. We track addressing
this in cockroachdb#106813.

Closes cockroachdb#87554.
Closes cockroachdb#97971.
Closes cockroachdb#84242.

Epic: None
Release note (bug fix): removed a source of unnecessary Raft snapshots during
replica movement.
tbg added a commit to tbg/cockroach that referenced this issue Jul 20, 2023
This addresses the following race:

- n1 runs a ConfChange that adds n2 as a learner.
- n1 sends MsgApp to the learner.
- n1 starts the INITIAL snapshot, say at index 100.
- n2 receives n1's MsgApp. Since it's an uninitialized
  Replica and its log is empty, it rejects this MsgApp.
- n2 receives and applies the INITIAL snapshot, which prompts it to send an
  affirmative MsgAppResp to n1.
- n1's RawNode now tracks n2 as StateProbe (via call to ReportSnapshot(success))
- n1 receives the MsgApp rejection; n2 regresses to StateSnapshot because the
  rejection comes with a RejectHint (suggested next index to try) of zero,
  which is not in n1's log. In particular, the SnapshotIndex will likely be
  higher than the index of the snapshot actually sent, say 101.
- n1 receives the affirmative MsgAppResp (for index 100). However, 100 < 101
  so this is ignored and the follower remains in StateSnapshot.

With this commit, the last two steps cannot happen: n2 transitions straight to
StateReplicate because we step a copy of the affirmative MsgAppResp in. The
later rejection will be dropped, since it is stale (you can't hint at index zero
when you already have a positive index confirmed).

I will add that there is no great testing for the above other than stressing
the test with additional logging, noting the symptoms, and noting that they
disappear with this commit. Scripted testing of this code is within reach[^1]
but is outside of the scope of this PR.

[^1]: cockroachdb#105177

There is an additional bit of brittleness that is silently suppressed by this
commit, but which deserves to be fixed independently because how the problem
gets avoided seems accidental and incomplete. When raft requests a snapshot, it notes its
current LastIndex and uses it as the PendingSnapshot for the follower's
Progress.

At the time of writing, MsgAppResp that reconnect the follower to the log but
which are not greater than or equal to PendingSnapshot are ignored. In effect,
this means that perfectly good snapshots are thrown away if they happen to be a
little bit stale. In the example above, the snapshot is stale: PendingSnapshot
is 101, but the snapshot is at index 100. Then how does this commit (mostly)
fix the problem, i.e. why isn't the snapshot discarded? The key is that when we
synchronously step the MsgAppResp(100) into the leader's RawNode, the rejection
hasn't arrived yet and so the follower transitions into StateReplicate with a
Match of 100. This is then enough so that raft recognizes the rejected MsgApp
as stale (since it would regress on durably stored entries). However, there is
an alternative example where the rejection arrives earlier: after the snapshot
index has been picked, but before the follower has been transitioned into
StateReplicate. For this to have a negative effect, an entry has to be appended
to the leader's log between generating the snapshot and handling the rejection.
Without the combination of delegated snapshots and sustained write activity on
the leader, this window is small, and this combination is usually not present
in tests but it may well be relevant in "real" clusters. We track addressing
this in cockroachdb#106813.

Closes cockroachdb#87554.
Closes cockroachdb#97971.
Closes cockroachdb#84242.

Epic: None
Release note (bug fix): removed a source of unnecessary Raft snapshots during
replica movement.
tbg added a commit to tbg/cockroach that referenced this issue Jul 21, 2023
This addresses the following race:

- n1 runs a ConfChange that adds n2 as a learner.
- n1 sends MsgApp to the learner.
- n1 starts the INITIAL snapshot, say at index 100.
- n2 receives n1's MsgApp. Since it's an uninitialized
  Replica and its log is empty, it rejects this MsgApp.
- n2 receives and applies the INITIAL snapshot, which prompts it to send an
  affirmative MsgAppResp to n1.
- n1's RawNode now tracks n2 as StateProbe (via call to ReportSnapshot(success))
- n1 receives the MsgApp rejection; n2 regresses to StateSnapshot because the
  rejection comes with a RejectHint (suggested next index to try) of zero,
  which is not in n1's log. In particular, the SnapshotIndex will likely be
  higher than the index of the snapshot actually sent, say 101.
- n1 receives the affirmative MsgAppResp (for index 100). However, 100 < 101
  so this is ignored and the follower remains in StateSnapshot.

With this commit, the last two steps cannot happen: n2 transitions straight to
StateReplicate because we step a copy of the affirmative MsgAppResp in. The
later rejection will be dropped, since it is stale (you can't hint at index zero
when you already have a positive index confirmed).

I will add that there is no great testing for the above other than stressing
the test with additional logging, noting the symptoms, and noting that they
disappear with this commit. Scripted testing of this code is within reach[^1]
but is outside of the scope of this PR.

[^1]: cockroachdb#105177

There is an additional bit of brittleness that is silently suppressed by this
commit, but which deserves to be fixed independently because how the problem
gets avoided seems accidental and incomplete. When raft requests a snapshot, it notes its
current LastIndex and uses it as the PendingSnapshot for the follower's
Progress.

At the time of writing, MsgAppResp that reconnect the follower to the log but
which are not greater than or equal to PendingSnapshot are ignored. In effect,
this means that perfectly good snapshots are thrown away if they happen to be a
little bit stale. In the example above, the snapshot is stale: PendingSnapshot
is 101, but the snapshot is at index 100. Then how does this commit (mostly)
fix the problem, i.e. why isn't the snapshot discarded? The key is that when we
synchronously step the MsgAppResp(100) into the leader's RawNode, the rejection
hasn't arrived yet and so the follower transitions into StateReplicate with a
Match of 100. This is then enough so that raft recognizes the rejected MsgApp
as stale (since it would regress on durably stored entries). However, there is
an alternative example where the rejection arrives earlier: after the snapshot
index has been picked, but before the follower has been transitioned into
StateReplicate. For this to have a negative effect, an entry has to be appended
to the leader's log between generating the snapshot and handling the rejection.
Without the combination of delegated snapshots and sustained write activity on
the leader, this window is small, and this combination is usually not present
in tests but it may well be relevant in "real" clusters. We track addressing
this in cockroachdb#106813.

Closes cockroachdb#87554.
Closes cockroachdb#97971.
Closes cockroachdb#84242.

Epic: None
Release note (bug fix): removed a source of unnecessary Raft snapshots during
replica movement.
craig bot pushed a commit that referenced this issue Jul 21, 2023
106793: kvserver: communicate snapshot index back along with snapshot response r=erikgrinaker a=tbg

This addresses the following race:

- n1 runs a ConfChange that adds n2 as a learner.
- n1 sends MsgApp to the learner.
- n1 starts the INITIAL snapshot, say at index 100.
- n2 receives n1's MsgApp. Since it's an uninitialized
  Replica and its log is empty, it rejects this MsgApp.
- n2 receives and applies the INITIAL snapshot, which prompts it to send an
  affirmative MsgAppResp to n1.
- n1's RawNode now tracks n2 as StateProbe (via call to ReportSnapshot(success))
- n1 receives the MsgApp rejection; n2 regresses to StateSnapshot because the
  rejection comes with a RejectHint (suggested next index to try) of zero,
  which is not in n1's log. In particular, the SnapshotIndex will likely be
  higher than the index of the snapshot actually sent, say 101.
- n1 receives the affirmative MsgAppResp (for index 100). However, 100 < 101
  so this is ignored and the follower remains in StateSnapshot.

With this commit, the last two steps cannot happen: n2 transitions straight to
StateReplicate because we step a copy of the affirmative MsgAppResp in. The
later rejection will be dropped, since it is stale (you can't hint at index zero
when you already have a positive index confirmed).

I will add that there is no great testing for the above other than stressing
the test with additional logging, noting the symptoms, and noting that they
disappear with this commit. Scripted testing of this code is within reach[^1]
but is outside of the scope of this PR.

[^1]: #105177

There is an additional bit of brittleness that is silently suppressed by this
commit, but which deserves to be fixed independently because how the problem
gets avoided seems accidental and incomplete. When raft requests a snapshot, it notes its
current LastIndex and uses it as the PendingSnapshot for the follower's
Progress.

At the time of writing, MsgAppResp that reconnect the follower to the log but
which are not greater than or equal to PendingSnapshot are ignored. In effect,
this means that perfectly good snapshots are thrown away if they happen to be a
little bit stale. In the example above, the snapshot is stale: PendingSnapshot
is 101, but the snapshot is at index 100. Then how does this commit (mostly)
fix the problem, i.e. why isn't the snapshot discarded? The key is that when we
synchronously step the MsgAppResp(100) into the leader's RawNode, the rejection
hasn't arrived yet and so the follower transitions into StateReplicate with a
Match of 100. This is then enough so that raft recognizes the rejected MsgApp
as stale (since it would regress on durably stored entries). However, there is
an alternative example where the rejection arrives earlier: after the snapshot
index has been picked, but before the follower has been transitioned into
StateReplicate. For this to have a negative effect, an entry has to be appended
to the leader's log between generating the snapshot and handling the rejection.
Without the combination of delegated snapshots and sustained write activity on
the leader, this window is small, and this combination is usually not present
in tests but it may well be relevant in "real" clusters. We track addressing
this in #106813.

Closes #87554.
Closes #97971.
Closes #84242.

Epic: None
Release note (bug fix): removed a source of unnecessary Raft snapshots during
replica movement.


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Andrew Baptist <baptist@cockroachlabs.com>
THardy98 pushed a commit to THardy98/cockroach that referenced this issue Jul 24, 2023
This addresses the following race:

- n1 runs a ConfChange that adds n2 as a learner.
- n1 sends MsgApp to the learner.
- n1 starts the INITIAL snapshot, say at index 100.
- n2 receives n1's MsgApp. Since it's an uninitialized
  Replica and its log is empty, it rejects this MsgApp.
- n2 receives and applies the INITIAL snapshot, which prompts it to send an
  affirmative MsgAppResp to n1.
- n1's RawNode now tracks n2 as StateProbe (via call to ReportSnapshot(success))
- n1 receives the MsgApp rejection; n2 regresses to StateSnapshot because the
  rejection comes with a RejectHint (suggested next index to try) of zero,
  which is not in n1's log. In particular, the SnapshotIndex will likely be
  higher than the index of the snapshot actually sent, say 101.
- n1 receives the affirmative MsgAppResp (for index 100). However, 100 < 101
  so this is ignored and the follower remains in StateSnapshot.

With this commit, the last two steps cannot happen: n2 transitions straight to
StateReplicate because we step a copy of the affirmative MsgAppResp in. The
later rejection will be dropped, since it is stale (you can't hint at index zero
when you already have a positive index confirmed).

I will add that there is no great testing for the above other than stressing
the test with additional logging, noting the symptoms, and noting that they
disappear with this commit. Scripted testing of this code is within reach[^1]
but is outside of the scope of this PR.

[^1]: cockroachdb#105177

There is an additional bit of brittleness that is silently suppressed by this
commit, but which deserves to be fixed independently because how the problem
gets avoided seems accidental and incomplete. When raft requests a snapshot, it notes its
current LastIndex and uses it as the PendingSnapshot for the follower's
Progress.

At the time of writing, MsgAppResp that reconnect the follower to the log but
which are not greater than or equal to PendingSnapshot are ignored. In effect,
this means that perfectly good snapshots are thrown away if they happen to be a
little bit stale. In the example above, the snapshot is stale: PendingSnapshot
is 101, but the snapshot is at index 100. Then how does this commit (mostly)
fix the problem, i.e. why isn't the snapshot discarded? The key is that when we
synchronously step the MsgAppResp(100) into the leader's RawNode, the rejection
hasn't arrived yet and so the follower transitions into StateReplicate with a
Match of 100. This is then enough so that raft recognizes the rejected MsgApp
as stale (since it would regress on durably stored entries). However, there is
an alternative example where the rejection arrives earlier: after the snapshot
index has been picked, but before the follower has been transitioned into
StateReplicate. For this to have a negative effect, an entry has to be appended
to the leader's log between generating the snapshot and handling the rejection.
Without the combination of delegated snapshots and sustained write activity on
the leader, this window is small, and this combination is usually not present
in tests but it may well be relevant in "real" clusters. We track addressing
this in cockroachdb#106813.

Closes cockroachdb#87554.
Closes cockroachdb#97971.
Closes cockroachdb#84242.

Epic: None
Release note (bug fix): removed a source of unnecessary Raft snapshots during
replica movement.
tbg added a commit to tbg/cockroach that referenced this issue Jul 25, 2023
…apshot

We saw this test hang in CI. What likely happened (according to the stacks) is
that a lease transfer that was supposed to be caught by an interceptor never
showed up in the interceptor. The most likely explanation is that it errored
out before it got to evaluation. It then signaled a channel the test was only
prepared to check later, so the test hung (waiting for a channel that was now
never to be touched).

This test is hard to maintain. It would be great (though, for now, out of reach)
to write tests like it in a deterministic framework[^1]

[^1]: see cockroachdb#105177.

For now, fix the test so that when the (so far unknown) error rears its
head again, it will fail properly, so we get to see the error and can
take another pass at fixing the test (separately). Stressing
this commit[^2], we get:

> transferErrC unexpectedly signaled: /Table/Max: transfer lease unexpected
> error: refusing to transfer lease to (n3,s3):3 because target may need a Raft
> snapshot: replica in StateProbe

This makes sense. The test wants to exercise the below-raft mechanism, but
the above-raft mechanism also exists and while we didn't want to interact
with it, we sometimes do[^1]

[^1]: somewhat related to cockroachdb#107524
[^2]: `./dev test --filter TestLeaseTransferRejectedIfTargetNeedsSnapshot --stress ./pkg/kv/kvserver/` on gceworker, 285s

Touches cockroachdb#106383.

Epic: None
Release note: None
craig bot pushed a commit that referenced this issue Jul 26, 2023
107265: liveness: allow registering callbacks after start r=erikgrinaker a=tbg

I discovered[^1] a deadlock scenario when multiple nodes in the cluster restart
with additional stores that need to be bootstrapped. In that case, liveness
must be running when the StoreIDs are allocated, but it is not.

Trying to address this problem, I realized that when an auxiliary Store is bootstrapped,
it will create a new replicateQueue, which will register a new callback into NodeLiveness.

But if liveness must be started at this point to fix #106706, we'll run into the assertion
that checks that we don't register callbacks on a started node liveness.

Something's got to give: we will allow registering callbacks at any given point
in time, and they'll get an initial set of notifications synchronously. I
audited the few users of RegisterCallback and this seems OK with all of them.

[^1]: #106706 (comment)

Epic: None
Release note: None


107417: kvserver: ignore RPC conn when deciding to campaign/vote r=erikgrinaker a=erikgrinaker

**kvserver: remove stale mayCampaignOnWake comment**

The comment is about a parameter that no longer exists.

**kvserver: revamp shouldCampaign/Forget tests**

**kvserver: ignore RPC conn in `shouldCampaignOnWake`**

Previously, `shouldCampaignOnWake()` used `IsLiveMapEntry.IsLive` to determine whether the leader was dead. However, this not only depends on the node's liveness, but also its RPC connectivity. This can prevent an unquiescing replica from acquiring Raft leadership if the leader is still alive but unable to heartbeat liveness, and the leader will be unable to acquire epoch leases in this case.

This patch ignores the RPC connection state when deciding whether to campaign, using only on the liveness state.

**kvserver: ignore RPC conn in `shouldForgetLeaderOnVoteRequest`**

Previously, `shouldForgetLeaderOnVoteRequest()` used `IsLiveMapEntry.IsLive` to determine whether the leader was dead. However, this not only depends on the node's liveness, but also its RPC connectivity. This can prevent granting votes to a new leader that may be attempting to acquire a epoch lease (which the current leader can't).

This patch ignores the RPC connection state when deciding whether to campaign, using only on the liveness state.

Resolves #107060.
Epic: none
Release note: None

**kvserver: remove `StoreTestingKnobs.DisableLivenessMapConnHealth`**

107424: kvserver: scale Raft entry cache size with system memory r=erikgrinaker a=erikgrinaker

The Raft entry cache size defaulted to 16 MB, which is rather small. This has been seen to cause tail latency and throughput degradation with high write volume on large nodes, correlating with a reduction in the entry cache hit rate.

This patch linearly scales the Raft entry cache size as 1/256 of total system/cgroup memory, shared evenly between all stores, with a minimum 32 MB. For example, a 32 GB 8-vCPU node will have a 128 MB entry cache.

This is a conservative default, since this memory is not accounted for in existing memory budgets nor by the `--cache` flag. We rarely see cache misses in production clusters anyway, and have seen significantly improved hit rates with this scaling (e.g. a 64 KB kv0 workload on 8-vCPU nodes increased from 87% to 99% hit rate).

Resolves #98666.
Epic: none

Release note (performance improvement): The default Raft entry cache size has been increased from 16 MB to 1/256 of system memory with a minimum of 32 MB, divided evenly between all stores. This can be configured via `COCKROACH_RAFT_ENTRY_CACHE_SIZE`.

107442: kvserver: deflake TestRequestsOnFollowerWithNonLiveLeaseholder r=erikgrinaker a=tbg

The test previously relied on aggressive liveness heartbeat expirations to
avoid running for too long. As a result, it was flaky since liveness wasn't
reliably pinned in the way the test wanted.

The hybrid manual clock allows time to jump forward at an opportune moment.
Use it here to avoid running with a tight lease interval.

On my gceworker, previously flaked within a few minutes. As of this commit, I
ran it for double-digit minutes without issue.

Fixes #107200.

Epic: None
Release note: None


107526: kvserver: fail gracefully in TestLeaseTransferRejectedIfTargetNeedsSnapshot r=erikgrinaker a=tbg

We saw this test hang in CI. What likely happened (according to the stacks) is
that a lease transfer that was supposed to be caught by an interceptor never
showed up in the interceptor. The most likely explanation is that it errored
out before it got to evaluation. It then signaled a channel the test was only
prepared to check later, so the test hung (waiting for a channel that was now
never to be touched).

This test is hard to maintain. It would be great (though, for now, out of reach)
to write tests like it in a deterministic framework[^1]

[^1]: see #105177.

For now, fix the test so that when the (so far unknown) error rears its
head again, it will fail properly, so we get to see the error and can
take another pass at fixing the test (separately). Stressing
this commit[^2], we get:

> transferErrC unexpectedly signaled: /Table/Max: transfer lease unexpected
> error: refusing to transfer lease to (n3,s3):3 because target may need a Raft
> snapshot: replica in StateProbe

This makes sense. The test wants to exercise the below-raft mechanism, but
the above-raft mechanism also exists and while we didn't want to interact
with it, we sometimes do[^1]

The second commit introduces a testing knob that disables the above-raft
mechanism selectively. I've stressed the test for 15 minutes without issues
after this change.

[^1]: somewhat related to #107524
[^2]: `./dev test --filter TestLeaseTransferRejectedIfTargetNeedsSnapshot --stress ./pkg/kv/kvserver/` on gceworker, 285s

Fixes #106383.

Epic: None
Release note: None


107531: kvserver: disable replicate queue and lease transfers in closedts tests r=erikgrinaker a=tbg

For a more holistic suggestion on how to fix this for the likely many other
tests susceptible to similar issues, see:
#107528

> 1171 runs so far, 0 failures, over 15m55s

Fixes #101824.

Release note: None
Epic: none


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jun 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team
Projects
No open projects
Status: Incoming
Development

No branches or pull requests

1 participant