kvserver: fail gracefully in TestLeaseTransferRejectedIfTargetNeedsSnapshot #107526
…apshot

We saw this test hang in CI. What likely happened (according to the stacks) is that a lease transfer that was supposed to be caught by an interceptor never showed up in the interceptor. The most likely explanation is that it errored out before it got to evaluation. It then signaled a channel the test was only prepared to check later, so the test hung (waiting for a channel that was now never to be touched).

This test is hard to maintain. It would be great (though, for now, out of reach) to write tests like it in a deterministic framework[^1]

[^1]: see cockroachdb#105177.

For now, fix the test so that when the (so far unknown) error rears its head again, it will fail properly, so we get to see the error and can take another pass at fixing the test (separately). Stressing this commit[^2], we get:

> transferErrC unexpectedly signaled: /Table/Max: transfer lease unexpected
> error: refusing to transfer lease to (n3,s3):3 because target may need a Raft
> snapshot: replica in StateProbe

This makes sense. The test wants to exercise the below-raft mechanism, but the above-raft mechanism also exists and while we didn't want to interact with it, we sometimes do[^1]

[^1]: somewhat related to cockroachdb#107524

[^2]: `./dev test --filter TestLeaseTransferRejectedIfTargetNeedsSnapshot --stress ./pkg/kv/kvserver/` on gceworker, 285s

Touches cockroachdb#106383.

Epic: None
Release note: None
See previous commit. We sometimes hit the above-raft check when we wanted to hit only the below-raft one. This commit fixes this by selectively disabling the above-raft check in this test.

Note that not all tests using lease transfers are susceptible to this problem in the way that this test is. This is because this test also lowers the `LeaseTransferRejectedRetryLoopCount`[^1] because it is intentionally manufacturing failed lease transfers and doesn't want to sit out the retry loop. It is that time-saving optimization that also allows the spurious error to bubble up.

[^1]: https://github.com/cockroachdb/cockroach/blob/66c9f93ae86bddd7ba4c5f6a6b8b6cb700ca23ce/pkg/kv/kvserver/testing_knobs.go#L375

Epic: none
Release note: None
The new testing knob isn't necessary; you can use `AdminTransferLeaseBypassingSafetyChecks`:
Lines 666 to 675 in adfa97f
    // AdminTransferLeaseBypassingSafetyChecks is like AdminTransferLease, but
    // configures the lease transfer to bypass safety checks. See the comment on
    // AdminTransferLeaseRequest.BypassSafetyChecks for details.
    func (db *DB) AdminTransferLeaseBypassingSafetyChecks(
    	ctx context.Context, key interface{}, target roachpb.StoreID,
    ) error {
    	b := &Batch{}
    	b.adminTransferLease(key, target, true /* bypassSafetyChecks */)
    	return getOneErr(db.Run(ctx, b), b)
    }

That disables both checks (above and below raft), but I need the below-raft check to stay active.

TFTR!

bors r=erikgrinaker

Build succeeded:

We saw this test hang in CI. What likely happened (according to the stacks) is
that a lease transfer that was supposed to be caught by an interceptor never
showed up in the interceptor. The most likely explanation is that it errored
out before it got to evaluation. It then signaled a channel the test was only
prepared to check later, so the test hung (waiting for a channel that was now
never to be touched).
This test is hard to maintain. It would be great (though, for now, out of reach)
to write tests like it in a deterministic framework.[^1]
For now, fix the test so that when the (so far unknown) error rears its
head again, it will fail properly, so we get to see the error and can
take another pass at fixing the test (separately). Stressing
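this commit[^2], we get:

> transferErrC unexpectedly signaled: /Table/Max: transfer lease unexpected
> error: refusing to transfer lease to (n3,s3):3 because target may need a Raft
> snapshot: replica in StateProbe
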
This makes sense. The test wants to exercise the below-raft mechanism, but
the above-raft mechanism also exists and while we didn't want to interact
with it, we sometimes do.[^1]
The second commit introduces a testing knob that disables the above-raft
mechanism selectively. I've stressed the test for 15 minutes without issues
after this change.
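
As a rough model of what such a knob does, and of why bypassing all safety checks was not an option for this test, consider the sketch below. Everything in it is hypothetical: `testingKnobs`, `disableAboveRaftCheck`, `transferLease`, and the two error values are invented for illustration and do not correspond to kvserver identifiers, and the real checks look at more state than one boolean.

```go
package main

import (
	"errors"
	"fmt"
)

// testingKnobs is a hypothetical stand-in for a set of testing knobs.
type testingKnobs struct {
	// disableAboveRaftCheck skips only the above-raft safety check. The
	// below-raft check stays active.
	disableAboveRaftCheck bool
}

var (
	errAboveRaft = errors.New("above-raft check: target may need a Raft snapshot")
	errBelowRaft = errors.New("below-raft check: target may need a Raft snapshot")
)

// transferLease models two layers of protection against transferring a lease
// to a replica that may need a snapshot. If bypassSafetyChecks is set, both
// layers are skipped. The knob skips only the first one.
func transferLease(knobs testingKnobs, bypassSafetyChecks, targetNeedsSnapshot bool) error {
	if bypassSafetyChecks {
		return nil
	}
	if !knobs.disableAboveRaftCheck && targetNeedsSnapshot {
		return errAboveRaft
	}
	if targetNeedsSnapshot {
		return errBelowRaft
	}
	return nil
}

func main() {
	// Without the knob, the above-raft check can reject the transfer first.
	fmt.Println(transferLease(testingKnobs{}, false, true))
	// With the knob, the rejection comes from the below-raft check, which is
	// the mechanism the test wants to exercise.
	fmt.Println(transferLease(testingKnobs{disableAboveRaftCheck: true}, false, true))
	// Bypassing all safety checks would leave nothing for the test to observe.
	fmt.Println(transferLease(testingKnobs{}, true, true))
}
```
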
Fixes #106383.
Epic: None
Release note: None
[^1]: see https://github.com/cockroachdb/cockroach/issues/105177.

[^2]: `./dev test --filter TestLeaseTransferRejectedIfTargetNeedsSnapshot --stress ./pkg/kv/kvserver/` on gceworker, 285s