kv/kvserver: TestRequestsOnFollowerWithNonLiveLeaseholder failed #107200

cockroach-teamcity · 2023-07-19T19:14:23Z

kv/kvserver.TestRequestsOnFollowerWithNonLiveLeaseholder failed with artifacts on master @ 30acaf93015008a672667c0d3fdc00ac95f446a6:

=== RUN   TestRequestsOnFollowerWithNonLiveLeaseholder
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/7e19515d2bcbb456ac5289a12c95b9a2/logTestRequestsOnFollowerWithNonLiveLeaseholder1391283936
    test_log_scope.go:81: use -show-logs to present logs inline
    client_raft_test.go:1439: condition failed to evaluate within 45s: from client_raft_test.go:1442: lease still valid
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/7e19515d2bcbb456ac5289a12c95b9a2/logTestRequestsOnFollowerWithNonLiveLeaseholder1391283936
--- FAIL: TestRequestsOnFollowerWithNonLiveLeaseholder (46.79s)

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/replication _{This test on roachdash | Improve this report!

Jira issue: CRDB-29943}

The text was updated successfully, but these errors were encountered:

tbg · 2023-07-24T11:13:59Z

At first glance, this is an interesting failure. The test fails here:

cockroach/pkg/kv/kvserver/client_raft_test.go

Lines 1438 to 1446 in a851fe9

    
           log.Infof(ctx, "test: waiting for lease expiration") 
        
           testutils.SucceedsSoon(t, func() error { 
        
           	leaseStatus = store0Repl.CurrentLeaseStatus(ctx) 
        
           	if leaseStatus.IsValid() { 
        
           		return errors.New("lease still valid") 
        
           	} 
        
           	return nil 
        
           }) 
        
           log.Infof(ctx, "test: lease expired")

It's literally just waiting for an epoch-based lease to expire. Shouldn't take 45s!

Looked at the logs. The test prints this when it introduces the partition that should wreck liveness

I230719 18:53:31.959585 2572 kv/kvserver_test/client_raft_test.go:1438 ⋮ [-] 319 test: waiting for lease expiration

and it definitely gets wrecked! There are tons of

E230719 18:54:06.071519 40658 kv/kvserver/replica_range_lease.go:471 ⋮ [T1,n1,s1,r9/1:‹/Table/{5-6}›] 6230 failed to heartbeat own liveness record: context canceled

and yet, 40s in,

I230719 18:54:06.071785 2572 testutils/soon.go:42 ⋮ [-] 6231 SucceedsSoon: lease still valid

This is a mostly vanilla TestCluster (mod an introduced network partition, etc) and so it ought to be on a regular clock (i.e. no manual clock that doesn't move over time).

tbg · 2023-07-24T11:27:23Z

Good news is, got a repro¹. Will add some logging, hopefully once we see the actual lease printed it's a little less mysterious.

./dev test --stress --filter TestRequestsOnFollowerWithNonLiveLeaseholder ./pkg/kv/kvserver/ on gceworker after 185s. ↩

tbg · 2023-07-24T11:31:39Z

    client_raft_test.go:1439: condition failed to evaluate within 45s: from client_raft_test.go:1442: lease still valid: {Lease:repl=(n3,s3):3 seq=2
start=1690197987.025104522,0 epo=1
pro=1690197987.025104522,0
Now:1690198028.622447513,0 RequestTime:1690198028.622447513,0 State:VALID ErrInfo: Liveness:liveness(nid:3 epo:1 exp:1690198030.317778663,0) MinValidObservedTimestamp:0,0}

The interesting part is n3. I believe it should be n1:

https://github.com/cockroachdb/cockroach/blob/cff213d53ed1569acf35cf5b12a67fb810360713/pkg/kv/kvserver/client_raft_test.go#L1398-L1399

tbg · 2023-07-24T12:20:57Z

Ah, duh, I was reading this wrong. We are checking n1, no doubt about that. But the lease is on n3 in the case in which the test fails. We're probably not managing to pin the lease appropriately.

107265: liveness: allow registering callbacks after start r=erikgrinaker a=tbg I discovered[^1] a deadlock scenario when multiple nodes in the cluster restart with additional stores that need to be bootstrapped. In that case, liveness must be running when the StoreIDs are allocated, but it is not. Trying to address this problem, I realized that when an auxiliary Store is bootstrapped, it will create a new replicateQueue, which will register a new callback into NodeLiveness. But if liveness must be started at this point to fix #106706, we'll run into the assertion that checks that we don't register callbacks on a started node liveness. Something's got to give: we will allow registering callbacks at any given point in time, and they'll get an initial set of notifications synchronously. I audited the few users of RegisterCallback and this seems OK with all of them. [^1]: #106706 (comment) Epic: None Release note: None 107417: kvserver: ignore RPC conn when deciding to campaign/vote r=erikgrinaker a=erikgrinaker **kvserver: remove stale mayCampaignOnWake comment** The comment is about a parameter that no longer exists. **kvserver: revamp shouldCampaign/Forget tests** **kvserver: ignore RPC conn in `shouldCampaignOnWake`** Previously, `shouldCampaignOnWake()` used `IsLiveMapEntry.IsLive` to determine whether the leader was dead. However, this not only depends on the node's liveness, but also its RPC connectivity. This can prevent an unquiescing replica from acquiring Raft leadership if the leader is still alive but unable to heartbeat liveness, and the leader will be unable to acquire epoch leases in this case. This patch ignores the RPC connection state when deciding whether to campaign, using only on the liveness state. **kvserver: ignore RPC conn in `shouldForgetLeaderOnVoteRequest`** Previously, `shouldForgetLeaderOnVoteRequest()` used `IsLiveMapEntry.IsLive` to determine whether the leader was dead. However, this not only depends on the node's liveness, but also its RPC connectivity. This can prevent granting votes to a new leader that may be attempting to acquire a epoch lease (which the current leader can't). This patch ignores the RPC connection state when deciding whether to campaign, using only on the liveness state. Resolves #107060. Epic: none Release note: None **kvserver: remove `StoreTestingKnobs.DisableLivenessMapConnHealth`** 107424: kvserver: scale Raft entry cache size with system memory r=erikgrinaker a=erikgrinaker The Raft entry cache size defaulted to 16 MB, which is rather small. This has been seen to cause tail latency and throughput degradation with high write volume on large nodes, correlating with a reduction in the entry cache hit rate. This patch linearly scales the Raft entry cache size as 1/256 of total system/cgroup memory, shared evenly between all stores, with a minimum 32 MB. For example, a 32 GB 8-vCPU node will have a 128 MB entry cache. This is a conservative default, since this memory is not accounted for in existing memory budgets nor by the `--cache` flag. We rarely see cache misses in production clusters anyway, and have seen significantly improved hit rates with this scaling (e.g. a 64 KB kv0 workload on 8-vCPU nodes increased from 87% to 99% hit rate). Resolves #98666. Epic: none Release note (performance improvement): The default Raft entry cache size has been increased from 16 MB to 1/256 of system memory with a minimum of 32 MB, divided evenly between all stores. This can be configured via `COCKROACH_RAFT_ENTRY_CACHE_SIZE`. 107442: kvserver: deflake TestRequestsOnFollowerWithNonLiveLeaseholder r=erikgrinaker a=tbg The test previously relied on aggressive liveness heartbeat expirations to avoid running for too long. As a result, it was flaky since liveness wasn't reliably pinned in the way the test wanted. The hybrid manual clock allows time to jump forward at an opportune moment. Use it here to avoid running with a tight lease interval. On my gceworker, previously flaked within a few minutes. As of this commit, I ran it for double-digit minutes without issue. Fixes #107200. Epic: None Release note: None 107526: kvserver: fail gracefully in TestLeaseTransferRejectedIfTargetNeedsSnapshot r=erikgrinaker a=tbg We saw this test hang in CI. What likely happened (according to the stacks) is that a lease transfer that was supposed to be caught by an interceptor never showed up in the interceptor. The most likely explanation is that it errored out before it got to evaluation. It then signaled a channel the test was only prepared to check later, so the test hung (waiting for a channel that was now never to be touched). This test is hard to maintain. It would be great (though, for now, out of reach) to write tests like it in a deterministic framework[^1] [^1]: see #105177. For now, fix the test so that when the (so far unknown) error rears its head again, it will fail properly, so we get to see the error and can take another pass at fixing the test (separately). Stressing this commit[^2], we get: > transferErrC unexpectedly signaled: /Table/Max: transfer lease unexpected > error: refusing to transfer lease to (n3,s3):3 because target may need a Raft > snapshot: replica in StateProbe This makes sense. The test wants to exercise the below-raft mechanism, but the above-raft mechanism also exists and while we didn't want to interact with it, we sometimes do[^1] The second commit introduces a testing knob that disables the above-raft mechanism selectively. I've stressed the test for 15 minutes without issues after this change. [^1]: somewhat related to #107524 [^2]: `./dev test --filter TestLeaseTransferRejectedIfTargetNeedsSnapshot --stress ./pkg/kv/kvserver/` on gceworker, 285s Fixes #106383. Epic: None Release note: None 107531: kvserver: disable replicate queue and lease transfers in closedts tests r=erikgrinaker a=tbg For a more holistic suggestion on how to fix this for the likely many other tests susceptible to similar issues, see: #107528 > 1171 runs so far, 0 failures, over 15m55s Fixes #101824. Release note: None Epic: none Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com> Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>

cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv-replication labels Jul 19, 2023

cockroach-teamcity added this to the 23.2 milestone Jul 19, 2023

exalate-issue-sync bot assigned tbg Jul 20, 2023

tbg mentioned this issue Jul 24, 2023

kvserver: TestRequestsOnFollowerWithNonLiveLeaseholder is flaky #107098

Closed

tbg mentioned this issue Jul 24, 2023

kvserver: deflake TestRequestsOnFollowerWithNonLiveLeaseholder #107442

Merged

craig bot closed this as completed in 04d3682 Jul 26, 2023

irfansharif added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-replication Relating to Raft, consensus, and coordination. labels Jul 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kv/kvserver: TestRequestsOnFollowerWithNonLiveLeaseholder failed #107200

kv/kvserver: TestRequestsOnFollowerWithNonLiveLeaseholder failed #107200

cockroach-teamcity commented Jul 19, 2023 •

edited by cockroach-jira-scripts

Loading

tbg commented Jul 24, 2023

tbg commented Jul 24, 2023

tbg commented Jul 24, 2023

tbg commented Jul 24, 2023

kv/kvserver: TestRequestsOnFollowerWithNonLiveLeaseholder failed #107200

kv/kvserver: TestRequestsOnFollowerWithNonLiveLeaseholder failed #107200

Comments

cockroach-teamcity commented Jul 19, 2023 • edited by cockroach-jira-scripts Loading

tbg commented Jul 24, 2023

tbg commented Jul 24, 2023

Footnotes

tbg commented Jul 24, 2023

tbg commented Jul 24, 2023

cockroach-teamcity commented Jul 19, 2023 •

edited by cockroach-jira-scripts

Loading