kvserver: allow reliably pinning and expiring leases in tests #107524

Open
tbg opened this issue Jul 25, 2023 · 1 comment
Labels
A-testing Testing tools and infrastructure C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) quality-friday A good issue to work on on Quality Friday T-kv KV Team

Comments


tbg commented Jul 25, 2023

In many tests[^1][^2] we "just" want the lease for range X to be on store Y and to stay there, even when the test is stressed in nightly runs, perhaps on an overloaded VM. A number of factors can get in the way of this humble goal.

  • the ability to acquire a lease is tied to raft leadership[^3][^4]
  • leases are expiration-based (either directly or through liveness)

The second point is a major source of flakiness, and a particularly annoying one: failures are rare, but they affect lots of tests, and there are no satisfying workarounds (skip the test under stress? mess with timeouts?).

Our leases have an expiration because we want the ability to fail over after an unexpected node crash. But in our TestCluster integration tests this is not a concern, and we should be able to pin leases by giving them an essentially infinite lifetime. (Such leases can still be shortened later.) Liveness makes this a little more annoying, but we could likely just use "vanilla" expiration-based leases by default, which may be the direction CRDB as a whole is heading as well[^5].

So one concrete suggestion would be for (TestCluster).TransferLease to default to handing out long-lived expiration-based leases (you'd have to opt into shorter leases with a config option, and tests would fail if nodes in the cluster were shut down without this option). Additionally, automatic lease transfers (replicate queue / store rebalancer) should be off by default. This can be seen as a refinement of the ReplicationMode. Note that there is evidence that a number of tests are "accidentally" using the default ReplicationAuto, which also allows rebalances and lease transfers - this is often explicitly counter to what the test wants[^6].

Expiring leases is another use case. The best strategy currently in use is to set up the cluster with a hybrid manual clock (which ticks like a regular clock but can also be accelerated by a delta) and to go through some arcane incantations to forward the clock by the right amount, together with a retry loop in case a lease acquisition races with the clock bump[^7]. This could be made easier, too, so that we can "just" call tc.ExpireLeases() or something like that[^8], and it should work by default.

Note: I see that we already have some way to "pin leases":

```go
// PinnedLeasesKnob is a testing knob for controlling which store can acquire a
// lease for specific ranges.
type PinnedLeasesKnob struct {
	mu struct {
		syncutil.Mutex
		pinned map[roachpb.RangeID]roachpb.StoreID
	}
}

// NewPinnedLeases creates a PinnedLeasesKnob.
func NewPinnedLeases() *PinnedLeasesKnob {
	p := &PinnedLeasesKnob{}
	p.mu.pinned = make(map[roachpb.RangeID]roachpb.StoreID)
	return p
}

// PinLease makes it so that only the specified store can take a lease for the
// specified range. Replicas on other stores attempting to acquire a lease will
// get NotLeaseHolderErrors. Lease transfers away from the pinned store are
// permitted.
func (p *PinnedLeasesKnob) PinLease(rangeID roachpb.RangeID, storeID roachpb.StoreID) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.mu.pinned[rangeID] = storeID
}
```

Judging by the comment, it works by rejecting lease acquisitions on all stores other than the pinned one (returning NotLeaseHolderErrors), rather than by restricting which raft instance can win elections. It seems like an involved hammer, but maybe it can be adopted more widely, too, and improved.

Jira issue: CRDB-30095

Epic CRDB-39956

Footnotes

[^1]: https://github.com/cockroachdb/cockroach/issues/107200#issuecomment-1647808624
[^2]: see https://github.com/cockroachdb/cockroach/issues/101824#issuecomment-1614777449
[^3]: see https://github.com/cockroachdb/cockroach/issues/107523
[^4]: I think calling TransferLease takes care of this, though, so maybe not an actual issue.
[^5]: https://github.com/cockroachdb/cockroach/issues/94592
[^6]: https://github.com/cockroachdb/cockroach/issues/101824#issuecomment-1649893853
[^7]: https://github.com/cockroachdb/cockroach/pull/107442
[^8]: see https://github.com/cockroachdb/cockroach/blob/fd6f2830764bf6ca0c94f849ad2aa9cca10fd38b/pkg/testutils/testcluster/testcluster.go#L1099-L1106 for something ~related that we could build out

@tbg tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-testing Testing tools and infrastructure T-kv-replication quality-friday A good issue to work on on Quality Friday labels Jul 25, 2023

blathers-crl bot commented Jul 25, 2023

cc @cockroachdb/replication

tbg added a commit to tbg/cockroach that referenced this issue Jul 25, 2023
…apshot

We saw this test hang in CI. What likely happened (according to the stacks) is
that a lease transfer that was supposed to be caught by an interceptor never
showed up in the interceptor. The most likely explanation is that it errored
out before it got to evaluation. It then signaled a channel the test was only
prepared to check later, so the test hung (waiting for a channel that was now
never to be touched).

This test is hard to maintain. It would be great (though, for now, out of reach)
to write tests like it in a deterministic framework[^1].

[^1]: see cockroachdb#105177.

For now, fix the test so that when the (so far unknown) error rears its
head again, it will fail properly, so we get to see the error and can
take another pass at fixing the test (separately). Stressing
this commit[^2], we get:

> transferErrC unexpectedly signaled: /Table/Max: transfer lease unexpected
> error: refusing to transfer lease to (n3,s3):3 because target may need a Raft
> snapshot: replica in StateProbe

This makes sense. The test wants to exercise the below-raft mechanism, but
the above-raft mechanism also exists, and while we didn't want to interact
with it, we sometimes do[^1].

[^1]: somewhat related to cockroachdb#107524
[^2]: `./dev test --filter TestLeaseTransferRejectedIfTargetNeedsSnapshot --stress ./pkg/kv/kvserver/` on gceworker, 285s

Touches cockroachdb#106383.

Epic: None
Release note: None
craig bot pushed a commit that referenced this issue Jul 26, 2023
107265: liveness: allow registering callbacks after start r=erikgrinaker a=tbg

I discovered[^1] a deadlock scenario when multiple nodes in the cluster restart
with additional stores that need to be bootstrapped. In that case, liveness
must be running when the StoreIDs are allocated, but it is not.

Trying to address this problem, I realized that when an auxiliary Store is bootstrapped,
it will create a new replicateQueue, which will register a new callback into NodeLiveness.

But if liveness must be started at this point to fix #106706, we'll run into the assertion
that checks that we don't register callbacks on a started node liveness.

Something's got to give: we will allow registering callbacks at any given point
in time, and they'll get an initial set of notifications synchronously. I
audited the few users of RegisterCallback and this seems OK with all of them.

[^1]: #106706 (comment)

Epic: None
Release note: None


107417: kvserver: ignore RPC conn when deciding to campaign/vote r=erikgrinaker a=erikgrinaker

**kvserver: remove stale mayCampaignOnWake comment**

The comment is about a parameter that no longer exists.

**kvserver: revamp shouldCampaign/Forget tests**

**kvserver: ignore RPC conn in `shouldCampaignOnWake`**

Previously, `shouldCampaignOnWake()` used `IsLiveMapEntry.IsLive` to determine whether the leader was dead. However, this depends not only on the node's liveness, but also on its RPC connectivity. This can prevent an unquiescing replica from acquiring Raft leadership if the leader is still alive but unable to heartbeat liveness, in which case the leader will also be unable to acquire epoch leases.

This patch ignores the RPC connection state when deciding whether to campaign, using only the liveness state.

**kvserver: ignore RPC conn in `shouldForgetLeaderOnVoteRequest`**

Previously, `shouldForgetLeaderOnVoteRequest()` used `IsLiveMapEntry.IsLive` to determine whether the leader was dead. However, this depends not only on the node's liveness, but also on its RPC connectivity. This can prevent granting votes to a new leader that may be attempting to acquire an epoch lease (which the current leader can't).

This patch ignores the RPC connection state when deciding whether to forget the leader, using only the liveness state.

Resolves #107060.
Epic: none
Release note: None

**kvserver: remove `StoreTestingKnobs.DisableLivenessMapConnHealth`**

107424: kvserver: scale Raft entry cache size with system memory r=erikgrinaker a=erikgrinaker

The Raft entry cache size defaulted to 16 MB, which is rather small. This has been seen to cause tail latency and throughput degradation with high write volume on large nodes, correlating with a reduction in the entry cache hit rate.

This patch linearly scales the Raft entry cache size as 1/256 of total system/cgroup memory, shared evenly between all stores, with a minimum of 32 MB. For example, a 32 GB 8-vCPU node will have a 128 MB entry cache.

This is a conservative default, since this memory is not accounted for in existing memory budgets nor by the `--cache` flag. We rarely see cache misses in production clusters anyway, and have seen significantly improved hit rates with this scaling (e.g. a 64 KB kv0 workload on 8-vCPU nodes increased from 87% to 99% hit rate).

Resolves #98666.
Epic: none

Release note (performance improvement): The default Raft entry cache size has been increased from 16 MB to 1/256 of system memory with a minimum of 32 MB, divided evenly between all stores. This can be configured via `COCKROACH_RAFT_ENTRY_CACHE_SIZE`.

107442: kvserver: deflake TestRequestsOnFollowerWithNonLiveLeaseholder r=erikgrinaker a=tbg

The test previously relied on aggressive liveness heartbeat expirations to
avoid running for too long. As a result, it was flaky since liveness wasn't
reliably pinned in the way the test wanted.

The hybrid manual clock allows time to jump forward at an opportune moment.
Use it here to avoid running with a tight lease interval.

On my gceworker, previously flaked within a few minutes. As of this commit, I
ran it for double-digit minutes without issue.

Fixes #107200.

Epic: None
Release note: None


107526: kvserver: fail gracefully in TestLeaseTransferRejectedIfTargetNeedsSnapshot r=erikgrinaker a=tbg

We saw this test hang in CI. What likely happened (according to the stacks) is
that a lease transfer that was supposed to be caught by an interceptor never
showed up in the interceptor. The most likely explanation is that it errored
out before it got to evaluation. It then signaled a channel the test was only
prepared to check later, so the test hung (waiting for a channel that was now
never to be touched).

This test is hard to maintain. It would be great (though, for now, out of reach)
to write tests like it in a deterministic framework[^1].

[^1]: see #105177.

For now, fix the test so that when the (so far unknown) error rears its
head again, it will fail properly, so we get to see the error and can
take another pass at fixing the test (separately). Stressing
this commit[^2], we get:

> transferErrC unexpectedly signaled: /Table/Max: transfer lease unexpected
> error: refusing to transfer lease to (n3,s3):3 because target may need a Raft
> snapshot: replica in StateProbe

This makes sense. The test wants to exercise the below-raft mechanism, but
the above-raft mechanism also exists, and while we didn't want to interact
with it, we sometimes do[^1].

The second commit introduces a testing knob that disables the above-raft
mechanism selectively. I've stressed the test for 15 minutes without issues
after this change.

[^1]: somewhat related to #107524
[^2]: `./dev test --filter TestLeaseTransferRejectedIfTargetNeedsSnapshot --stress ./pkg/kv/kvserver/` on gceworker, 285s

Fixes #106383.

Epic: None
Release note: None


107531: kvserver: disable replicate queue and lease transfers in closedts tests r=erikgrinaker a=tbg

For a more holistic suggestion on how to fix this for the likely many other
tests susceptible to similar issues, see:
#107528

> 1171 runs so far, 0 failures, over 15m55s

Fixes #101824.

Release note: None
Epic: none


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jun 28, 2024