Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/kvserver: TestClosedTimestampCantServeWithConflictingIntent failed #101824

Closed
cockroach-teamcity opened this issue Apr 19, 2023 · 9 comments · Fixed by #105897 or #107531
Closed

kv/kvserver: TestClosedTimestampCantServeWithConflictingIntent failed #101824

cockroach-teamcity opened this issue Apr 19, 2023 · 9 comments · Fixed by #105897 or #107531
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Apr 19, 2023

kv/kvserver.TestClosedTimestampCantServeWithConflictingIntent failed with artifacts on master @ a4ab6b6243f509e62993b4d81855f23470d9901a:

=== RUN   TestClosedTimestampCantServeWithConflictingIntent
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestClosedTimestampCantServeWithConflictingIntent1436482389
    test_log_scope.go:79: use -show-logs to present logs inline
    closed_timestamp_test.go:351: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/closed_timestamp_test.go:351
        	Error:      	Received unexpected error:
        	            	[NotLeaseHolderError] lease held by different store; r63: replica (n1,s1):1 not lease holder; current lease is repl=(n2,s2):3 seq=2 start=1681901373.794041544,0 exp=1681901379.793263093,0 pro=1681901373.793263093,0
        	Test:       	TestClosedTimestampCantServeWithConflictingIntent
    panic.go:540: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestClosedTimestampCantServeWithConflictingIntent1436482389
--- FAIL: TestClosedTimestampCantServeWithConflictingIntent (40.01s)

Parameters: TAGS=bazel,gss,race

Help

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

/cc @cockroachdb/replication

This test on roachdash | Improve this report!

Jira issue: CRDB-27142

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Apr 19, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Apr 19, 2023
@cockroach-teamcity
Copy link
Member Author

kv/kvserver.TestClosedTimestampCantServeWithConflictingIntent failed with artifacts on master @ 3b1bc872bb4bf8e5eb2ad8790ba8ec781167dd1f:

=== RUN   TestClosedTimestampCantServeWithConflictingIntent
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestClosedTimestampCantServeWithConflictingIntent3629078817
    test_log_scope.go:79: use -show-logs to present logs inline
    closed_timestamp_test.go:337: request unexpectedly returned, should block; err: [NotLeaseHolderError] lease held by different store; r64: replica (n1,s1):1 not lease holder; current lease is repl=(n3,s3):2 seq=2 start=1682681331.525769256,0 exp=1682681337.524812851,0 pro=1682681331.524812851,0
    panic.go:540: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestClosedTimestampCantServeWithConflictingIntent3629078817
--- FAIL: TestClosedTimestampCantServeWithConflictingIntent (40.36s)

Parameters: TAGS=bazel,gss,race

Help

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

kv/kvserver.TestClosedTimestampCantServeWithConflictingIntent failed with artifacts on master @ 4619767bb0b75ac85ddcb76e33c73211e369afed:

=== RUN   TestClosedTimestampCantServeWithConflictingIntent
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestClosedTimestampCantServeWithConflictingIntent2101771649
    test_log_scope.go:79: use -show-logs to present logs inline
    closed_timestamp_test.go:351: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/closed_timestamp_test.go:351
        	Error:      	Received unexpected error:
        	            	[NotLeaseHolderError] lease held by different store; r64: replica (n1,s1):1 not lease holder; current lease is repl=(n2,s2):3 seq=2 start=1682767595.564162603,0 exp=1682767601.563077486,0 pro=1682767595.563077486,0
        	Test:       	TestClosedTimestampCantServeWithConflictingIntent
    panic.go:540: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestClosedTimestampCantServeWithConflictingIntent2101771649
--- FAIL: TestClosedTimestampCantServeWithConflictingIntent (40.83s)

Parameters: TAGS=bazel,gss,race

Help

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

kv/kvserver.TestClosedTimestampCantServeWithConflictingIntent failed with artifacts on master @ 69875da00208dd792cb92aee54811e8a660773a9:

=== RUN   TestClosedTimestampCantServeWithConflictingIntent
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestClosedTimestampCantServeWithConflictingIntent464149897
    test_log_scope.go:79: use -show-logs to present logs inline
    closed_timestamp_test.go:351: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/closed_timestamp_test.go:351
        	Error:      	Received unexpected error:
        	            	[NotLeaseHolderError] lease held by different store; r64: replica (n1,s1):1 not lease holder; current lease is repl=(n3,s3):3 seq=2 start=1684314268.194372668,0 exp=1684314274.193422745,0 pro=1684314268.193422745,0
        	Test:       	TestClosedTimestampCantServeWithConflictingIntent
    panic.go:540: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestClosedTimestampCantServeWithConflictingIntent464149897
--- FAIL: TestClosedTimestampCantServeWithConflictingIntent (34.98s)

Parameters: TAGS=bazel,gss,race

Help

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@tbg tbg added the T-kv KV Team label Jun 30, 2023
@erikgrinaker erikgrinaker self-assigned this Jun 30, 2023
@erikgrinaker
Copy link
Contributor

All of these failures use expiration-based leases under race, and probably aren't able to maintain the leases. Submitting a PR to extend the lease duration.

@tbg
Copy link
Member

tbg commented Jul 13, 2023

In https://teamcity.cockroachdb.com/viewLog.html?buildId=10875307&tab=buildResultsDiv&buildTypeId=Cockroach_Ci_Tests_Testrace on #106145

=== RUN   TestClosedTimestampCantServeWithConflictingIntent
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestClosedTimestampCantServeWithConflictingIntent1610110799
    test_log_scope.go:81: use -show-logs to present logs inline
    closed_timestamp_test.go:342: request unexpectedly returned, should block; err: [NotLeaseHolderError] lease held by different store; r64: replica (n1,s1):1 not lease holder; current lease is repl=(n3,s3):3 seq=2 start=1689125159.587330435,0 exp=1689125219.586585395,0 pro=1689125159.586585395,0
    panic.go:540: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestClosedTimestampCantServeWithConflictingIntent1610110799
--- FAIL: TestClosedTimestampCantServeWithConflictingIntent (35.37s)

Note that exp-start is 60s, and the test still flakes.

@tbg tbg reopened this Jul 13, 2023
@tbg tbg added T-kv-replication and removed T-kv KV Team labels Jul 13, 2023
@erikgrinaker erikgrinaker removed their assignment Jul 17, 2023
@miraradeva miraradeva assigned miraradeva and unassigned miraradeva Jul 17, 2023
@andrewbaptist andrewbaptist added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-replication Relating to Raft, consensus, and coordination. labels Jul 19, 2023
@andrewbaptist andrewbaptist self-assigned this Jul 19, 2023
@erikgrinaker
Copy link
Contributor

@andrewbaptist Are you actively working on this? If not, should repl pick it up?

@andrewbaptist
Copy link
Collaborator

Repl can pick this up if you have time.

@erikgrinaker erikgrinaker assigned pav-kv and unassigned andrewbaptist Jul 25, 2023
@tbg tbg assigned tbg and unassigned pav-kv Jul 25, 2023
@tbg
Copy link
Member

tbg commented Jul 25, 2023

The error is here on the last line:

// At the intents' timestamp, reads on the leaseholder should block and reads
// on the followers should be redirected to the leaseholder, even though the
// read timestamp is below the closed timestamp.
respCh2 := runFollowerReads(txn.WriteTimestamp, false)
for i := 0; i < len(repls)-1; i++ {
err := <-respCh2
require.Error(t, err)
var lErr *kvpb.NotLeaseHolderError
require.True(t, errors.As(err, &lErr))
}
select {
case err := <-respCh2:
t.Fatalf("request unexpectedly returned, should block; err: %v", err)

That is, we just1 follower-read successfully at MinTimestamp.Prev() on all replicas and are now expecting to block when we read at MinTimestamp (since there's an intent, presumably). However, instead of that we get a NLHE.

I can imagine two reasons why that is: either the txn got aborted (however the test sets a long txn liveness threshold to avoid this2) or there was some lease change that now made the follower read impossible in the first place (so it never got to bump into the intent).

Some notes on the logs3 of this test: we spend >25s waiting for full replication. Then, pretty much exactly when the test also starts quiescing, we see a split

I230712 01:25:59.582913 1451694 kv/kvserver/replica_command.go:439 ⋮ [T1,n1,split,s1,r63/1:‹/Table/{62-106/1/0}›] 411 initiating a split of this range at key /Table/106 [r65] (‹span config›)

but note that our range is r64, so this is not anywhere our test reads (and even if it were, I don't know how it would be relevant).

I was unable to repro easily on my gceworker with

./dev test --race --filter TestLeaseTransferRejectedIfTargetNeedsSnapshot --stress ./pkg/kv/kvserver/ 2>&1 | tee stress.log
48 runs so far, 0 failures, over 5m40s

Footnotes

  1. https://github.com/cockroachdb/cockroach/blob/193e75aa2cc6b12c583ed4da80236c5408c942b5/pkg/kv/kvserver/closed_timestamp_test.go#L322-L328

  2. https://github.com/cockroachdb/cockroach/blob/193e75aa2cc6b12c583ed4da80236c5408c942b5/pkg/kv/kvserver/closed_timestamp_test.go#L267-L270

  3. https://teamcity.cockroachdb.com/viewLog.html?buildId=10875307&buildTypeId=Cockroach_Ci_Tests_Testrace&tab=artifacts#%2Ftmp%2F_tmp%2F33e1d369c27b9c01b2b6009c561815a3%2FlogTestClosedTimestampCantServeWithConflictingIntent1610110799

@tbg
Copy link
Member

tbg commented Jul 25, 2023

Ah nevermind this explains it

I230712 01:25:59.586438 1451847 13@kv/kvserver/replicate_queue.go:1031 ⋮ [T1,n1,replicate,s1,r64/1:‹/{Table/106/1/0-Max}›] 1104 transferring lease to s3

The test expects leases to be stable. But its using ReplicationAuto so leases can be transferred - another case of #107524 though we can be less obviously incorrect in this test.

Note that #107179 also uses the helper that sets up this cluster, so it too might see errant lease transfers, perhaps they are sharing a root cause.

craig bot pushed a commit that referenced this issue Jul 26, 2023
107265: liveness: allow registering callbacks after start r=erikgrinaker a=tbg

I discovered[^1] a deadlock scenario when multiple nodes in the cluster restart
with additional stores that need to be bootstrapped. In that case, liveness
must be running when the StoreIDs are allocated, but it is not.

Trying to address this problem, I realized that when an auxiliary Store is bootstrapped,
it will create a new replicateQueue, which will register a new callback into NodeLiveness.

But if liveness must be started at this point to fix #106706, we'll run into the assertion
that checks that we don't register callbacks on a started node liveness.

Something's got to give: we will allow registering callbacks at any given point
in time, and they'll get an initial set of notifications synchronously. I
audited the few users of RegisterCallback and this seems OK with all of them.

[^1]: #106706 (comment)

Epic: None
Release note: None


107417: kvserver: ignore RPC conn when deciding to campaign/vote r=erikgrinaker a=erikgrinaker

**kvserver: remove stale mayCampaignOnWake comment**

The comment is about a parameter that no longer exists.

**kvserver: revamp shouldCampaign/Forget tests**

**kvserver: ignore RPC conn in `shouldCampaignOnWake`**

Previously, `shouldCampaignOnWake()` used `IsLiveMapEntry.IsLive` to determine whether the leader was dead. However, this not only depends on the node's liveness, but also its RPC connectivity. This can prevent an unquiescing replica from acquiring Raft leadership if the leader is still alive but unable to heartbeat liveness, and the leader will be unable to acquire epoch leases in this case.

This patch ignores the RPC connection state when deciding whether to campaign, using only on the liveness state.

**kvserver: ignore RPC conn in `shouldForgetLeaderOnVoteRequest`**

Previously, `shouldForgetLeaderOnVoteRequest()` used `IsLiveMapEntry.IsLive` to determine whether the leader was dead. However, this not only depends on the node's liveness, but also its RPC connectivity. This can prevent granting votes to a new leader that may be attempting to acquire a epoch lease (which the current leader can't).

This patch ignores the RPC connection state when deciding whether to campaign, using only on the liveness state.

Resolves #107060.
Epic: none
Release note: None

**kvserver: remove `StoreTestingKnobs.DisableLivenessMapConnHealth`**

107424: kvserver: scale Raft entry cache size with system memory r=erikgrinaker a=erikgrinaker

The Raft entry cache size defaulted to 16 MB, which is rather small. This has been seen to cause tail latency and throughput degradation with high write volume on large nodes, correlating with a reduction in the entry cache hit rate.

This patch linearly scales the Raft entry cache size as 1/256 of total system/cgroup memory, shared evenly between all stores, with a minimum 32 MB. For example, a 32 GB 8-vCPU node will have a 128 MB entry cache.

This is a conservative default, since this memory is not accounted for in existing memory budgets nor by the `--cache` flag. We rarely see cache misses in production clusters anyway, and have seen significantly improved hit rates with this scaling (e.g. a 64 KB kv0 workload on 8-vCPU nodes increased from 87% to 99% hit rate).

Resolves #98666.
Epic: none

Release note (performance improvement): The default Raft entry cache size has been increased from 16 MB to 1/256 of system memory with a minimum of 32 MB, divided evenly between all stores. This can be configured via `COCKROACH_RAFT_ENTRY_CACHE_SIZE`.

107442: kvserver: deflake TestRequestsOnFollowerWithNonLiveLeaseholder r=erikgrinaker a=tbg

The test previously relied on aggressive liveness heartbeat expirations to
avoid running for too long. As a result, it was flaky since liveness wasn't
reliably pinned in the way the test wanted.

The hybrid manual clock allows time to jump forward at an opportune moment.
Use it here to avoid running with a tight lease interval.

On my gceworker, previously flaked within a few minutes. As of this commit, I
ran it for double-digit minutes without issue.

Fixes #107200.

Epic: None
Release note: None


107526: kvserver: fail gracefully in TestLeaseTransferRejectedIfTargetNeedsSnapshot r=erikgrinaker a=tbg

We saw this test hang in CI. What likely happened (according to the stacks) is
that a lease transfer that was supposed to be caught by an interceptor never
showed up in the interceptor. The most likely explanation is that it errored
out before it got to evaluation. It then signaled a channel the test was only
prepared to check later, so the test hung (waiting for a channel that was now
never to be touched).

This test is hard to maintain. It would be great (though, for now, out of reach)
to write tests like it in a deterministic framework[^1]

[^1]: see #105177.

For now, fix the test so that when the (so far unknown) error rears its
head again, it will fail properly, so we get to see the error and can
take another pass at fixing the test (separately). Stressing
this commit[^2], we get:

> transferErrC unexpectedly signaled: /Table/Max: transfer lease unexpected
> error: refusing to transfer lease to (n3,s3):3 because target may need a Raft
> snapshot: replica in StateProbe

This makes sense. The test wants to exercise the below-raft mechanism, but
the above-raft mechanism also exists and while we didn't want to interact
with it, we sometimes do[^1]

The second commit introduces a testing knob that disables the above-raft
mechanism selectively. I've stressed the test for 15 minutes without issues
after this change.

[^1]: somewhat related to #107524
[^2]: `./dev test --filter TestLeaseTransferRejectedIfTargetNeedsSnapshot --stress ./pkg/kv/kvserver/` on gceworker, 285s

Fixes #106383.

Epic: None
Release note: None


107531: kvserver: disable replicate queue and lease transfers in closedts tests r=erikgrinaker a=tbg

For a more holistic suggestion on how to fix this for the likely many other
tests susceptible to similar issues, see:
#107528

> 1171 runs so far, 0 failures, over 15m55s

Fixes #101824.

Release note: None
Epic: none


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
@craig craig bot closed this as completed in 2c72155 Jul 26, 2023
craig bot pushed a commit that referenced this issue Jul 27, 2023
107613: kvserver: avoid moving parts in closedts tests r=erikgrinaker a=tbg

Many tests (two just this last week[^1][^2]) are flaky because the replicate queue
can make rebalancing decisions that undermine the state the test is trying to
set up. Often, this happens "accidentally" because ReplicationAuto is our default
ReplicationMode.

This PR improves the situation at least for closed timestamp integration tests
by switching them all over to `ReplicationManual` (and preventing any new ones
from accidentally using `ReplicationAuto` in the future).

This can be seen as a small step towards #107528, which I am increasingly
convinced is an ocean worth boiling.

[^1]: #107179
[^2]: #101824

Epic: None
Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
blathers-crl bot pushed a commit that referenced this issue Aug 11, 2023
For a more holistic suggestion on how to fix this for the likely many other
tests susceptible to similar issues, see:
#107528

Fixes #101824.

Release note: None
Epic: none
erikgrinaker pushed a commit that referenced this issue Aug 12, 2023
For a more holistic suggestion on how to fix this for the likely many other
tests susceptible to similar issues, see:
#107528

Fixes #101824.

Release note: None
Epic: none
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Projects
None yet
6 participants