roachtest: failover/non-system/deadlock/lease=expiration failed #129980
I first thought this might be like #129918, but here the deadlock-failure SQL query times out despite the cluster supposedly being completely healthy. I also don't see the timeout server-side. The timing is odd too: we issue the call, and then it only fails ~42s later, despite the 20s timeout at pkg/cmd/roachtest/tests/failover.go line 1448 (at 1af6df1).
The logs for n4 show nothing related to this. There are a lot of snapshots, though, which might be caused by this code in the test (pkg/cmd/roachtest/tests/failover.go#L843-L846):

```go
// Ranges may occasionally escape their constraints. Move them
// to where they should be.
relocateRanges(t, ctx, conn, `database_name = 'kv'`, []int{1, 2, 3}, []int{4, 5, 6})
relocateRanges(t, ctx, conn, `database_name != 'kv'`, []int{node}, []int{1, 2, 3})
```

I had the theory that the range-deadlock query was attempting to deadlock multiple ranges serially (the deadlocker does that). This could deadlock the deadlocker itself: if the first deadlocked replica ends up deadlocking the store's mutex, then trying to acquire another replica's mutex via the store afterwards would also deadlock. However, it looks like we fell over on the very first attempt to deadlock anything. There's little in the logs; everything looks tame to me. The stacks for n4 show no deadlocked goroutines. In fact, n4 has 1001 ranges according to the debug.zip, but surely locking the first one shouldn't take that long? I don't think I understand what's going on here.
I checked the background workload that's running in this test, and it's performing without a hitch, even though the ranges are on n4-n6 (and the gateways are n1-n3). Based on that, I'm removing the release-blocker label.
#129995 might fix this too, but more in the sense of papering over whatever went wrong here, which is unfortunate.
I echo the puzzlement here. I'm also surprised to see no mention of
Do we want to close this out once #129995 lands? Other than landing that patch, this looks pretty unactionable.
Aspirationally fixed in #129995. |
roachtest.failover/non-system/deadlock/lease=expiration failed with artifacts on master @ 8551145a0c99c4c95a28ec470e699d0c20ca97ab:
Parameters:
ROACHTEST_arch=arm64
ROACHTEST_cloud=azure
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=2
ROACHTEST_encrypted=false
ROACHTEST_runtimeAssertionsBuild=false
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for azure clusters
Jira issue: CRDB-41823