roachtest: restore/tpce/400GB/aws/nodes=8/cpus=8 failed [cpu imbalance] #106140
Looks like a loss of availability at KV level; the eventual error is:
Prior to that, we have alerts in the test logs for
roachtest.restore/tpce/400GB/aws/nodes=8/cpus=8 failed with artifacts on master @ dbe8511fae8fca21562fdde5c240b1f7d06ef582:
Parameters:
Just noting that failures here don't have the "operating on a closed handle" logging present in the mutex leak scenario we saw over at #106078.
cc @cockroachdb/replication
Pattern-matching on #106078 (comment) suggests that this was the kvflowcontrol deadlock.
@tbg: BTW, see the comment above (#106140 (comment)); I don't think it was the same deadlock issue, since we didn't have the logging that would incriminate the mutex leak.
Thanks, I had missed that comment.
Artifacts here in case TC nukes them. No deadlock as far as I can tell, but CPU overload (I don't think storage is overloaded here, though storage also gets very slow as a result of CPU starvation). We routinely see very slow raft readies and quota pool stalls, and eventually things get slow enough to trip the breaker.
Looking at the tsdump, we see that n4 is near 100% CPU throughout. A significant fraction (25%) of that is shown in the attached screenshot (screencapture-localhost-8080-2023-07-27-12_48_52.pdf). n4 has a comparable number of goroutines, but its runnable-per-CPU is sky high. n4 also has a much larger CPU per replica, so there does seem to be a load imbalance. The store rebalancer on s4 confirms this:

This goes on and on. It moves a bunch of leases over, but it doesn't make a dent. All the while, the node is really overloaded, to the point where (I think) it's generally unable to laterally rebalance replicas away. I never saw it print anything about which ranges are driving the load; I really wish it had.
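For context on the "runnable per CPU" signal mentioned above: it is roughly the number of goroutines waiting for a CPU divided by GOMAXPROCS. The sketch below is only illustrative and is not CockroachDB's implementation; the Go runtime doesn't expose a runnable-goroutine count directly, so the sampler here is a hypothetical placeholder.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// sampleRunnableGoroutines is a hypothetical stand-in for a scheduler
// sampler. runtime.NumGoroutine counts *all* goroutines, not just runnable
// ones, so a real implementation would need scheduler internals or an
// exported runtime metric.
func sampleRunnableGoroutines() int {
	return runtime.NumGoroutine()
}

func main() {
	for range time.Tick(time.Second) {
		runnable := sampleRunnableGoroutines()
		perCPU := float64(runnable) / float64(runtime.GOMAXPROCS(0))
		// A sustained high value here is the kind of "sky high" signal
		// described for n4 above.
		fmt.Printf("runnable goroutines per CPU ~ %.1f\n", perCPU)
	}
}
```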
107690: kvserver: downgrade "unable to transfer lease" log r=kvoli a=tbg
It doesn't rise up to "ERROR" level and also doesn't need to log a large stack trace. Seen while looking into #106140. Epic: none. Release note: None.
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
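A minimal sketch of the kind of change this PR describes, using CockroachDB's util/log package. The wrapper function and message below are made up for illustration; they are not the actual code touched by the PR.

```go
package kvserver

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/log"
)

// reportFailedLeaseTransfer is a hypothetical helper illustrating the
// severity downgrade: a failed lease transfer during rebalancing is noisy
// but expected under overload, so log it below ERROR and without a large
// stack trace.
func reportFailedLeaseTransfer(ctx context.Context, err error) {
	// Before (roughly): log.Errorf(ctx, "unable to transfer lease: %v", err)
	log.Warningf(ctx, "unable to transfer lease: %v", err)
}
```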
I filed #107694 to make it easier to figure out what happened here in the future.
Moving this over to KV. I'll need some help from the distribution folks to figure out what may have gone wrong here.
@tbg and I paired to look at this failure. There's not enough info from the logs/tsdump to determine where the CPU that is keeping the node hot is coming from. We could determine that the CPU usage was at least partially coming from replication, as the node didn't have many (or any) leases yet the replica CPU remained high (see cockroach/pkg/kv/kvserver/replica.go, line 2357 at 1d3c11e).
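To illustrate what "CPU per replica" accounting amounts to, here is a rough sketch. The type and method names are hypothetical; the real accounting lives around the replica.go line referenced above. The idea is simply to attribute the time spent serving each request to the replica that served it, so the store rebalancer can compare replicas by load.

```go
package kvexample

import (
	"sync/atomic"
	"time"
)

// replicaCPU is a hypothetical per-replica counter. CockroachDB's real
// accounting differs; this only sketches the attribute-work-to-the-replica
// idea.
type replicaCPU struct {
	nanos atomic.Int64 // accumulated time spent serving requests
}

// record adds the time spent serving one request. Wall-clock time is used
// here as a rough stand-in for the goroutine's on-CPU time, which is what a
// real measurement would want.
func (c *replicaCPU) record(start time.Time) {
	c.nanos.Add(time.Since(start).Nanoseconds())
}

// utilization returns the fraction of one CPU consumed over the given
// window, assuming the counter was reset at the start of the window.
func (c *replicaCPU) utilization(window time.Duration) float64 {
	return time.Duration(c.nanos.Load()).Seconds() / window.Seconds()
}

// Usage inside a request handler (illustrative):
//
//	start := time.Now()
//	defer repl.cpu.record(start)
```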
Next steps seem like enabling the CPU autoprofiler for this test, or better yet, for all non-performance roachtests, then waiting for another failure and inspecting the profiles (#97699).
Stressed on master with #108048 for 20 runs... no hits :( I'll bump this to 100 runs.
`restore/tpce/400GB/aws/nodes=8/cpus=8` recently failed with high CPU usage on one node (cockroachdb#106140). The high CPU usage was attributed to replication; however, we were unable to determine where exactly. Enable the automatic CPU profiler on `restore/*` roachtests, which will collect profiles once normalized CPU utilization exceeds 70%. Note this is intended to be a temporary addition, which will be subsumed by cockroachdb#97699. Informs: cockroachdb#106140. Release note: None
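For reference, enabling the automatic CPU profiler comes down to setting a cluster setting on the test cluster. The sketch below is an assumption-laden illustration: the setting name `server.cpu_profile.cpu_usage_combined_threshold` and the connection string are assumptions on my part, while the 70% threshold comes from the commit message above; a roachtest would do this through the test harness rather than a standalone program.

```go
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	ctx := context.Background()
	// Illustrative connection string; not how a roachtest connects.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Assumed setting name: start collecting CPU profiles once normalized
	// CPU utilization crosses the 70% threshold mentioned in the commit.
	if _, err := db.ExecContext(ctx,
		"SET CLUSTER SETTING server.cpu_profile.cpu_usage_combined_threshold = 70",
	); err != nil {
		log.Fatal(err)
	}
}
```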
This only failed on July 5th and July 6th, and doesn't repro after 40 runs. Closing.
roachtest.restore/tpce/400GB/aws/nodes=8/cpus=8 failed with artifacts on master @ 34699bb9c1557fce449e08a68cd259efec94926f:
Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=aws, ROACHTEST_cpu=8, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=false, ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Jira issue: CRDB-29424