
roachtest: restore/tpce/400GB/aws/nodes=8/cpus=8 failed [cpu imbalance] #106140

Closed
cockroach-teamcity opened this issue Jul 5, 2023 · 13 comments


cockroach-teamcity commented Jul 5, 2023

roachtest.restore/tpce/400GB/aws/nodes=8/cpus=8 failed with artifacts on master @ 34699bb9c1557fce449e08a68cd259efec94926f:

(monitor.go:137).Wait: monitor failure: monitor task failed: output in run_081411.580746404_n1_cockroach-sql-insecu: ./cockroach sql --insecure -e "RESTORE  FROM LATEST IN 's3://cockroach-fixtures-us-east-2/backups/tpc-e/customers=25000/v22.2.0/inc-count=48?AUTH=implicit' AS OF SYSTEM TIME '2022-12-21T02:00:00Z' " returned: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/restore/tpce/400GB/aws/nodes=8/cpus=8/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-29424

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery labels Jul 5, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Jul 5, 2023

dt commented Jul 5, 2023

Looks like a loss of availability at KV level; the eventual error is:

splitting key /Table/134/1/200000053952836/0/NULL: replica unavailable: (n4,s4):3 unable to serve request to r160:/Table/134/1/20000005{3567542/0/NULL-8067028/0/NULL} [(n6,s6):1, (n2,s2):2, (n4,s4):3, next=4, gen=114, sticky=1688548477.833576746,0]: closed timestamp: 1688544989.016941461,0 (2023-07-05 08:16:29); raft status: {"id":"3","term":10,"vote":"0","commit":27,"lead":"1","raftState":"StateFollower","applied":27,"progress":{},"leadtransferee":"0"}: have been waiting 92.00s for slow proposal AddSSTable [/Table/134/1/200000053760300/0,/Table/134/1/200000053952838/0/NULL)

Prior to that we have alerts in the test logs for requests.slow.raft, ranges.unavailable, liveness.heartbeatfailures, etc., mostly for n4, so it looks like something broke below the restore layer.

@blathers-crl blathers-crl bot added the T-kv KV Team label Jul 5, 2023
@cockroach-teamcity

roachtest.restore/tpce/400GB/aws/nodes=8/cpus=8 failed with artifacts on master @ dbe8511fae8fca21562fdde5c240b1f7d06ef582:

(test_runner.go:1075).runTest: test timed out (1h0m0s)
(monitor.go:137).Wait: monitor failure: monitor task failed: output in run_075409.433432438_n1_cockroach-sql-insecu: ./cockroach sql --insecure -e "RESTORE  FROM LATEST IN 's3://cockroach-fixtures-us-east-2/backups/tpc-e/customers=25000/v22.2.0/inc-count=48?AUTH=implicit' AS OF SYSTEM TIME '2022-12-21T02:00:00Z' " returned: COMMAND_PROBLEM: exit status 137
test artifacts and logs in: /artifacts/restore/tpce/400GB/aws/nodes=8/cpus=8/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@irfansharif

Just noting that failures here don't have the "operating on a closed handle" logging present in the mutex leak scenario we saw over at #106078.


blathers-crl bot commented Jul 18, 2023

cc @cockroachdb/replication

@exalate-issue-sync exalate-issue-sync bot assigned tbg and unassigned pav-kv Jul 19, 2023

tbg commented Jul 19, 2023

Pattern-matching on #106078 (comment) that this was the kvflowcontrol deadlock.

@tbg tbg closed this as completed Jul 19, 2023
@andrewbaptist andrewbaptist added the X-duplicate Closed as a duplicate of another issue. label Jul 19, 2023
@irfansharif

@tbg: BTW, see the comment above (#106140 (comment)); I don't think it was the same deadlock issue, since we didn't have the logging that would incriminate the mutex leak.

@tbg tbg reopened this Jul 26, 2023

tbg commented Jul 26, 2023

Thanks, I had missed that comment.

tbg added a commit to tbg/cockroach that referenced this issue Jul 27, 2023
It doesn't rise up to "ERROR" level and also doesn't need to log a large stack
trace.

Seen while looking into cockroachdb#106140.

Epic: none
Release note: None

tbg commented Jul 27, 2023

Artifacts here in case TC nukes them.

No deadlock as far as I can tell, but CPU overload (I don't think storage is overloaded here, though storage also gets very slow as a result of CPU starvation).

We routinely see the very slow raft readies, quota pool stalls, and eventually things get slow enough to trip the breaker.
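(For context on those quota pool stalls: each range's leaseholder limits in-flight proposal bytes via a quota pool that is only refilled as the slowest follower catches up, so a stalled follower makes new proposals block on acquisition. A minimal sketch of that shape of mechanism is below; the names `quotaPool`, `acquire`, and `release` are hypothetical, not CockroachDB's actual implementation.)

```go
package kvquota

import "sync"

// quotaPool is a simplified raft proposal quota pool: each pending proposal
// acquires quota bytes up front, and the quota is only returned once all
// followers have applied the corresponding entry. If a follower stalls,
// acquire blocks, which is what the "acquired raft proposal quota after
// 1m15s" message in the details below reflects.
type quotaPool struct {
	mu    sync.Mutex
	cond  *sync.Cond
	avail int64
}

func newQuotaPool(capacity int64) *quotaPool {
	qp := &quotaPool{avail: capacity}
	qp.cond = sync.NewCond(&qp.mu)
	return qp
}

// acquire blocks until `bytes` of quota are available.
func (qp *quotaPool) acquire(bytes int64) {
	qp.mu.Lock()
	defer qp.mu.Unlock()
	for qp.avail < bytes {
		qp.cond.Wait()
	}
	qp.avail -= bytes
}

// release returns quota after the slowest follower has applied the entry.
func (qp *quotaPool) release(bytes int64) {
	qp.mu.Lock()
	defer qp.mu.Unlock()
	qp.avail += bytes
	qp.cond.Broadcast()
}
```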

Details
W230705 08:17:40.855810 324 2@kv/kvserver/replica_proposal.go:601 ⋮ [T1,n4,s4,r627/7:‹/Table/136/1/200000{195…-205…}›,raft] 457  applied lease after ~16.00s replication lag, client traffic may have been delayed [lease=repl=(n4,s4):7 seq=12 start=1688545014.806605127,0 epo=1 pro=1688545044.851723320,0 prev=repl=(n4,s4):7VOTER_INCOMING seq=11 start=1688544941.269651978,0 epo=1 pro=1688544942.684314815,0]
I230705 08:17:42.874261 29157 util/quotapool/config.go:78 ⋮ [T1,n6,s6,r457/1:‹/Table/134{-/1/2000…}›] 1182  acquired raft proposal quota after 1m15.402339763s
I230705 08:24:43.716988 335 kv/kvserver/replica_raft.go:1572 ⋮ [T1,n4,s4,r662/5:‹/Table/139/1/20000009{31…-62…}›,raft] 5261  slow non-blocking raft commit: commit-wait 816.827647ms sem 56.144µs
I230705 08:24:43.722804 335 kv/kvserver/replica_raft.go:1572 ⋮ [T1,n4,s4,r133/3:‹/Table/128/1{2-3}›,raft] 5262  slow non-blocking raft commit: commit-wait 815.690246ms sem 162ns
I230705 08:24:43.748556 335 kv/kvserver/replica_raft.go:1572 ⋮ [T1,n4,s4,r156/3:‹/Table/125/{7-8}›,raft] 5263  slow non-blocking raft commit: commit-wait 1.0777s sem 296ns
I230705 08:24:43.751876 335 kv/kvserver/replica_raft.go:1572 ⋮ [T1,n4,s4,r504/3:‹/Table/134/1/2000000{89…-94…}›,raft] 5264  slow non-blocking raft commit: commit-wait 819.498153ms sem 546ns
I230705 08:24:43.753311 335 kv/kvserver/replica_raft.go:1572 ⋮ [T1,n4,s4,r173/3:‹/Table/128/2{2-3}›,raft] 5265  slow non-blocking raft commit: commit-wait 1.08303961s sem 286ns
I230705 08:24:43.754279 335 kv/kvserver/replica_raft.go:1572 ⋮ [T1,n4,s4,r660/5:{-},raft] 5266  slow non-blocking raft commit: commit-wait 914.826898ms sem 303ns
I230705 08:24:43.755601 335 kv/kvserver/replica_raft.go:1572 ⋮ [T1,n4,s4,r350/3:‹/Table/149/1{2-3}›,raft] 5267  slow non-blocking raft commit: commit-wait 1.054286003s sem 581ns
I230705 08:24:43.756880 335 kv/kvserver/replica_raft.go:1572 ⋮ [T1,n4,s4,r84/6:‹/Table/1{08/2-10}›,raft] 5268  slow non-blocking raft commit: commit-wait 1.055054285s sem 22.333µs
I230705 08:24:43.783312 335 kv/kvserver/replica_raft.go:1572 ⋮ [T1,n4,s4,r964/4:{-},raft] 5269  slow non-blocking raft commit: commit-wait 883.723402ms sem 306ns
I230705 08:24:44.019769 335 kv/kvserver/replica_raft.go:1572 ⋮ [T1,n4,s4,r275/3:‹/Table/144/1{7-8}›,raft] 5270  slow non-blocking raft commit: commit-wait 815.0203ms sem 26.551µs
I230705 08:24:44.024406 335 kv/kvserver/replica_raft.go:1572 ⋮ [T1,n4,s4,r28/6:{-},raft] 5271  slow non-blocking raft commit: commit-wait 983.73795ms sem 1.635308ms

E230705 08:17:40.231109 24317 kv/kvserver/replica_circuit_breaker.go:175 ⋮ [T1,n4,s4,r160/3:‹/Table/134/1/20000005{35…-80…}›] 1973 breaker: tripped with error: replica unavailable: (n4,s4):3 unable to serve request to r160:‹/Table/134/1/20000005{3567542/0/NULL-8067028/0/NULL}› [(n6,s6):1, (n2,s2):2, (n4,s4):3, next=4, gen=114, sticky=1688548477.833576746,0]: closed timestamp: 1688544983.065160897,0 (2023-07-05 08:16:23); raft status: {"id":"3","term":10,"vote":"0","commit":14,"lead":"1","raftState":"StateFollower","applied":14,"progress":{},"leadtransferee":"0"}: have been waiting 62.50s for slow proposal AddSSTable [/Table/134/1/‹200000053567543›/‹0›,/Table/134/1/‹200000053760299›/‹0›/‹NULL›)
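(The breaker error above comes from the per-replica circuit breaker giving up on a proposal that has been pending past a timeout. A rough sketch of that kind of check, with hypothetical names and a made-up threshold constant rather than the real replica_circuit_breaker.go logic:)

```go
package kvbreaker

import (
	"errors"
	"time"
)

// slowProposalThreshold is how long a proposal may stay pending before the
// replica is treated as unavailable. Hypothetical constant for illustration.
const slowProposalThreshold = 60 * time.Second

type proposal struct {
	cmd        string
	proposedAt time.Time
}

// checkBreaker returns a "replica unavailable"-style error if any pending
// proposal has been waiting past the threshold, mimicking the
// "have been waiting 62.50s for slow proposal AddSSTable" message above.
func checkBreaker(pending []proposal, now time.Time) error {
	for _, p := range pending {
		if wait := now.Sub(p.proposedAt); wait > slowProposalThreshold {
			return errors.New("replica unavailable: have been waiting " +
				wait.Round(time.Second).String() + " for slow proposal " + p.cmd)
		}
	}
	return nil
}
```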

Looking at the tsdump, we see that n4 is near 100% CPU throughout. A significant fraction (25%) of that is in runtime.memmove; however, this profile is from the debug.zip, i.e. it doesn't capture the high initial CPU usage.^1 But memory usage on n4 actually never went up like on the other nodes. We also see that n4 never pushes as many writes to disk as the other nodes. n4 looks truly limping here, and it's unclear why.

screencapture-localhost-8080-2023-07-27-12_48_52.pdf

[screenshot]

n4 has a comparable number of goroutines, but its runnable goroutines per CPU is sky-high.

[screenshot]

n4 also has much higher CPU usage per replica:

[screenshot]

so there does seem to be a load imbalance. The store rebalancer on s4 confirms this:

I230705 08:14:38.079763 341 13@kv/kvserver/store_rebalancer.go:595 ⋮ [T1,n4,s4,store-rebalancer,obj=‹cpu›] 323 ran out of leases worth transferring and load ‹(queries-per-second=54.1 cpu-per-second=1.9s)› is still above desired threshold ‹(queries-per-second=78.7 cpu-per-second=365ms)›; considering load-based [...]

This goes on and on. It moves a bunch of leases over, but it doesn't make a dent. All the while, the node is really overloaded, to the point where, I think, it's generally unable to laterally rebalance replicas away.

I never saw it print anything about which ranges are driving the load. I really wish it had.
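(For reference, the lease-shedding phase of the store rebalancer is roughly: while the store's load is above the target, transfer a hot lease to a colder store; only after running out of lease candidates does it consider moving replicas, which matches the log above. A minimal sketch with hypothetical types, not the actual store_rebalancer.go code:)

```go
package rebalance

// hotRange describes a lease this store holds and its CPU cost.
type hotRange struct {
	rangeID int64
	cpu     float64 // cpu-per-second attributed to this range's lease
}

// shedLeases transfers leases, hottest first, until the store's load drops
// below the target or there are no candidates left. The returned remaining
// load corresponds to the "ran out of leases worth transferring and load ...
// is still above desired threshold" message above. leases is assumed to be
// sorted hottest-first.
func shedLeases(load, target float64, leases []hotRange, transfer func(hotRange) bool) float64 {
	for _, r := range leases {
		if load <= target {
			break
		}
		if transfer(r) { // e.g. move the lease to the coldest valid store
			load -= r.cpu
		}
	}
	return load
}
```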

Footnotes

  1. The auto CPU profiler could've helped here, but it wasn't on: https://github.com/cockroachdb/cockroach/issues/97699

craig bot pushed a commit that referenced this issue Jul 27, 2023
107690: kvserver: downgrade "unable to transfer lease" log r=kvoli a=tbg

It doesn't rise up to "ERROR" level and also doesn't need to log a large stack
trace.

Seen while looking into #106140.

Epic: none
Release note: None


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>

tbg commented Jul 27, 2023

I filed #107694 to make it easier to figure out what happened here in the future.

@tbg tbg changed the title from "roachtest: restore/tpce/400GB/aws/nodes=8/cpus=8 failed" to "roachtest: restore/tpce/400GB/aws/nodes=8/cpus=8 failed [cpu imbalance]" Jul 27, 2023
@tbg tbg added A-kv-distribution Relating to rebalancing and leasing. T-kv KV Team and removed T-kv-replication labels Jul 27, 2023

tbg commented Jul 27, 2023

Moving this over to KV. I'll need some help from the distribution folks to figure out what may have gone wrong here.


kvoli commented Jul 27, 2023

@tbg and I paired to look at this failure. There's not enough info from the logs/tsdump to determine where the CPU is coming from that is keeping the node hot.

We could determine the CPU usage was at least partially coming from replication, as the node didn't have many (if any) leases, yet the replica CPU remained high.

func (r *Replica) MeasureRaftCPUNanos(start time.Duration) {
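(The idea behind that hook is simply to time the raft work done on behalf of each replica and accumulate it into a per-range counter, which is what feeds the CPU-per-replica graph above. A simplified, hypothetical version of that bookkeeping, using wall-clock time instead of goroutine running time:)

```go
package raftcpu

import (
	"sync"
	"time"
)

// raftCPU accumulates approximate CPU nanos spent on raft work per range.
// Wall-clock time is used here for simplicity; a real implementation would
// measure goroutine running time, which excludes time spent blocked.
type raftCPU struct {
	mu    sync.Mutex
	nanos map[int64]int64 // rangeID -> accumulated nanos
}

func newRaftCPU() *raftCPU {
	return &raftCPU{nanos: make(map[int64]int64)}
}

// measure runs handleReady for the given range and charges its duration.
func (c *raftCPU) measure(rangeID int64, handleReady func()) {
	start := time.Now()
	handleReady()
	elapsed := time.Since(start).Nanoseconds()

	c.mu.Lock()
	c.nanos[rangeID] += elapsed
	c.mu.Unlock()
}
```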

The next step seems to be enabling the CPU autoprofiler for this test, or better yet all non-performance roachtests, then waiting for another failure and inspecting the profiles (#97699).
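(As a rough, hypothetical illustration of what such an autoprofiler does, not CockroachDB's actual implementation, and Linux-only because of Getrusage: periodically sample the process's normalized CPU utilization and dump a pprof CPU profile whenever it crosses a threshold such as 70%.)

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"syscall"
	"time"
)

// cpuSeconds returns total user+system CPU time consumed by this process.
func cpuSeconds() float64 {
	var ru syscall.Rusage
	_ = syscall.Getrusage(syscall.RUSAGE_SELF, &ru)
	user := time.Duration(ru.Utime.Sec)*time.Second + time.Duration(ru.Utime.Usec)*time.Microsecond
	sys := time.Duration(ru.Stime.Sec)*time.Second + time.Duration(ru.Stime.Usec)*time.Microsecond
	return (user + sys).Seconds()
}

func main() {
	const (
		sample     = 10 * time.Second // utilization sampling interval
		threshold  = 0.70             // normalized utilization that triggers a dump
		profileFor = 10 * time.Second // length of each captured profile
	)
	numCPU := float64(runtime.NumCPU())
	prev := cpuSeconds()
	for range time.Tick(sample) {
		cur := cpuSeconds()
		util := (cur - prev) / sample.Seconds() / numCPU
		prev = cur
		if util < threshold {
			continue
		}
		f, err := os.Create(fmt.Sprintf("cpuprof.%d.pb.gz", time.Now().Unix()))
		if err != nil {
			continue
		}
		if err := pprof.StartCPUProfile(f); err == nil {
			time.Sleep(profileFor)
			pprof.StopCPUProfile()
		}
		_ = f.Close()
	}
}
```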

@irfansharif irfansharif added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jul 28, 2023

kvoli commented Aug 2, 2023

Stressed on master with #108048 for 20 runs... no hits :(

I'll bump this to 100 runs.

kvoli added a commit to kvoli/cockroach that referenced this issue Aug 3, 2023
`restore/tpce/400GB/aws/nodes=8/cpus=8` recently failed with high CPU usage on one node (cockroachdb#106140). The high CPU usage was attributed to replication; however, we were unable to determine where exactly.

Enable the automatic CPU profiler on `restore/*` roachtests, which will
collect profiles once normalized CPU utilization exceeds 70%.

Note this is supposed to be a temporary addition, which will be subsumed
by cockroachdb#97699.

Informs: cockroachdb#106140

Release note: None
@kvoli kvoli assigned kvoli and unassigned kvoli Aug 3, 2023

kvoli commented Aug 3, 2023

This only failed on July 5th and July 6th. Doesn't repro after 40 runs. Closing.

@kvoli kvoli closed this as completed Aug 3, 2023