
roachtest: restore/tpce/400GB/aws/nodes=8/cpus=8 failed [cpu imbalance] #106140

Closed
cockroach-teamcity opened this issue Jul 5, 2023 · 13 comments


cockroach-teamcity commented Jul 5, 2023

roachtest.restore/tpce/400GB/aws/nodes=8/cpus=8 failed with artifacts on master @ 34699bb9c1557fce449e08a68cd259efec94926f:

(monitor.go:137).Wait: monitor failure: monitor task failed: output in run_081411.580746404_n1_cockroach-sql-insecu: ./cockroach sql --insecure -e "RESTORE  FROM LATEST IN 's3://cockroach-fixtures-us-east-2/backups/tpc-e/customers=25000/v22.2.0/inc-count=48?AUTH=implicit' AS OF SYSTEM TIME '2022-12-21T02:00:00Z' " returned: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/restore/tpce/400GB/aws/nodes=8/cpus=8/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-29424

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery labels Jul 5, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Jul 5, 2023

dt commented Jul 5, 2023

Looks like a loss of availability at KV level; the eventual error is:

splitting key /Table/134/1/200000053952836/0/NULL: replica unavailable: (n4,s4):3 unable to serve request to r160:/Table/134/1/20000005{3567542/0/NULL-8067028/0/NULL} [(n6,s6):1, (n2,s2):2, (n4,s4):3, next=4, gen=114, sticky=1688548477.833576746,0]: closed timestamp: 1688544989.016941461,0 (2023-07-05 08:16:29); raft status: {"id":"3","term":10,"vote":"0","commit":27,"lead":"1","raftState":"StateFollower","applied":27,"progress":{},"leadtransferee":"0"}: have been waiting 92.00s for slow proposal AddSSTable [/Table/134/1/200000053760300/0,/Table/134/1/200000053952838/0/NULL)

Prior to that we have alerts in the test logs for requests.slow.raft, ranges.unavailable, liveness.heartbeatfailures, etc., mostly for n4, so it looks like something broke below the restore layer.

@blathers-crl blathers-crl bot added the T-kv KV Team label Jul 5, 2023
@cockroach-teamcity

roachtest.restore/tpce/400GB/aws/nodes=8/cpus=8 failed with artifacts on master @ dbe8511fae8fca21562fdde5c240b1f7d06ef582:

(test_runner.go:1075).runTest: test timed out (1h0m0s)
(monitor.go:137).Wait: monitor failure: monitor task failed: output in run_075409.433432438_n1_cockroach-sql-insecu: ./cockroach sql --insecure -e "RESTORE  FROM LATEST IN 's3://cockroach-fixtures-us-east-2/backups/tpc-e/customers=25000/v22.2.0/inc-count=48?AUTH=implicit' AS OF SYSTEM TIME '2022-12-21T02:00:00Z' " returned: COMMAND_PROBLEM: exit status 137
test artifacts and logs in: /artifacts/restore/tpce/400GB/aws/nodes=8/cpus=8/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@irfansharif

Just noting that failures here don't have the "operating on a closed handle" logging present in the mutex leak scenario we saw over at #106078.


blathers-crl bot commented Jul 18, 2023

cc @cockroachdb/replication

@exalate-issue-sync exalate-issue-sync bot assigned tbg and unassigned pav-kv Jul 19, 2023

tbg commented Jul 19, 2023

Pattern-matching on #106078 (comment) that this was the kvflowcontrol deadlock.

@tbg tbg closed this as completed Jul 19, 2023
@andrewbaptist andrewbaptist added the X-duplicate Closed as a duplicate of another issue. label Jul 19, 2023
@irfansharif

@tbg: BTW, see the comment above (#106140 (comment)); I don't think it was the same deadlock issue, since we didn't have the logging that would incriminate the mutex leak.

@tbg tbg reopened this Jul 26, 2023

tbg commented Jul 26, 2023

Thanks, I had missed that comment.

tbg added a commit to tbg/cockroach that referenced this issue Jul 27, 2023
It doesn't rise up to "ERROR" level and also doesn't need to log a large stack
trace.

Seen while looking into cockroachdb#106140.

Epic: none
Release note: None

tbg commented Jul 27, 2023

Artifacts here in case TC nukes them.

No deadlock as far as I can tell, but CPU overload (I don't think storage is overloaded here, though storage also gets very slow as a result of CPU starvation).

We routinely see the very slow raft readies, quota pool stalls, and eventually things get slow enough to trip the breaker.
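(For context on those quota pool stalls: each range's leaseholder limits in-flight proposal bytes via a quota pool that is only refilled as the slowest follower catches up, so a stalled follower makes new proposals block on acquisition. A minimal sketch of that shape of mechanism is below; the names `quotaPool`, `acquire`, and `release` are hypothetical, not CockroachDB's actual implementation.)

```go
package kvquota

import "sync"

// quotaPool is a simplified raft proposal quota pool: each pending proposal
// acquires quota bytes up front, and the quota is only returned once all
// followers have applied the corresponding entry. If a follower stalls,
// acquire blocks, which is what the "acquired raft proposal quota after
// 1m15s" message in the details below reflects.
type quotaPool struct {
	mu    sync.Mutex
	cond  *sync.Cond
	avail int64
}

func newQuotaPool(capacity int64) *quotaPool {
	qp := &quotaPool{avail: capacity}
	qp.cond = sync.NewCond(&qp.mu)
	return qp
}

// acquire blocks until `bytes` of quota are available.
func (qp *quotaPool) acquire(bytes int64) {
	qp.mu.Lock()
	defer qp.mu.Unlock()
	for qp.avail < bytes {
		qp.cond.Wait()
	}
	qp.avail -= bytes
}

// release returns quota after the slowest follower has applied the entry.
func (qp *quotaPool) release(bytes int64) {
	qp.mu.Lock()
	defer qp.mu.Unlock()
	qp.avail += bytes
	qp.cond.Broadcast()
}
```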

Details
W230705 08:17:40.855810 324 2@kv/kvserver/replica_proposal.go:601 ⋮ [T1,n4,s4,r627/7:‹/Table/136/1/200000{195…-205…}›,raft] 457  applied lease after ~16.00s replication lag, client traffic may have been delayed [lease=repl=(n4,s4):7 seq=12 start=1688545014.806605127,0 epo=1 pro=1688545044.851723320,0 prev=repl=(n4,s4):7VOTER_INCOMING seq=11 start=1688544941.269651978,0 epo=1 pro=1688544942.684314815,0]
I230705 08:17:42.874261 29157 util/quotapool/config.go:78 ⋮ [T1,n6,s6,r457/1:‹/Table/134{-/1/2000…}›] 1182  acquired raft proposal quota after 1m15.402339763s
I230705 08:24:43.716988 335 kv/kvserver/replica_raft.go:1572 ⋮ [T1,n4,s4,r662/5:‹/Table/139/1/20000009{31…-62…}›,raft] 5261  slow non-blocking raft commit: commit-wait 816.827647ms sem 56.144µs
I230705 08:24:43.722804 335 kv/kvserver/replica_raft.go:1572 ⋮ [T1,n4,s4,r133/3:‹/Table/128/1{2-3}›,raft] 5262  slow non-blocking raft commit: commit-wait 815.690246ms sem 162ns
I230705 08:24:43.748556 335 kv/kvserver/replica_raft.go:1572 ⋮ [T1,n4,s4,r156/3:‹/Table/125/{7-8}›,raft] 5263  slow non-blocking raft commit: commit-wait 1.0777s sem 296ns
I230705 08:24:43.751876 335 kv/kvserver/replica_raft.go:1572 ⋮ [T1,n4,s4,r504/3:‹/Table/134/1/2000000{89…-94…}›,raft] 5264  slow non-blocking raft commit: commit-wait 819.498153ms sem 546ns
I230705 08:24:43.753311 335 kv/kvserver/replica_raft.go:1572 ⋮ [T1,n4,s4,r173/3:‹/Table/128/2{2-3}›,raft] 5265  slow non-blocking raft commit: commit-wait 1.08303961s sem 286ns
I230705 08:24:43.754279 335 kv/kvserver/replica_raft.go:1572 ⋮ [T1,n4,s4,r660/5:{-},raft] 5266  slow non-blocking raft commit: commit-wait 914.826898ms sem 303ns
I230705 08:24:43.755601 335 kv/kvserver/replica_raft.go:1572 ⋮ [T1,n4,s4,r350/3:‹/Table/149/1{2-3}›,raft] 5267  slow non-blocking raft commit: commit-wait 1.054286003s sem 581ns
I230705 08:24:43.756880 335 kv/kvserver/replica_raft.go:1572 ⋮ [T1,n4,s4,r84/6:‹/Table/1{08/2-10}›,raft] 5268  slow non-blocking raft commit: commit-wait 1.055054285s sem 22.333µs
I230705 08:24:43.783312 335 kv/kvserver/replica_raft.go:1572 ⋮ [T1,n4,s4,r964/4:{-},raft] 5269  slow non-blocking raft commit: commit-wait 883.723402ms sem 306ns
I230705 08:24:44.019769 335 kv/kvserver/replica_raft.go:1572 ⋮ [T1,n4,s4,r275/3:‹/Table/144/1{7-8}›,raft] 5270  slow non-blocking raft commit: commit-wait 815.0203ms sem 26.551µs
I230705 08:24:44.024406 335 kv/kvserver/replica_raft.go:1572 ⋮ [T1,n4,s4,r28/6:{-},raft] 5271  slow non-blocking raft commit: commit-wait 983.73795ms sem 1.635308ms

E230705 08:17:40.231109 24317 kv/kvserver/replica_circuit_breaker.go:175 ⋮ [T1,n4,s4,r160/3:‹/Table/134/1/20000005{35…-80…}›] 1973 breaker: tripped with error: replica unavailable: (n4,s4):3 unable to serve request to r160:‹/Table/134/1/20000005{3567542/0/NULL-8067028/0/NULL}› [(n6,s6):1, (n2,s2):2, (n4,s4):3, next=4, gen=114, sticky=1688548477.833576746,0]: closed timestamp: 1688544983.065160897,0 (2023-07-05 08:16:23); raft status: {"id":"3","term":10,"vote":"0","commit":14,"lead":"1","raftState":"StateFollower","applied":14,"progress":{},"leadtransferee":"0"}: have been waiting 62.50s for slow proposal AddSSTable [/Table/134/1/‹200000053567543›/‹0›,/Table/134/1/‹200000053760299›/‹0›/‹NULL›)
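(The breaker error above comes from the per-replica circuit breaker giving up on a proposal that has been pending past a timeout. A rough sketch of that kind of check, with hypothetical names and a made-up threshold constant rather than the real replica_circuit_breaker.go logic:)

```go
package kvbreaker

import (
	"errors"
	"time"
)

// slowProposalThreshold is how long a proposal may stay pending before the
// replica is treated as unavailable. Hypothetical constant for illustration.
const slowProposalThreshold = 60 * time.Second

type proposal struct {
	cmd        string
	proposedAt time.Time
}

// checkBreaker returns a "replica unavailable"-style error if any pending
// proposal has been waiting past the threshold, mimicking the
// "have been waiting 62.50s for slow proposal AddSSTable" message above.
func checkBreaker(pending []proposal, now time.Time) error {
	for _, p := range pending {
		if wait := now.Sub(p.proposedAt); wait > slowProposalThreshold {
			return errors.New("replica unavailable: have been waiting " +
				wait.Round(time.Second).String() + " for slow proposal " + p.cmd)
		}
	}
	return nil
}
```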

Looking at the tsdump, we see that n4 is near 100% CPU throughout. A significant fraction (25%) of that is in runtime.memmove; however, this profile is from the debug.zip, i.e. it doesn't capture the high initial CPU usage.^1 But memory usage on n4 actually never went up like on the other nodes. We also see that n4 never pushes as many writes to disk as the other nodes. n4 looks truly limping here, and it's unclear why.

screencapture-localhost-8080-2023-07-27-12_48_52.pdf

[screenshot]

n4 has a comparable number of goroutines, but its runnable goroutines per CPU is sky-high.

[screenshot]

n4 also has much higher CPU usage per replica:

[screenshot]

so there does seem to be a load imbalance. The store rebalancer on s4 confirms this:

I230705 08:14:38.079763 341 13@kv/kvserver/store_rebalancer.go:595 ⋮ [T1,n4,s4,store-rebalancer,obj=‹cpu›] 323 ran out of leases worth transferring and load ‹(queries-per-second=54.1 cpu-per-second=1.9s)› is still above desired threshold ‹(queries-per-second=78.7 cpu-per-second=365ms)›; considering load-based [...]

This goes on and on. It moves a bunch of leases over, but it doesn't make a dent. All the while, the node is really overloaded, to the point where, I think, it's generally unable to laterally rebalance replicas away.

I never saw it print anything about which ranges are driving the load. I really wish it had.
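(For reference, the lease-shedding phase of the store rebalancer is roughly: while the store's load is above the target, transfer a hot lease to a colder store; only after running out of lease candidates does it consider moving replicas, which matches the log above. A minimal sketch with hypothetical types, not the actual store_rebalancer.go code:)

```go
package rebalance

// hotRange describes a lease this store holds and its CPU cost.
type hotRange struct {
	rangeID int64
	cpu     float64 // cpu-per-second attributed to this range's lease
}

// shedLeases transfers leases, hottest first, until the store's load drops
// below the target or there are no candidates left. The returned remaining
// load corresponds to the "ran out of leases worth transferring and load ...
// is still above desired threshold" message above. leases is assumed to be
// sorted hottest-first.
func shedLeases(load, target float64, leases []hotRange, transfer func(hotRange) bool) float64 {
	for _, r := range leases {
		if load <= target {
			break
		}
		if transfer(r) { // e.g. move the lease to the coldest valid store
			load -= r.cpu
		}
	}
	return load
}
```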

Footnotes

  1. The auto CPU profiler could've helped here, but it wasn't on: https://github.com/cockroachdb/cockroach/issues/97699

craig bot pushed a commit that referenced this issue Jul 27, 2023
107690: kvserver: downgrade "unable to transfer lease" log r=kvoli a=tbg

It doesn't rise up to "ERROR" level and also doesn't need to log a large stack
trace.

Seen while looking into #106140.

Epic: none
Release note: None


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>

tbg commented Jul 27, 2023

I filed #107694 to make it easier to figure out what happened here in the future.

@tbg tbg changed the title from "roachtest: restore/tpce/400GB/aws/nodes=8/cpus=8 failed" to "roachtest: restore/tpce/400GB/aws/nodes=8/cpus=8 failed [cpu imbalance]" Jul 27, 2023
@tbg tbg added A-kv-distribution Relating to rebalancing and leasing. T-kv KV Team and removed T-kv-replication labels Jul 27, 2023

tbg commented Jul 27, 2023

Moving this over to KV. I'll need some help from the distribution folks to figure out what may have gone wrong here.


kvoli commented Jul 27, 2023

@tbg and I paired to look at this failure. There's not enough info from the logs/tsdump to determine where the CPU is coming from that is keeping the node hot.

We could determine the CPU usage was at least partially coming from replication, as the node didn't have many (if any) leases, yet the replica CPU remained high.

func (r *Replica) MeasureRaftCPUNanos(start time.Duration) {
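(The idea behind that hook is simply to time the raft work done on behalf of each replica and accumulate it into a per-range counter, which is what feeds the CPU-per-replica graph above. A simplified, hypothetical version of that bookkeeping, using wall-clock time instead of goroutine running time:)

```go
package raftcpu

import (
	"sync"
	"time"
)

// raftCPU accumulates approximate CPU nanos spent on raft work per range.
// Wall-clock time is used here for simplicity; a real implementation would
// measure goroutine running time, which excludes time spent blocked.
type raftCPU struct {
	mu    sync.Mutex
	nanos map[int64]int64 // rangeID -> accumulated nanos
}

func newRaftCPU() *raftCPU {
	return &raftCPU{nanos: make(map[int64]int64)}
}

// measure runs handleReady for the given range and charges its duration.
func (c *raftCPU) measure(rangeID int64, handleReady func()) {
	start := time.Now()
	handleReady()
	elapsed := time.Since(start).Nanoseconds()

	c.mu.Lock()
	c.nanos[rangeID] += elapsed
	c.mu.Unlock()
}
```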

The next step seems to be enabling the CPU autoprofiler for this test, or better yet all non-performance roachtests, then waiting for another failure and inspecting the profiles (#97699).
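(As a rough, hypothetical illustration of what such an autoprofiler does, not CockroachDB's actual implementation, and Linux-only because of Getrusage: periodically sample the process's normalized CPU utilization and dump a pprof CPU profile whenever it crosses a threshold such as 70%.)

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"syscall"
	"time"
)

// cpuSeconds returns total user+system CPU time consumed by this process.
func cpuSeconds() float64 {
	var ru syscall.Rusage
	_ = syscall.Getrusage(syscall.RUSAGE_SELF, &ru)
	user := time.Duration(ru.Utime.Sec)*time.Second + time.Duration(ru.Utime.Usec)*time.Microsecond
	sys := time.Duration(ru.Stime.Sec)*time.Second + time.Duration(ru.Stime.Usec)*time.Microsecond
	return (user + sys).Seconds()
}

func main() {
	const (
		sample     = 10 * time.Second // utilization sampling interval
		threshold  = 0.70             // normalized utilization that triggers a dump
		profileFor = 10 * time.Second // length of each captured profile
	)
	numCPU := float64(runtime.NumCPU())
	prev := cpuSeconds()
	for range time.Tick(sample) {
		cur := cpuSeconds()
		util := (cur - prev) / sample.Seconds() / numCPU
		prev = cur
		if util < threshold {
			continue
		}
		f, err := os.Create(fmt.Sprintf("cpuprof.%d.pb.gz", time.Now().Unix()))
		if err != nil {
			continue
		}
		if err := pprof.StartCPUProfile(f); err == nil {
			time.Sleep(profileFor)
			pprof.StopCPUProfile()
		}
		_ = f.Close()
	}
}
```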

@irfansharif irfansharif added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jul 28, 2023

kvoli commented Aug 2, 2023

Stressed on master with #108048 for 20 runs... no hits :(

I'll bump this to 100 runs.

kvoli added a commit to kvoli/cockroach that referenced this issue Aug 3, 2023
`restore/tpce/400GB/aws/nodes=8/cpus=8` recently failed with high CPU usage on one node (cockroachdb#106140). The high CPU usage was attributed to replication; however, we were unable to determine where exactly.

Enable the automatic CPU profiler on `restore/*` roachtests, which will
collect profiles once normalized CPU utilization exceeds 70%.

Note this is supposed to be a temporary addition, which will be subsumed
by cockroachdb#97699.

Informs: cockroachdb#106140

Release note: None
@kvoli kvoli assigned kvoli and unassigned kvoli Aug 3, 2023

kvoli commented Aug 3, 2023

This only failed on July 5th and July 6th. Doesn't repro after 40 runs. Closing.

@kvoli kvoli closed this as completed Aug 3, 2023