
roachtest: disk-stalled/wal-failover/among-stores failed #122364

Closed
cockroach-teamcity opened this issue Apr 15, 2024 · 11 comments · Fixed by #123057
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-storage Storage Team
@cockroach-teamcity
Member

cockroach-teamcity commented Apr 15, 2024

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 67547d7724f8a52646e2e8ecb3ca48b923957d90:

(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.015021925s at 2024-04-15T14:05:00Z
(cluster.go:2348).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

Jira issue: CRDB-37835

cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-storage Storage Team labels Apr 15, 2024
cockroach-teamcity added this to the 24.1 milestone Apr 15, 2024
@jbowens
Collaborator

jbowens commented Apr 15, 2024

[Screenshot 2024-04-15 at 11:11:05 AM] [Screenshot 2024-04-15 at 11:11:14 AM]

https://grafana.testeng.crdb.io/d/StorageAvKxELVz/storage?from=1713186843554&to=1713190919751&var-cluster=teamcity-14842113-1713160087-91-n4cpu16&orgId=1

Hrm; SQL latency spiked but batch commits and raft log commit latency were unperturbed. Did we get blocked on disk somewhere outside the WAL write?

@sumeerbhola
Collaborator

There is some perturbation in storage.batch-commit.commit-wait.duration. Unfortunately we don't have a good histogram metric at the storage layer for the "observed" log fsync latency. I'll see if we can add something via recordQueue.pop, since it represents an observed fsync.

[Screenshot 2024-04-16 at 11:09:22 AM]
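
A minimal sketch of the kind of metric being proposed, assuming a hypothetical hook at the point where recordQueue.pop observes a completed sync (the names below are illustrative, not Pebble's actual internals):

```go
package walmetrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// ObservedSyncLatency is a histogram of the latency observed by WAL writes
// waiting for a sync, measured from enqueue to the moment the sync completes
// (e.g. when the record is popped from the record queue).
var ObservedSyncLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "storage_wal_observed_sync_latency_seconds",
	Help:    "Observed WAL write+sync latency, including failover effects.",
	Buckets: prometheus.ExponentialBuckets(0.0001, 2, 20), // 100µs .. ~52s
})

// onSyncConfirmed would be called from the hypothetical hook when a queued
// record's sync completes; enqueueTime is captured when the record is queued.
func onSyncConfirmed(enqueueTime time.Time) {
	ObservedSyncLatency.Observe(time.Since(enqueueTime).Seconds())
}
```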

@sumeerbhola
Collaborator

@jbowens, related to our in-person discussion: the p100 of raft log commit latency is not high.
[Screenshot 2024-04-16 at 12:40:20 PM]

@sumeerbhola
Collaborator

unexpectedly high p99.99 latency 1.015021925s at 2024-04-15T14:05:00Z

It is admission control (AC) queueing. Runnable goroutines also spiked, so there was a real backlog of goroutines trying to do work. And it is not due to elastic work.

[Screenshot 2024-04-16 at 12:52:48 PM] [Screenshot 2024-04-16 at 12:54:09 PM] [Screenshot 2024-04-16 at 12:54:48 PM]

@sumeerbhola
Collaborator

n1 was servicing 1400 requests/s and had 108 total slots (concurrency) before this latency increase, so KV requests needed to complete quickly for that total slot count to be sufficient. I suspect what happened is that some requests got unlucky, started missing the block cache + page cache, and got stuck, which made the total slot count insufficient. AC responded by ratcheting up the slots to reduce queueing, since the observed runnable goroutine count was low. n1 went from 108 slots to 2118 slots in 30s (with 1ms ticks, the maximum total slot increase we can see in 30s is 30,000). This was not fast enough to avoid some AC queueing.

Then we briefly went into a regime where AC had over-admitted, either (a) while the disk stall was still ongoing, since some later requests may not have stalled on reads, or (b) because goroutines got unblocked after the disk stall ended (the timeline is not granular enough to differentiate between (a) and (b)); that is why we briefly see the high runnable goroutine count. I suspect it is (b), since the workload rate is not high enough to cause CPU overload by itself.

The pmax we see for KV request processing latency is significantly higher than the pmax for AC queueing, which supports the above.
[Screenshot 2024-04-16 at 1:28:02 PM]
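
To make the ratcheting dynamic concrete, here is a toy sketch of a slot controller that grows the total slot count each tick while work is queued and runnable goroutines stay low; the names and the increment policy are illustrative, not CockroachDB's actual admission control code:

```go
package acsketch

// slotController is a toy model of the total-slot adjustment described above:
// every tick (e.g. 1ms) it grows the slot count while work is queued and the
// runnable-goroutine count stays below a threshold, and shrinks it otherwise.
type slotController struct {
	totalSlots     int
	minSlots       int
	maxSlots       int
	maxGrowPerTick int
}

// tick adjusts totalSlots from the current queue length and the observed
// runnable goroutines per CPU. With 1ms ticks and maxGrowPerTick=1, the slot
// count can grow by at most 1,000 per second, i.e. 30,000 in 30s, which is
// the bound mentioned in the comment above.
func (c *slotController) tick(queueLen int, runnablePerCPU, runnableThreshold float64) {
	switch {
	case queueLen > 0 && runnablePerCPU < runnableThreshold:
		// Queueing with spare CPU: ratchet slots up.
		c.totalSlots += c.maxGrowPerTick
		if c.totalSlots > c.maxSlots {
			c.totalSlots = c.maxSlots
		}
	case runnablePerCPU >= runnableThreshold:
		// CPU is saturated: back off to avoid over-admission.
		if c.totalSlots > c.minSlots {
			c.totalSlots--
		}
	}
}
```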

@sumeerbhola
Collaborator

sys_host_disk_read_count is consistently 0, so the previous theory is flawed. And the p100 of raftlog commit latency spiked to 4.95s as shown in #122364 (comment). So it is probable that something in CockroachDB or Pebble is observing the disk stall.

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Apr 19, 2024
This exposes the wal.FailoverStats.FailoverWriteAndSyncLatency Pebble
histogram metric, which is the effective latency observed by WAL writes
that need to sync. This allows us to ignore the WAL fsync latency metric
when trying to diagnose higher user-observed latency when WAL failover
is configured.

Informs cockroachdb#122364

Epic: none

Release note: None
@sumeerbhola
Collaborator

s1 is doing a flush every ~5.5s, so that is the lifetime of a WAL. A failoverWriter can switch 10 times before no more switches are possible, so one switch every 0.55s is the minimum possible switch interval, i.e. a maximum rate of ~1.8 switches/s. The switch rate at the 14:05 disk stall was higher, but still < 0.05 switches/s. The primary is seeing a slightly higher rate (s/s) of usage during this stall, and it looks like there was a flush ongoing, hence the high flush utilization. There was no write stall. So it still does not look like a Pebble-level issue. #122700 should help next time this occurs. I am removing the release-blocker label.

[Screenshot 2024-04-19 at 2:32:50 PM]
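
For reference, the arithmetic behind the switch-rate bound above, as a back-of-the-envelope sketch (the ~5.5s WAL lifetime and the 10-switch budget are taken from the observations in this comment):

```go
package main

import "fmt"

func main() {
	const (
		walLifetimeSec    = 5.5 // one flush, and hence one new WAL, every ~5.5s
		maxSwitchesPerWAL = 10  // failoverWriter switch budget per WAL
	)
	minSwitchInterval := walLifetimeSec / maxSwitchesPerWAL // 0.55s
	maxSwitchRate := 1 / minSwitchInterval                  // ~1.8 switches/s

	fmt.Printf("min interval between switches: %.2fs\n", minSwitchInterval)
	fmt.Printf("max possible switch rate: %.1f switches/s\n", maxSwitchRate)
	// The rate observed during the 14:05 stall was < 0.05 switches/s,
	// far below this bound.
}
```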

sumeerbhola removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Apr 19, 2024
@cockroach-teamcity
Member Author

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ e0068814dfcb4f975a53b79b5546b5fb85c0f927:

(disk_stall.go:173).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.211424557s at 2024-04-21T14:58:00Z
(cluster.go:2351).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!

@sumeerbhola
Collaborator

See #122772 (comment) for a duplicate.

@sumeerbhola
Collaborator

@jbowens noticed that encryption-at-rest was on, and it turns out it was on for all 4 of the failures. He also noticed that we hold FileRegistry.mu when writing, and the same mutex is needed for reading.

Since block cache misses do happen during the stall, and those get serviced by the page cache (there are 0 disk reads), we are likely opening files for read that would succeed if the open did not block on FileRegistry.mu.
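
As a rough illustration of the failure mode (a toy stand-in, not the actual FileRegistry code): if a registry persists every update while holding the same mutex its readers take, a stalled write blocks reads whose answers are already in memory.

```go
package regsketch

import (
	"os"
	"sync"
)

// registry is a toy stand-in for a file registry that maps filenames to
// per-file metadata (e.g. encryption settings) and persists updates to disk.
type registry struct {
	mu      sync.Mutex
	entries map[string][]byte
	f       *os.File // persistent registry file
}

// Lookup is on the read path: opening a file for reading needs its entry.
// Because it takes the same mutex that Update holds across disk IO, a stalled
// write in Update blocks Lookup even though the answer is already in memory.
func (r *registry) Lookup(name string) ([]byte, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	e, ok := r.entries[name]
	return e, ok
}

// Update mutates the in-memory state and then writes the new state to disk
// while still holding mu. If the disk write stalls, every Lookup stalls too.
func (r *registry) Update(name string, entry []byte) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.entries[name] = entry
	_, err := r.f.Write(encode(r.entries)) // write IO under the readers' lock
	return err
}

// encode is a placeholder for serializing the registry state.
func encode(map[string][]byte) []byte { return nil }
```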

@jbowens
Collaborator

jbowens commented Apr 24, 2024

Linking to #98051

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Apr 25, 2024
Both fs.FileRegistry and engineccl.DataKeyManager held an internal mutex
when updating their state, which included write IO to update persistent
state. This would block readers of the state, specifically file reads
that need a file registry entry and data key in order to successfully
open and read a file.

Blocking these reads due to slow or stalled write IO is not desirable,
since the read could succeed if the relevant data is in the page cache.
Specifically, with the new WAL failover feature, we expect the store
to keep functioning when disk writes are temporarily stalled, since the
WAL can fail over. This expectation is not met if essential reads block
on non-essential writes that are stalled.

This PR changes the locking in the FileRegistry and DataKeyManager to
prevent writes from interfering with concurrent reads.

Epic: none

Fixes: cockroachdb#98051
Fixes: cockroachdb#122364

Release note: None
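
A sketch of the general direction described in this commit message, again using the toy registry rather than the real FileRegistry/DataKeyManager code: mutate and snapshot the state under a short critical section, then do the write IO outside any lock that readers need.

```go
package regsketchfixed

import (
	"os"
	"sync"
)

// registry keeps its in-memory state under an RWMutex that readers take, and
// uses a separate writeMu to serialize writers and the disk IO they perform.
type registry struct {
	writeMu sync.Mutex // serializes writers and their disk IO
	mu      struct {
		sync.RWMutex
		entries map[string][]byte
	}
	f *os.File // persistent registry file
}

// Lookup only takes the read lock over the in-memory map; it never waits on
// disk IO performed by a concurrent Update.
func (r *registry) Lookup(name string) ([]byte, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	e, ok := r.mu.entries[name]
	return e, ok
}

// Update serializes with other writers, applies the change and snapshots the
// state under a short write lock, then persists the snapshot without holding
// any lock readers need. A stalled disk write now only delays other writers.
func (r *registry) Update(name string, entry []byte) error {
	r.writeMu.Lock()
	defer r.writeMu.Unlock()

	r.mu.Lock()
	r.mu.entries[name] = entry
	snapshot := encode(r.mu.entries)
	r.mu.Unlock()

	_, err := r.f.Write(snapshot) // write IO outside the readers' lock
	return err
}

// encode is a placeholder for serializing the registry state.
func encode(map[string][]byte) []byte { return nil }
```
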
craig bot pushed a commit that referenced this issue Apr 25, 2024
123057: fs,engineccl: allow reads to continue when writes are stalled r=raduberinde,jbowens a=sumeerbhola

Both fs.FileRegistry and engineccl.DataKeyManager held an internal mutex when updating their state, which included write IO to update persistent state. This would block readers of the state, specifically file reads that need a file registry entry and data key in order to successfully open and read a file.

Blocking these reads due to slow or stalled write IO is not desirable, since the read could succeed if the relevant data is in the page cache. Specifically, with the new WAL failover feature, we expect the store to keep functioning when disk writes are temporarily stalled, since the WAL can fail over. This expectation is not met if essential reads block on non-essential writes that are stalled.

This PR changes the locking in the FileRegistry and DataKeyManager to prevent writes from interfering with concurrent reads.

Epic: none

Fixes: #98051
Fixes: #122364

Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
craig bot closed this as completed in 6518185 Apr 25, 2024
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Apr 26, 2024
Both fs.FileRegistry and engineccl.DataKeyManager held an internal mutex
when updating their state, which included write IO to update persistent
state. This would block readers of the state, specifically file reads
that need a file registry entry and data key in order to successfully
open and read a file.

Blocking these reads due to slow or stalled write IO is not desirable,
since the read could succeed if the relevant data is in the page cache.
Specifically, with the new WAL failover feature, we expect the store
to keep functioning when disk writes are temporarily stalled, since the
WAL can fail over. This expectation is not met if essential reads block
on non-essential writes that are stalled.

This PR changes the locking in the FileRegistry and DataKeyManager to
prevent writes from interfering with concurrent reads.

Epic: none

Fixes: cockroachdb#98051
Fixes: cockroachdb#122364

Release note: None
craig bot pushed a commit that referenced this issue Apr 29, 2024
122700: kvserver: add storage.wal.failover.write_and_sync.latency r=jbowens a=sumeerbhola

This exposes the wal.FailoverStats.FailoverWriteAndSyncLatency Pebble histogram metric, which is the effective latency observed by WAL writes that need to sync. This allows us to ignore the WAL fsync latency metric when trying to diagnose higher user-observed latency when WAL failover is configured.

Informs #122364

Note the ~150ms p100 for this latency, compared to the fsync latency, when running `disk-stalled/wal-failover/among-stores`:
<img width="784" alt="Screenshot 2024-04-19 at 2 07 00 PM" src="https://github.com/cockroachdb/cockroach/assets/54990988/a74c2b6e-6d7b-40fa-856c-d7d1be5bc224">

Epic: none

Release note: None

123224: profiler: allow 0 value for CPU threshold r=yuzefovich a=yuzefovich

In 9036430 we added positive int validation for `server.cpu_profile.cpu_usage_combined_threshold`, but a value of zero also seems reasonable in some cases (the comment on the setting also mentions it), so this commit switches to non-negative int validation instead.

Epic: None

Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue Apr 29, 2024
This exposes the wal.FailoverStats.FailoverWriteAndSyncLatency Pebble
histogram metric, which is the effective latency observed by WAL writes
that need to sync. This allows us to ignore the WAL fsync latency metric
when trying to diagnose higher user-observed latency when WAL failover
is configured.

Informs #122364

Epic: none

Release note: None