roachtest: disk-stalled/wal-failover/among-stores failed #122364
@jbowens related to our in-person discussion: the p100 of raftlog commit latency is not high.
This exposes the wal.FailoverStats.FailoverWriteAndSyncLatency pebble histogram metric, which is the effective latency being observed by WAL writes that need to sync. This allows us to ignore the wal fsync latency metric when trying to diagnose higher user observed latency, when WAL failover is configured.

Informs cockroachdb#122364
Epic: none
Release note: None
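To illustrate why an effective write-and-sync latency can stay low while the primary's fsync latency is high, here is a toy model. It is only a sketch under stated assumptions, not Pebble's implementation: the function `effectiveLatency`, the `switchAfter` threshold, and all durations are hypothetical. The model assumes a write first goes to the primary and, if that has not completed within `switchAfter`, is retried on a secondary.

```go
package main

import (
	"fmt"
	"time"
)

// effectiveLatency is a hypothetical toy model of the latency seen by a WAL
// write that needs to sync when failover is configured. primary and secondary
// are the times each device would take to complete the write and sync;
// switchAfter is an assumed delay before the write is retried on the
// secondary. None of these names come from Pebble.
func effectiveLatency(primary, secondary, switchAfter time.Duration) time.Duration {
	// The primary finished before any failover was attempted.
	if primary <= switchAfter {
		return primary
	}
	// Otherwise the write completes on whichever device finishes first.
	viaSecondary := switchAfter + secondary
	if viaSecondary < primary {
		return viaSecondary
	}
	return primary
}

func main() {
	// Assumed numbers: the primary is stalled for 30s, the secondary syncs
	// in 5ms, and the switch happens after 100ms.
	stalled := effectiveLatency(30*time.Second, 5*time.Millisecond, 100*time.Millisecond)
	healthy := effectiveLatency(2*time.Millisecond, 5*time.Millisecond, 100*time.Millisecond)
	fmt.Println("stalled primary:", stalled)
	fmt.Println("healthy primary:", healthy)
}
```

In this model the primary's own fsync latency during the stall is 30s, but the latency observed by the writer is bounded by the switch delay plus the secondary's sync time, which is the kind of distinction the new metric is meant to surface.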
s1 is doing a flush every ~5.5s, so that is the lifetime of a WAL.
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ e0068814dfcb4f975a53b79b5546b5fb85c0f927:
Parameters:
Same failure on other branches
See #122772 (comment) for a duplicate.
@jbowens noticed that encryption-at-rest was on, and it turns out it was on for all 4 of the failures, and that we hold [...]. Since block cache misses do happen during the stall, and those get serviced by the page cache (there are 0 disk reads), we are likely opening files for read that would succeed if the read did not block on [...].
Linking to #98051
Both fs.FileRegistry and engineccl.DataKeyManager held an internal mutex when updating their state, which included write IO to update persistent state. This would block readers of the state, specifically file reads that need a file registry entry and data key for the file to successfully open and read a file. Blocking these reads due to slow or stalled write IO is not desirable, since the read could succeed if the relevant data is in the page cache. Specifically, with the new WAL failover feature, we expect the store to keep functioning when disk writes are temporarily stalled, since the WAL can fail over. This expectation is not met if essential reads block on non-essential writes that are stalled.

This PR changes the locking in the FileRegistry and DataKeyManager to prevent writes from interfering with concurrent reads.

Epic: none
Fixes: cockroachdb#98051
Fixes: cockroachdb#122364
Release note: None
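The locking problem described in this commit message can be sketched with a small, generic example. This is not the actual FileRegistry or DataKeyManager code; the `registry` type, its fields, and the `persist` callback are hypothetical names used only to show one way to keep readers from waiting on write IO: serialize writers with one mutex, and guard the in-memory state with a second lock that is never held across IO.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical stand-in for a structure whose in-memory state
// is also persisted with write IO that may be slow or stalled.
type registry struct {
	writeMu sync.Mutex   // serializes writers, including their IO
	mu      sync.RWMutex // guards entries only; never held across IO
	entries map[string]string
	persist func(name, value string) error // placeholder for the write IO
}

func newRegistry(persist func(name, value string) error) *registry {
	return &registry{entries: make(map[string]string), persist: persist}
}

// Set persists the entry first and only then publishes it to readers.
func (r *registry) Set(name, value string) error {
	r.writeMu.Lock()
	defer r.writeMu.Unlock()
	// A stall here blocks other writers, but not readers.
	if err := r.persist(name, value); err != nil {
		return err
	}
	r.mu.Lock()
	r.entries[name] = value
	r.mu.Unlock()
	return nil
}

// Get consults only in-memory state, so it does not wait for in-flight IO.
func (r *registry) Get(name string) (string, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	v, ok := r.entries[name]
	return v, ok
}

func main() {
	stall := make(chan struct{})
	r := newRegistry(func(name, value string) error {
		if name == "slow" {
			<-stall // simulate a stalled disk write
		}
		return nil
	})
	_ = r.Set("a", "key-1")

	done := make(chan struct{})
	go func() { _ = r.Set("slow", "key-2"); close(done) }()

	// This read completes even if the write above is stalled.
	v, ok := r.Get("a")
	fmt.Println(v, ok)

	close(stall)
	<-done
}
```

The trade-off in this sketch is that writers still queue behind a stalled write, which matches the stated goal of protecting essential reads from non-essential writes rather than making the writes themselves faster.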
123057: fs,engineccl: allow reads to continue when writes are stalled r=raduberinde,jbowens a=sumeerbhola

Both fs.FileRegistry and engineccl.DataKeyManager held an internal mutex when updating their state, which included write IO to update persistent state. This would block readers of the state, specifically file reads that need a file registry entry and data key for the file to successfully open and read a file. Blocking these reads due to slow or stalled write IO is not desirable, since the read could succeed if the relevant data is in the page cache. Specifically, with the new WAL failover feature, we expect the store to keep functioning when disk writes are temporarily stalled, since the WAL can fail over. This expectation is not met if essential reads block on non-essential writes that are stalled.

This PR changes the locking in the FileRegistry and DataKeyManager to prevent writes from interfering with concurrent reads.

Epic: none
Fixes: #98051
Fixes: #122364
Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
122700: kvserver: add storage.wal.failover.write_and_sync.latency r=jbowens a=sumeerbhola

This exposes the wal.FailoverStats.FailoverWriteAndSyncLatency pebble histogram metric, which is the effective latency being observed by WAL writes that need to sync. This allows us to ignore the wal fsync latency metric when trying to diagnose higher user observed latency, when WAL failover is configured.

Informs #122364

Note the ~150ms p100 for this latency, compared to the fsync latency, when running `disk-stalled/wal-failover/among-stores`:

[Screenshot 2024-04-19 at 2 07 00 PM]

Epic: none
Release note: None

123224: profiler: allow 0 value for CPU threshold r=yuzefovich a=yuzefovich

In 9036430 we added positive int validation for `server.cpu_profile.cpu_usage_combined_threshold`, but a value of zero also seems reasonable in some cases (the comment on the setting also mentions it), so this commit switches to non-negative int validation instead.

Epic: None
Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
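The difference between positive and non-negative validation mentioned in 123224 comes down to whether zero is accepted. A minimal sketch follows; the function names `validatePositive` and `validateNonNegative` are hypothetical and are not the validators used in the CockroachDB settings package.

```go
package main

import "fmt"

// validatePositive rejects zero and negative values (hypothetical helper).
func validatePositive(v int64) error {
	if v <= 0 {
		return fmt.Errorf("value %d must be greater than zero", v)
	}
	return nil
}

// validateNonNegative rejects only negative values, so zero is allowed
// (hypothetical helper).
func validateNonNegative(v int64) error {
	if v < 0 {
		return fmt.Errorf("value %d must be non-negative", v)
	}
	return nil
}

func main() {
	// Zero fails the positive check but passes the non-negative check.
	fmt.Println("positive(0):", validatePositive(0))
	fmt.Println("nonNegative(0):", validateNonNegative(0))
}
```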
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 67547d7724f8a52646e2e8ecb3ca48b923957d90:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=16
ROACHTEST_encrypted=true
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=2
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
Jira issue: CRDB-37835