
roachtest: disk-stalled/wal-failover/among-stores failed #122364

Closed
cockroach-teamcity opened this issue Apr 15, 2024 · 11 comments · Fixed by #123057
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-storage Storage Team
@cockroach-teamcity
Member

cockroach-teamcity commented Apr 15, 2024

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 67547d7724f8a52646e2e8ecb3ca48b923957d90:

(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.015021925s at 2024-04-15T14:05:00Z
(cluster.go:2348).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

Jira issue: CRDB-37835

cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-storage Storage Team labels Apr 15, 2024
cockroach-teamcity added this to the 24.1 milestone Apr 15, 2024
@jbowens
Collaborator

jbowens commented Apr 15, 2024

[Screenshot 2024-04-15 at 11:11:05 AM] [Screenshot 2024-04-15 at 11:11:14 AM]

https://grafana.testeng.crdb.io/d/StorageAvKxELVz/storage?from=1713186843554&to=1713190919751&var-cluster=teamcity-14842113-1713160087-91-n4cpu16&orgId=1

Hrm; SQL latency spiked but batch commits and raft log commit latency were unperturbed. Did we get blocked on disk somewhere outside the WAL write?

@sumeerbhola
Collaborator

There is some perturbation in storage.batch-commit.commit-wait.duration. Unfortunately we don't have a good histogram metric at the storage layer for the "observed" log fsync latency. I'll see if we can add something via recordQueue.pop, since it represents an observed fsync.

[Screenshot 2024-04-16 at 11:09:22 AM]
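
A minimal sketch of the kind of metric being proposed, assuming a hypothetical hook at the point where recordQueue.pop observes a completed sync (the names below are illustrative, not Pebble's actual internals):

```go
package walmetrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// ObservedSyncLatency is a histogram of the latency observed by WAL writes
// waiting for a sync, measured from enqueue to the moment the sync completes
// (e.g. when the record is popped from the record queue).
var ObservedSyncLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "storage_wal_observed_sync_latency_seconds",
	Help:    "Observed WAL write+sync latency, including failover effects.",
	Buckets: prometheus.ExponentialBuckets(0.0001, 2, 20), // 100µs .. ~52s
})

// onSyncConfirmed would be called from the hypothetical hook when a queued
// record's sync completes; enqueueTime is captured when the record is queued.
func onSyncConfirmed(enqueueTime time.Time) {
	ObservedSyncLatency.Observe(time.Since(enqueueTime).Seconds())
}
```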

@sumeerbhola
Collaborator

@jbowens, related to our in-person discussion: the p100 of raft log commit latency is not high.
[Screenshot 2024-04-16 at 12:40:20 PM]

@sumeerbhola
Collaborator

unexpectedly high p99.99 latency 1.015021925s at 2024-04-15T14:05:00Z

It is admission control (AC) queueing. Runnable goroutines also spiked, so there was a real backlog of goroutines trying to do work. And it is not due to elastic work.

[Screenshot 2024-04-16 at 12:52:48 PM] [Screenshot 2024-04-16 at 12:54:09 PM] [Screenshot 2024-04-16 at 12:54:48 PM]

@sumeerbhola
Collaborator

n1 was servicing 1400 requests/s and had 108 total slots (concurrency) before this latency increase, so KV requests needed to complete quickly for that total slot count to be sufficient. I suspect what happened is that some requests got unlucky, started missing the block cache + page cache, and got stuck, which made the total slot count insufficient. AC responded by ratcheting up the slots to reduce queueing, since the observed runnable goroutine count was low. n1 went from 108 slots to 2118 slots in 30s (with 1ms ticks, the maximum total slot increase we can see in 30s is 30,000). This was not fast enough to avoid some AC queueing.

Then we briefly went into a regime where AC had over-admitted, either (a) while the disk stall was still ongoing, since some later requests may not have stalled on reads, or (b) because goroutines got unblocked after the disk stall ended (the timeline is not granular enough to differentiate between (a) and (b)); that is why we briefly see the high runnable goroutine count. I suspect it is (b), since the workload rate is not high enough to cause CPU overload by itself.

The pmax we see for KV request processing latency is significantly higher than the pmax for AC queueing, which supports the above.
[Screenshot 2024-04-16 at 1:28:02 PM]
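
To make the ratcheting dynamic concrete, here is a toy sketch of a slot controller that grows the total slot count each tick while work is queued and runnable goroutines stay low; the names and the increment policy are illustrative, not CockroachDB's actual admission control code:

```go
package acsketch

// slotController is a toy model of the total-slot adjustment described above:
// every tick (e.g. 1ms) it grows the slot count while work is queued and the
// runnable-goroutine count stays below a threshold, and shrinks it otherwise.
type slotController struct {
	totalSlots     int
	minSlots       int
	maxSlots       int
	maxGrowPerTick int
}

// tick adjusts totalSlots from the current queue length and the observed
// runnable goroutines per CPU. With 1ms ticks and maxGrowPerTick=1, the slot
// count can grow by at most 1,000 per second, i.e. 30,000 in 30s, which is
// the bound mentioned in the comment above.
func (c *slotController) tick(queueLen int, runnablePerCPU, runnableThreshold float64) {
	switch {
	case queueLen > 0 && runnablePerCPU < runnableThreshold:
		// Queueing with spare CPU: ratchet slots up.
		c.totalSlots += c.maxGrowPerTick
		if c.totalSlots > c.maxSlots {
			c.totalSlots = c.maxSlots
		}
	case runnablePerCPU >= runnableThreshold:
		// CPU is saturated: back off to avoid over-admission.
		if c.totalSlots > c.minSlots {
			c.totalSlots--
		}
	}
}
```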

@sumeerbhola
Collaborator

sys_host_disk_read_count is consistently 0, so the previous theory is flawed. And the p100 of raftlog commit latency spiked to 4.95s as shown in #122364 (comment). So it is probable that something in CockroachDB or Pebble is observing the disk stall.

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Apr 19, 2024
This exposes the wal.FailoverStats.FailoverWriteAndSyncLatency Pebble
histogram metric, which is the effective latency observed by WAL writes
that need to sync. This allows us to ignore the WAL fsync latency metric
when trying to diagnose higher user-observed latency when WAL failover
is configured.

Informs cockroachdb#122364

Epic: none

Release note: None
@sumeerbhola
Collaborator

s1 is doing a flush every ~5.5s, so that is the lifetime of a WAL. A failoverWriter can switch 10 times before no more switches are possible, so one switch every 0.55s is the minimum possible switch interval, i.e. a maximum rate of ~1.8 switches/s. The switch rate at the 14:05 disk stall was higher, but still < 0.05 switches/s. The primary is seeing a slightly higher rate (s/s) of usage during this stall, and it looks like there was a flush ongoing, hence the high flush utilization. There was no write stall. So it still does not look like a Pebble-level issue. #122700 should help next time this occurs. I am removing the release-blocker label.

[Screenshot 2024-04-19 at 2:32:50 PM]
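
For reference, the arithmetic behind the switch-rate bound above, as a back-of-the-envelope sketch (the ~5.5s WAL lifetime and the 10-switch budget are taken from the observations in this comment):

```go
package main

import "fmt"

func main() {
	const (
		walLifetimeSec    = 5.5 // one flush, and hence one new WAL, every ~5.5s
		maxSwitchesPerWAL = 10  // failoverWriter switch budget per WAL
	)
	minSwitchInterval := walLifetimeSec / maxSwitchesPerWAL // 0.55s
	maxSwitchRate := 1 / minSwitchInterval                  // ~1.8 switches/s

	fmt.Printf("min interval between switches: %.2fs\n", minSwitchInterval)
	fmt.Printf("max possible switch rate: %.1f switches/s\n", maxSwitchRate)
	// The rate observed during the 14:05 stall was < 0.05 switches/s,
	// far below this bound.
}
```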

sumeerbhola removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Apr 19, 2024
@cockroach-teamcity
Member Author

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ e0068814dfcb4f975a53b79b5546b5fb85c0f927:

(disk_stall.go:173).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.211424557s at 2024-04-21T14:58:00Z
(cluster.go:2351).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!

@sumeerbhola
Collaborator

See #122772 (comment) for a duplicate.

@sumeerbhola
Collaborator

@jbowens noticed that encryption-at-rest was on, and it turns out it was on for all 4 of the failures. He also noticed that we hold FileRegistry.mu when writing, and the same mutex is needed for reading.

Since block cache misses do happen during the stall, and those get serviced by the page cache (there are 0 disk reads), we are likely opening files for read that would succeed if the open did not block on FileRegistry.mu.
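
As a rough illustration of the failure mode (a toy stand-in, not the actual FileRegistry code): if a registry persists every update while holding the same mutex its readers take, a stalled write blocks reads whose answers are already in memory.

```go
package regsketch

import (
	"os"
	"sync"
)

// registry is a toy stand-in for a file registry that maps filenames to
// per-file metadata (e.g. encryption settings) and persists updates to disk.
type registry struct {
	mu      sync.Mutex
	entries map[string][]byte
	f       *os.File // persistent registry file
}

// Lookup is on the read path: opening a file for reading needs its entry.
// Because it takes the same mutex that Update holds across disk IO, a stalled
// write in Update blocks Lookup even though the answer is already in memory.
func (r *registry) Lookup(name string) ([]byte, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	e, ok := r.entries[name]
	return e, ok
}

// Update mutates the in-memory state and then writes the new state to disk
// while still holding mu. If the disk write stalls, every Lookup stalls too.
func (r *registry) Update(name string, entry []byte) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.entries[name] = entry
	_, err := r.f.Write(encode(r.entries)) // write IO under the readers' lock
	return err
}

// encode is a placeholder for serializing the registry state.
func encode(map[string][]byte) []byte { return nil }
```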

@jbowens
Collaborator

jbowens commented Apr 24, 2024

Linking to #98051

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Apr 25, 2024
Both fs.FileRegistry and engineccl.DataKeyManager held an internal mutex
when updating their state, which included write IO to update persistent
state. This would block readers of the state, specifically file reads
that need a file registry entry and data key in order to successfully
open and read a file.

Blocking these reads due to slow or stalled write IO is not desirable,
since the read could succeed if the relevant data is in the page cache.
Specifically, with the new WAL failover feature, we expect the store
to keep functioning when disk writes are temporarily stalled, since the
WAL can fail over. This expectation is not met if essential reads block
on non-essential writes that are stalled.

This PR changes the locking in the FileRegistry and DataKeyManager to
prevent writes from interfering with concurrent reads.

Epic: none

Fixes: cockroachdb#98051
Fixes: cockroachdb#122364

Release note: None
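
A sketch of the general direction described in this commit message, again using the toy registry rather than the real FileRegistry/DataKeyManager code: mutate and snapshot the state under a short critical section, then do the write IO outside any lock that readers need.

```go
package regsketchfixed

import (
	"os"
	"sync"
)

// registry keeps its in-memory state under an RWMutex that readers take, and
// uses a separate writeMu to serialize writers and the disk IO they perform.
type registry struct {
	writeMu sync.Mutex // serializes writers and their disk IO
	mu      struct {
		sync.RWMutex
		entries map[string][]byte
	}
	f *os.File // persistent registry file
}

// Lookup only takes the read lock over the in-memory map; it never waits on
// disk IO performed by a concurrent Update.
func (r *registry) Lookup(name string) ([]byte, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	e, ok := r.mu.entries[name]
	return e, ok
}

// Update serializes with other writers, applies the change and snapshots the
// state under a short write lock, then persists the snapshot without holding
// any lock readers need. A stalled disk write now only delays other writers.
func (r *registry) Update(name string, entry []byte) error {
	r.writeMu.Lock()
	defer r.writeMu.Unlock()

	r.mu.Lock()
	r.mu.entries[name] = entry
	snapshot := encode(r.mu.entries)
	r.mu.Unlock()

	_, err := r.f.Write(snapshot) // write IO outside the readers' lock
	return err
}

// encode is a placeholder for serializing the registry state.
func encode(map[string][]byte) []byte { return nil }
```
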
craig bot pushed a commit that referenced this issue Apr 25, 2024
123057: fs,engineccl: allow reads to continue when writes are stalled r=raduberinde,jbowens a=sumeerbhola

Both fs.FileRegistry and engineccl.DataKeyManager held an internal mutex when updating their state, which included write IO to update persistent state. This would block readers of the state, specifically file reads that need a file registry entry and data key in order to successfully open and read a file.

Blocking these reads due to slow or stalled write IO is not desirable, since the read could succeed if the relevant data is in the page cache. Specifically, with the new WAL failover feature, we expect the store to keep functioning when disk writes are temporarily stalled, since the WAL can fail over. This expectation is not met if essential reads block on non-essential writes that are stalled.

This PR changes the locking in the FileRegistry and DataKeyManager to prevent writes from interfering with concurrent reads.

Epic: none

Fixes: #98051
Fixes: #122364

Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
craig bot closed this as completed in 6518185 Apr 25, 2024
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Apr 26, 2024
Both fs.FileRegistry and engineccl.DataKeyManager held an internal mutex
when updating their state, which included write IO to update persistent
state. This would block readers of the state, specifically file reads
that need a file registry entry and data key in order to successfully
open and read a file.

Blocking these reads due to slow or stalled write IO is not desirable,
since the read could succeed if the relevant data is in the page cache.
Specifically, with the new WAL failover feature, we expect the store
to keep functioning when disk writes are temporarily stalled, since the
WAL can fail over. This expectation is not met if essential reads block
on non-essential writes that are stalled.

This PR changes the locking in the FileRegistry and DataKeyManager to
prevent writes from interfering with concurrent reads.

Epic: none

Fixes: cockroachdb#98051
Fixes: cockroachdb#122364

Release note: None
craig bot pushed a commit that referenced this issue Apr 29, 2024
122700: kvserver: add storage.wal.failover.write_and_sync.latency r=jbowens a=sumeerbhola

This exposes the wal.FailoverStats.FailoverWriteAndSyncLatency Pebble histogram metric, which is the effective latency observed by WAL writes that need to sync. This allows us to ignore the WAL fsync latency metric when trying to diagnose higher user-observed latency when WAL failover is configured.

Informs #122364

Note the ~150ms p100 for this latency, compared to the fsync latency, when running `disk-stalled/wal-failover/among-stores`:
<img width="784" alt="Screenshot 2024-04-19 at 2 07 00 PM" src="https://github.com/cockroachdb/cockroach/assets/54990988/a74c2b6e-6d7b-40fa-856c-d7d1be5bc224">

Epic: none

Release note: None

123224: profiler: allow 0 value for CPU threshold r=yuzefovich a=yuzefovich

In 9036430 we added positive int validation for `server.cpu_profile.cpu_usage_combined_threshold`, but a value of zero also seems reasonable in some cases (the comment on the setting also mentions it), so this commit switches to non-negative int validation instead.

Epic: None

Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue Apr 29, 2024
This exposes the wal.FailoverStats.FailoverWriteAndSyncLatency Pebble
histogram metric, which is the effective latency observed by WAL writes
that need to sync. This allows us to ignore the WAL fsync latency metric
when trying to diagnose higher user-observed latency when WAL failover
is configured.

Informs #122364

Epic: none

Release note: None