Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

engineccl: filesystem mutex held during I/O #98051

Closed
jbowens opened this issue Mar 6, 2023 · 0 comments · Fixed by #123057
Closed

engineccl: filesystem mutex held during I/O #98051

jbowens opened this issue Mar 6, 2023 · 0 comments · Fixed by #123057
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-storage Storage Team

Comments

@jbowens
Copy link
Collaborator

jbowens commented Mar 6, 2023

The encryption-at-rest file registry holds a mutex when making changes to the registry. It holds this mutex while writing to and fsyncing the registry. This prevents concurrent file opens from proceeding (eg, from table cache misses or readahead).

Related to #80946.

Jira issue: CRDB-25051

Epic: CRDB-26603

@jbowens jbowens added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels Mar 6, 2023
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Apr 25, 2024
Both fs.FileRegistry and engineccl.DataKeyManager held an internal mutex
when updating their state, that included write IO to to update persistent
state. This would block readers of the state, specifically file reads
that need a file registry entry and data key for the file to successfully
open and read a file.

Blocking these reads due to slow or stalled write IO is not desirable,
since the read could succeed if the relevant data is in the page cache.
Specifically, with the new WAL failover feature, we expect the store
to keep functioning when disk writes are temporarily stalled, since the
WAL can failover. This expectation is not met if essential reads block on
non-essential writes that are stalled.

This PR changes the locking in the FileRegistry and DataKeyManager to
prevent writes from interfering with concurrent reads.

Epic: none

Fixes: cockroachdb#98051
Fixes: cockroachdb#122364

Release note: None
craig bot pushed a commit that referenced this issue Apr 25, 2024
123057: fs,engineccl: allow reads to continue when writes are stalled r=raduberinde,jbowens a=sumeerbhola

Both fs.FileRegistry and engineccl.DataKeyManager held an internal mutex when updating their state, that included write IO to to update persistent state. This would block readers of the state, specifically file reads that need a file registry entry and data key for the file to successfully open and read a file.

Blocking these reads due to slow or stalled write IO is not desirable, since the read could succeed if the relevant data is in the page cache. Specifically, with the new WAL failover feature, we expect the store to keep functioning when disk writes are temporarily stalled, since the WAL can failover. This expectation is not met if essential reads block on non-essential writes that are stalled.

This PR changes the locking in the FileRegistry and DataKeyManager to prevent writes from interfering with concurrent reads.

Epic: none

Fixes: #98051
Fixes: #122364

Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
@craig craig bot closed this as completed in 6518185 Apr 25, 2024
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Apr 26, 2024
Both fs.FileRegistry and engineccl.DataKeyManager held an internal mutex
when updating their state, that included write IO to to update persistent
state. This would block readers of the state, specifically file reads
that need a file registry entry and data key for the file to successfully
open and read a file.

Blocking these reads due to slow or stalled write IO is not desirable,
since the read could succeed if the relevant data is in the page cache.
Specifically, with the new WAL failover feature, we expect the store
to keep functioning when disk writes are temporarily stalled, since the
WAL can failover. This expectation is not met if essential reads block on
non-essential writes that are stalled.

This PR changes the locking in the FileRegistry and DataKeyManager to
prevent writes from interfering with concurrent reads.

Epic: none

Fixes: cockroachdb#98051
Fixes: cockroachdb#122364

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-storage Storage Team
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

1 participant