fs,engineccl: allow reads to continue when writes are stalled #123057

Merged
merged 1 commit into cockroachdb:master on Apr 25, 2024

Conversation

sumeerbhola
Collaborator

Both fs.FileRegistry and engineccl.DataKeyManager held an internal mutex while updating their state, which included write IO to update persistent state. This would block readers of the state, specifically file reads that need the file registry entry and data key for a file in order to open and read it.

Blocking these reads due to slow or stalled write IO is not desirable, since the read could succeed if the relevant data is in the page cache. Specifically, with the new WAL failover feature, we expect the store to keep functioning when disk writes are temporarily stalled, since the WAL can fail over. This expectation is not met if essential reads block on non-essential writes that are stalled.

This PR changes the locking in the FileRegistry and DataKeyManager to prevent writes from interfering with concurrent reads.

Epic: none

Fixes: #98051
Fixes: #122364

Release note: None

@sumeerbhola sumeerbhola added the backport-24.1.x Flags PRs that need to be backported to 24.1. label Apr 25, 2024
@sumeerbhola sumeerbhola requested a review from a team as a code owner April 25, 2024 14:06
@cockroach-teamcity
Member

This change is Reviewable

Collaborator

@jbowens jbowens left a comment

:lgtm_strong:

Reviewed 6 of 9 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/storage/fs/file_registry.go line 87 at r1 (raw file):

	writeMu struct {
		syncutil.Mutex
		mu struct {

nit: I find the naming of the mutexes confusing, but I don't really have any suggestions. Is there some way to name these to give more of a sense that writeMu and writeMu.mu are distinct and convey the critical sections they guard? maybe s/writeMu/metaMu/ and s/mu/entriesMu/?

Collaborator Author

@sumeerbhola sumeerbhola left a comment


TFTR!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jbowens)


pkg/storage/fs/file_registry.go line 87 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: I find the naming of the mutexes confusing, but I don't really have any suggestions. Is there some way to name these to give more of a sense that writeMu and writeMu.mu are distinct and convey the critical sections they guard? maybe s/writeMu/metaMu/ and s/mu/entriesMu/?

I am not super happy with the names either. I prefer to keep writeMu since it is the one definitely needed for writes, but not necessary for reads. I initially had mu called readMu, but that was slightly misleading since writers also need it, and one can read without it (by holding writeMu). So I ended up with this.

I tweaked the comment a bit to say writeMu.mu.

Member

@RaduBerinde RaduBerinde left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jbowens and @sumeerbhola)


pkg/storage/fs/file_registry.go line 87 at r1 (raw file):

Previously, sumeerbhola wrote…

I am not super happy with the names either. I prefer to keep writeMu since it is the one definitely needed for writes, but not necessary for reads. I initially had mu called readMu, but that was slightly misleading since writers also need it, and one can read without it (by holding writeMu). So I ended up with this.

I tweaked the comment a bit to say writeMu.mu.

Do we need getters to be able to read from entries while holding just writeMu? If we required readers to just hold the RWMutex, that would make things a lot simpler.

Collaborator Author

@sumeerbhola sumeerbhola left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jbowens and @RaduBerinde)


pkg/storage/fs/file_registry.go line 87 at r1 (raw file):

Do we need getters to be able to read from entries while holding just writeMu?

Hmm, I don't quite understand. I'll write out the line of thinking that motivated this:

We want:

  • a single writer: the writer reads the in-memory state to compute the mutation. Then does IO. Then updates the in-memory state.
  • readers to be able to read the in-memory state while the write is in-progress.

So we need a mutex that will be held by the writer throughout its in-progress write. That is writeMu.
We need a separate mutex for reads to continue while the write is in-progress. That is mu. Naturally, mu also needs to be acquired when mutating the in-memory state.

So one can have the pattern:

  • writeMu to serialize writers
  • mu to read/write the in-memory state.

With writeMu > mu.

But acquiring mu to read during a write is (a) not really needed, and (b) not providing any needed guarantee, since the writer is making decisions about the mutation based on the read without holding mu after doing the read. The guarantee of no staleness in the mutation (and no lost mutations) comes from writeMu. So there isn't any need to hold mu when reading. This also improves code readability, since we don't have to hide every read of the in-memory state during the write inside a closure with a mu acquisition and deferred release.
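
For concreteness, here is a minimal, self-contained sketch of that pattern. The names below (registry, fileEntry, addEntry, getEntry, writeRegistryRecord) are hypothetical illustrations rather than the actual FileRegistry code, and the standard library's sync package stands in for syncutil:

package fsexample

import "sync"

type fileEntry struct{ settings []byte }

type registry struct {
	writeMu struct {
		sync.Mutex // serializes writers; held across the (possibly stalled) IO
		mu struct {
			sync.RWMutex // guards only the in-memory state read by concurrent readers
			entries map[string]*fileEntry
		}
	}
}

func newRegistry() *registry {
	r := &registry{}
	r.writeMu.mu.entries = make(map[string]*fileEntry)
	return r
}

// writeRegistryRecord stands in for the write IO that persists the mutation.
func (r *registry) writeRegistryRecord(name string, e *fileEntry) error { return nil }

// addEntry is the write path: writeMu is held for the whole operation,
// including the IO, while writeMu.mu is held only for the brief in-memory
// update. Holding writeMu also makes it safe to read the in-memory state here
// without acquiring writeMu.mu, since no other writer can mutate it.
func (r *registry) addEntry(name string, e *fileEntry) error {
	r.writeMu.Lock()
	defer r.writeMu.Unlock()
	if err := r.writeRegistryRecord(name, e); err != nil {
		return err
	}
	r.writeMu.mu.Lock()
	defer r.writeMu.mu.Unlock()
	r.writeMu.mu.entries[name] = e
	return nil
}

// getEntry is the read path: it takes only the read lock, so it proceeds even
// while a writer is stalled inside writeRegistryRecord above.
func (r *registry) getEntry(name string) *fileEntry {
	r.writeMu.mu.RLock()
	defer r.writeMu.mu.RUnlock()
	return r.writeMu.mu.entries[name]
}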

@RaduBerinde
Copy link
Member

pkg/storage/fs/file_registry.go line 87 at r1 (raw file):

Previously, sumeerbhola wrote…

Do we need getters to be able to read from entries while holding just writeMu?

Hmm, I don't quite understand. I'll write out the line of thinking that motivated this:

We want:

  • a single writer: the writer reads the in-memory state to compute the mutation. Then does IO. Then updates the in-memory state.
  • readers to be able to read the in-memory state while the write is in-progress.

So we need a mutex that will be held by the writer throughout its in-progress write. That is writeMu.
We need a separate mutex for reads to continue while the write is in-progress. That is mu. Naturally, mu also needs to be acquired when mutating the in-memory state.

So one can have the pattern:

  • writeMu to serialize writers
  • mu to read/write the in-memory state.

With writeMu > mu.

But acquiring mu to read during a write is (a) not really needed, and (b) not providing any needed guarantee, since the writer is making decisions about the mutation based on the read without holding mu after doing the read. The guarantee of no staleness in the mutation (and no lost mutations) comes from writeMu. So there isn't any need to hold mu when reading. This also improves code readability, since we don't have to hide every read of the in-memory state during the write inside a closure with a mu acquisition and deferred release.

That makes sense, but then I'd just separate out the entries from under writeMu:

writeMu struct {
  syncutil.Mutex
  registryFile etc
}

entriesMu struct {
  syncutil.RWMutex
  entries map[string]*FileEntry
}

The RWMutex just protects the entries map; the other data is protected by writeMu. The fact that we have to hold writeMu for a longer time while also acquiring entriesMu is fine, but that's just a detail of how to implement the write correctly (readers who just want to read from entries don't have to care about that detail).

@sumeerbhola
Collaborator Author

That makes sense, but then I'd just separate out the entries from under writeMu:

But then it creates confusion if the code accesses entriesMu.entries without acquiring entriesMu, which I explicitly do want to do, since acquiring entriesMu would hurt the readability of the code logic just to conform to a common language idiom that is getting in our way here. Putting the mutexes in a hierarchy maybe helps a bit, since it at least allows us to comment on their relationship in a more easily discoverable way.

An example of this readability issue is the pair of implementations below (I picked DataKeyManager since I think it suffers from this problem more than FileRegistry). The former is what we currently do, and the latter is what we would need if we required mu for reading (instead of the closures, I could define getter functions on DataKeyManager, but it's the same problem).

func (m *DataKeyManager) ActiveKeyForWriter(ctx context.Context) (*enginepbccl.SecretKey, error) {
	m.writeMu.Lock()
	defer m.writeMu.Unlock()
	if m.writeMu.rotationEnabled {
		now := kmTimeNow().Unix()
		if now-m.writeMu.mu.activeKey.Info.CreationTime > m.rotationPeriod {
			keyRegistry := makeRegistryProto()
			proto.Merge(keyRegistry, m.writeMu.mu.keyRegistry)
			if err := m.rotateDataKeyAndWrite(ctx, keyRegistry); err != nil {
				return nil, err
			}
		}
	}
	return m.writeMu.mu.activeKey, nil
}

func (m *DataKeyManager) ActiveKeyForWriter(ctx context.Context) (*enginepbccl.SecretKey, error) {
	m.writeMu.Lock()
	defer m.writeMu.Unlock()
	if m.writeMu.rotationEnabled {
		now := kmTimeNow().Unix()
		creationTime := func() int64 {
			m.writeMu.mu.RLock()
			defer m.writeMu.mu.RUnlock()
			return m.writeMu.mu.activeKey.Info.CreationTime
		}
		if now-creationTime() > m.rotationPeriod {
			keyRegistry := makeRegistryProto()
			existingKeyRegistry := func() *enginepbccl.DataKeysRegistry {
				m.writeMu.mu.RLock()
				defer m.writeMu.mu.RUnlock()
				return m.writeMu.mu.keyRegistry
			}
			proto.Merge(keyRegistry, existingKeyRegistry())
			if err := m.rotateDataKeyAndWrite(ctx, keyRegistry); err != nil {
				return nil, err
			}
		}
	}
	return m.writeMu.mu.activeKey, nil
}

Member

@RaduBerinde RaduBerinde left a comment

Thanks, I understand now.

What is the hot path for reads here? Is it the GetKey function? We could use a sync.Map for the entries (which, based on its documentation, would probably be faster than a map plus RWMutex in our case), and activeKey could be an atomic pointer.
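
For illustration, a minimal sketch of that alternative could look like the following. The names (keyManager, secretKey, getKey, rotate, persistRegistry) are hypothetical rather than the actual DataKeyManager API, and it assumes Go 1.19+ for atomic.Pointer:

package keyexample

import (
	"sync"
	"sync/atomic"
)

type secretKey struct{ id string }

type keyManager struct {
	writeMu   sync.Mutex                // still serializes writers and their IO
	keys      sync.Map                  // map[string]*secretKey; lock-free for readers
	activeKey atomic.Pointer[secretKey] // lock-free read of the active key
}

// persistRegistry stands in for the (possibly stalled) write IO.
func (m *keyManager) persistRegistry(k *secretKey) error { return nil }

// getKey and currentActiveKey are the read paths: no mutex is taken, so a
// stalled writer cannot block them.
func (m *keyManager) getKey(id string) (*secretKey, bool) {
	v, ok := m.keys.Load(id)
	if !ok {
		return nil, false
	}
	return v.(*secretKey), true
}

func (m *keyManager) currentActiveKey() *secretKey {
	return m.activeKey.Load()
}

// rotate is the write path: writeMu covers the IO, and publishing the new
// state to readers is just a sync.Map store plus an atomic pointer store.
func (m *keyManager) rotate(newKey *secretKey) error {
	m.writeMu.Lock()
	defer m.writeMu.Unlock()
	if err := m.persistRegistry(newKey); err != nil {
		return err
	}
	m.keys.Store(newKey.id, newKey)
	m.activeKey.Store(newKey)
	return nil
}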

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jbowens and @sumeerbhola)

@sumeerbhola
Collaborator Author

What is the hot path for reads here?

There isn't really a hot path, which is why the existing mutex code was good enough until we realized these reads can't be allowed to block on stalled writes, since WAL failover depends on the reads continuing to work. We do a read when creating an sstable.Reader, which is uncommon given the TableCache.

@sumeerbhola
Collaborator Author

TFTRs!

@sumeerbhola
Collaborator Author

bors r=raduberinde,jbowens

@craig craig bot merged commit 5eda748 into cockroachdb:master Apr 25, 2024
22 checks passed

blathers-crl bot commented Apr 25, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 6518185 to blathers/backport-release-24.1-123057: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 24.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Labels
backport-24.1.x Flags PRs that need to be backported to 24.1.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachtest: disk-stalled/wal-failover/among-stores failed
engineccl: filesystem mutex held during I/O
4 participants