fs,engineccl: allow reads to continue when writes are stalled #123057

Merged
merged 1 commit into cockroachdb:master on Apr 25, 2024

Conversation

sumeerbhola
Collaborator

Both fs.FileRegistry and engineccl.DataKeyManager held an internal mutex while updating their state, which included write IO to update persistent state. This would block readers of the state, specifically file reads that need the file registry entry and data key for a file in order to open and read it.

Blocking these reads due to slow or stalled write IO is not desirable, since the read could succeed if the relevant data is in the page cache. Specifically, with the new WAL failover feature, we expect the store to keep functioning when disk writes are temporarily stalled, since the WAL can fail over. This expectation is not met if essential reads block on non-essential writes that are stalled.

This PR changes the locking in the FileRegistry and DataKeyManager to prevent writes from interfering with concurrent reads.

Epic: none

Fixes: #98051
Fixes: #122364

Release note: None

@sumeerbhola sumeerbhola added the backport-24.1.x Flags PRs that need to be backported to 24.1. label Apr 25, 2024
@sumeerbhola sumeerbhola requested a review from a team as a code owner April 25, 2024 14:06
@cockroach-teamcity
Member

This change is Reviewable

Collaborator

@jbowens jbowens left a comment

:lgtm_strong:

Reviewed 6 of 9 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/storage/fs/file_registry.go line 87 at r1 (raw file):

	writeMu struct {
		syncutil.Mutex
		mu struct {

nit: I find the naming of the mutexes confusing, but I don't really have any suggestions. Is there some way to name these to give more of a sense that writeMu and writeMu.mu are distinct and convey the critical sections they guard? maybe s/writeMu/metaMu/ and s/mu/entriesMu/?

Collaborator Author

@sumeerbhola sumeerbhola left a comment


TFTR!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jbowens)


pkg/storage/fs/file_registry.go line 87 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: I find the naming of the mutexes confusing, but I don't really have any suggestions. Is there some way to name these to give more of a sense that writeMu and writeMu.mu are distinct and convey the critical sections they guard? maybe s/writeMu/metaMu/ and s/mu/entriesMu/?

I am not super happy with the names either. I prefer to keep writeMu since it is the one definitely needed for writes, but not necessary for reads. I initially had mu called readMu, but that was slightly misleading since writers also need it, and one can read without it (by holding writeMu). So I ended up with this.

I tweaked the comment a bit to say writeMu.mu.

Member

@RaduBerinde RaduBerinde left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jbowens and @sumeerbhola)


pkg/storage/fs/file_registry.go line 87 at r1 (raw file):

Previously, sumeerbhola wrote…

I am not super happy with the names either. I prefer to keep writeMu since it is the one definitely needed for writes, but not necessary for reads. I initially had mu called readMu, but that was slightly misleading since writers also need it, and one can read without it (by holding writeMu). So I ended up with this.

I tweaked the comment a bit to say writeMu.mu.

Do we need getters to be able to read from entries while holding just writeMu? If we required readers to just hold the RWMutex, that would make things a lot simpler.

Collaborator Author

@sumeerbhola sumeerbhola left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jbowens and @RaduBerinde)


pkg/storage/fs/file_registry.go line 87 at r1 (raw file):

Do we need getters to be able to read from entries while holding just writeMu?

Hmm, I don't quite understand. I'll write out the line of thinking that motivated this:

We want:

  • a single writer: the writer reads the in-memory state to compute the mutation. Then does IO. Then updates the in-memory state.
  • readers to be able to read the in-memory state while the write is in-progress.

So we need a mutex that will be held by the writer throughout its in-progress write. That is writeMu.
We need a separate mutex for reads to continue while the write is in-progress. That is mu. Naturally, mu also needs to be acquired when mutating the in-memory state.

So one can have the pattern:

  • writeMu to serialize writers
  • mu to read/write the in-memory state.

With writeMu > mu.

But acquiring mu to read during a write is (a) not really needed, and (b) not providing any needed guarantee, since the writer is making decisions about the mutation based on the read without holding mu after doing the read. The guarantee of no staleness in the mutation (and no lost mutations) comes from writeMu. So there isn't any need to hold mu when reading. This also improves code readability, since we don't have to hide every read of the in-memory state during the write inside a closure with a mu acquisition and deferred release.
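
For concreteness, here is a minimal, self-contained sketch of that pattern. The names below (registry, fileEntry, addEntry, getEntry, writeRegistryRecord) are hypothetical illustrations rather than the actual FileRegistry code, and the standard library's sync package stands in for syncutil:

package fsexample

import "sync"

type fileEntry struct{ settings []byte }

type registry struct {
	writeMu struct {
		sync.Mutex // serializes writers; held across the (possibly stalled) IO
		mu struct {
			sync.RWMutex // guards only the in-memory state read by concurrent readers
			entries map[string]*fileEntry
		}
	}
}

func newRegistry() *registry {
	r := &registry{}
	r.writeMu.mu.entries = make(map[string]*fileEntry)
	return r
}

// writeRegistryRecord stands in for the write IO that persists the mutation.
func (r *registry) writeRegistryRecord(name string, e *fileEntry) error { return nil }

// addEntry is the write path: writeMu is held for the whole operation,
// including the IO, while writeMu.mu is held only for the brief in-memory
// update. Holding writeMu also makes it safe to read the in-memory state here
// without acquiring writeMu.mu, since no other writer can mutate it.
func (r *registry) addEntry(name string, e *fileEntry) error {
	r.writeMu.Lock()
	defer r.writeMu.Unlock()
	if err := r.writeRegistryRecord(name, e); err != nil {
		return err
	}
	r.writeMu.mu.Lock()
	defer r.writeMu.mu.Unlock()
	r.writeMu.mu.entries[name] = e
	return nil
}

// getEntry is the read path: it takes only the read lock, so it proceeds even
// while a writer is stalled inside writeRegistryRecord above.
func (r *registry) getEntry(name string) *fileEntry {
	r.writeMu.mu.RLock()
	defer r.writeMu.mu.RUnlock()
	return r.writeMu.mu.entries[name]
}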

@RaduBerinde
Copy link
Member

pkg/storage/fs/file_registry.go line 87 at r1 (raw file):

Previously, sumeerbhola wrote…

Do we need getters to be able to read from entries while holding just writeMu?

Hmm, I don't quite understand. I'll write out the line of thinking that motivated this:

We want:

  • a single writer: the writer reads the in-memory state to compute the mutation. Then does IO. Then updates the in-memory state.
  • readers to be able to read the in-memory state while the write is in-progress.

So we need a mutex that will be held by the writer throughout its in-progress write. That is writeMu.
We need a separate mutex for reads to continue while the write is in-progress. That is mu. Naturally, mu also needs to be acquired when mutating the in-memory state.

So one can have the pattern:

  • writeMu to serialize writers
  • mu to read/write the in-memory state.

With writeMu > mu.

But acquiring mu to read during a write is (a) not really needed, and (b) not providing any needed guarantee, since the writer is making decisions about the mutation based on the read without holding mu after doing the read. The guarantee of no staleness in the mutation (and no lost mutations) comes from writeMu. So there isn't any need to hold mu when reading. This also improves code readability, since we don't have to hide every read of the in-memory state during the write inside a closure with a mu acquisition and deferred release.

That makes sense, but then I'd just separate out the entries from under writeMu:

writeMu struct {
  syncutil.Mutex
  registryFile etc
}

entriesMu struct {
  syncutil.RWMutex
  entries map[string]*FileEntry
}

The RWMutex just protects the entries map; the other data is protected by writeMu. The fact that we have to hold writeMu for a longer time while also acquiring entriesMu is fine, but that's just a detail of how to implement the write correctly (readers who just want to read from entries don't have to care about that detail).

@sumeerbhola
Collaborator Author

That makes sense, but then I'd just separate out the entries from under writeMu:

But then it creates confusion if the code accesses entriesMu.entries without acquiring entriesMu, which I explicitly do want to do, since acquiring entriesMu would hurt the readability of the code logic just to conform to a common language idiom that is getting in our way here. Putting the mutexes in a hierarchy maybe helps a bit, since it at least allows us to comment on their relationship in a more easily discoverable way.

An example of this readability issue is the pair of implementations below (I picked DataKeyManager since I think it suffers from this problem more than FileRegistry). The former is what we currently do, and the latter is what we would need if we required mu for reading (instead of the closures, I could define getter functions on DataKeyManager, but it's the same problem).

func (m *DataKeyManager) ActiveKeyForWriter(ctx context.Context) (*enginepbccl.SecretKey, error) {
	m.writeMu.Lock()
	defer m.writeMu.Unlock()
	if m.writeMu.rotationEnabled {
		now := kmTimeNow().Unix()
		if now-m.writeMu.mu.activeKey.Info.CreationTime > m.rotationPeriod {
			keyRegistry := makeRegistryProto()
			proto.Merge(keyRegistry, m.writeMu.mu.keyRegistry)
			if err := m.rotateDataKeyAndWrite(ctx, keyRegistry); err != nil {
				return nil, err
			}
		}
	}
	return m.writeMu.mu.activeKey, nil
}

func (m *DataKeyManager) ActiveKeyForWriter(ctx context.Context) (*enginepbccl.SecretKey, error) {
	m.writeMu.Lock()
	defer m.writeMu.Unlock()
	if m.writeMu.rotationEnabled {
		now := kmTimeNow().Unix()
		creationTime := func() int64 {
			m.writeMu.mu.RLock()
			defer m.writeMu.mu.RUnlock()
			return m.writeMu.mu.activeKey.Info.CreationTime
		}
		if now-creationTime() > m.rotationPeriod {
			keyRegistry := makeRegistryProto()
			existingKeyRegistry := func() *enginepbccl.DataKeysRegistry {
				m.writeMu.mu.RLock()
				defer m.writeMu.mu.RUnlock()
				return m.writeMu.mu.keyRegistry
			}
			proto.Merge(keyRegistry, existingKeyRegistry())
			if err := m.rotateDataKeyAndWrite(ctx, keyRegistry); err != nil {
				return nil, err
			}
		}
	}
	return m.writeMu.mu.activeKey, nil
}

Member

@RaduBerinde RaduBerinde left a comment

Thanks, I understand now.

What is the hot path for reads here? Is it the GetKey function? We could use a sync.Map for the entries (which, based on its documentation, would probably be faster than a map plus RWMutex in our case), and activeKey could be an atomic pointer.
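
For illustration, a minimal sketch of that alternative could look like the following. The names (keyManager, secretKey, getKey, rotate, persistRegistry) are hypothetical rather than the actual DataKeyManager API, and it assumes Go 1.19+ for atomic.Pointer:

package keyexample

import (
	"sync"
	"sync/atomic"
)

type secretKey struct{ id string }

type keyManager struct {
	writeMu   sync.Mutex                // still serializes writers and their IO
	keys      sync.Map                  // map[string]*secretKey; lock-free for readers
	activeKey atomic.Pointer[secretKey] // lock-free read of the active key
}

// persistRegistry stands in for the (possibly stalled) write IO.
func (m *keyManager) persistRegistry(k *secretKey) error { return nil }

// getKey and currentActiveKey are the read paths: no mutex is taken, so a
// stalled writer cannot block them.
func (m *keyManager) getKey(id string) (*secretKey, bool) {
	v, ok := m.keys.Load(id)
	if !ok {
		return nil, false
	}
	return v.(*secretKey), true
}

func (m *keyManager) currentActiveKey() *secretKey {
	return m.activeKey.Load()
}

// rotate is the write path: writeMu covers the IO, and publishing the new
// state to readers is just a sync.Map store plus an atomic pointer store.
func (m *keyManager) rotate(newKey *secretKey) error {
	m.writeMu.Lock()
	defer m.writeMu.Unlock()
	if err := m.persistRegistry(newKey); err != nil {
		return err
	}
	m.keys.Store(newKey.id, newKey)
	m.activeKey.Store(newKey)
	return nil
}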

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jbowens and @sumeerbhola)

@sumeerbhola
Collaborator Author

What is the hot path for reads here?

There isn't really a hot path, which is why the existing mutex code was good enough until we realized these reads can't be allowed to block on stalled writes, since WAL failover depends on the reads continuing to work. We do a read when creating an sstable.Reader, which is uncommon given the TableCache.

@sumeerbhola
Collaborator Author

TFTRs!

@sumeerbhola
Collaborator Author

bors r=raduberinde,jbowens

@craig craig bot merged commit 5eda748 into cockroachdb:master Apr 25, 2024
22 checks passed

blathers-crl bot commented Apr 25, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 6518185 to blathers/backport-release-24.1-123057: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 24.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Labels
backport-24.1.x Flags PRs that need to be backported to 24.1.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachtest: disk-stalled/wal-failover/among-stores failed
engineccl: filesystem mutex held during I/O
4 participants