compaction: don't split outputs within a user key #1470

Merged (2 commits) on Feb 3, 2022

Conversation

Collaborator

@jbowens jbowens commented Jan 24, 2022

During a compaction, if the current sstable hits the file size limit, defer
finishing the sstable if the next sstable would share a user key. This is the
current behavior of flushes, and this change brings parity between the two.

This change is motivated by the introduction of range keys (see #1339). This
ensures we can always cleanly truncate range keys that span sstable boundaries.

This commit also removes (keyspan.Fragmenter).FlushTo. Now that we prohibit
splitting sstables in the middle of a user key, the Fragmenter's FlushTo
function is unnecessary. Compactions and flushes always use the
TruncateAndFlushTo variant.

This change required a tweak to the way grandparent limits are applied, in
order to switch the grandparent splitter's comparison into a >= comparison.
This was necessary due to the shift in interpreting splitterSuggestions as
exclusive boundaries.

Close #734.
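As a rough illustration of the idea (the types and signatures below are simplified stand-ins, not Pebble's actual splitter interfaces), a wrapper splitter can defer a size-based split until the user key changes:

```go
package main

import (
	"bytes"
	"fmt"
)

// splitter is a simplified stand-in for a compaction output splitter: it is
// asked, before each key is written, whether the current output should be
// finished first.
type splitter interface {
	shouldSplitBefore(userKey []byte, outputSize uint64) bool
}

// fileSizeSplitter requests a split once the current output reaches the
// target file size.
type fileSizeSplitter struct{ maxFileSize uint64 }

func (s fileSizeSplitter) shouldSplitBefore(_ []byte, outputSize uint64) bool {
	return outputSize >= s.maxFileSize
}

// userKeyChangeSplitter wraps another splitter and defers its split
// suggestion until the next key's user key differs from the previous one,
// so an output never ends in the middle of a user key.
type userKeyChangeSplitter struct {
	wrapped     splitter
	prevUserKey []byte
}

func (u *userKeyChangeSplitter) shouldSplitBefore(userKey []byte, outputSize uint64) bool {
	if !u.wrapped.shouldSplitBefore(userKey, outputSize) {
		return false
	}
	return !bytes.Equal(userKey, u.prevUserKey)
}

func main() {
	s := &userKeyChangeSplitter{wrapped: fileSizeSplitter{maxFileSize: 4}}
	var outputSize uint64
	for _, k := range []string{"a", "b", "b", "b", "c"} {
		key := []byte(k)
		if s.shouldSplitBefore(key, outputSize) {
			fmt.Printf("finish output before %q\n", k)
			outputSize = 0
		}
		outputSize += 3 // pretend each key adds three bytes to the output
		s.prevUserKey = key
	}
	// The size limit is exceeded while writing the "b" versions, but the
	// split is deferred until "c", the next distinct user key.
}
```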

@cockroach-teamcity
Member

This change is Reviewable

Collaborator Author

@jbowens jbowens left a comment


Reviewable status: 0 of 15 files reviewed, 2 unresolved discussions


compaction.go, line 161 at r2 (raw file):

	// serve as an *exclusive* end boundary truncation point. If we used > then,
	// we may have already added a key with the user key `g.limit` to the
	// previous sstable.

We never hit this with L0 outputs because if key == nil && splitL0Outputs, we flush all of the tombstones to the current sstable.


compaction.go, line 2477 at r2 (raw file):

			//    (like the grandparent limit is), so that we're guaranteed
			//    that splitterSuggestion is greater than the previous user key
			//    added to the sstable.

Is this possible? When the sublevels return an L0 limit x, is x the first key that we don't want to overlap, or the last key of the interval that we may overlap?

If we can rework it such that x is the first key that we don't want our current output to overlap with and enforce it as such, then it's safe to use splitKey = x even when we've exhausted point keys, because we'll know the previous user key must have been < x.

@jbowens jbowens requested review from a team, itsbilal and sumeerbhola January 25, 2022 16:58
Contributor

@nicktrav nicktrav left a comment


Love this cleanup / consolidation. Did a first pass and left some comments.

Reviewed 15 of 15 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)


-- commits, line 2 at r2:
Can probably link #734 to this too, yeah? Would this close that issue completely?


-- commits, line 4 at r2:
nit: s/is hit/hits/


compaction.go, line 858 at r2 (raw file):

// level N. See maxGrandparentOverlapBytes.
//
// TODO(peter): Stopping compaction output in the middle of a user-key creates

TODO-- nice!


compaction_picker.go, line 381 at r2 (raw file):

// sstables. We could perform a migration, re-compacting any sstables with split
// user keys, which would allow us to remove atomic compaction unit expansion
// code.

Seems reasonable. Worth opening a ticket that we can track this against?


docs/range_deletions.md, line 3 at r2 (raw file):

# Range Deletions

TODO: The following explanation of range deletions does not take into account

Is this like a follow-up thing? Could that section be removed here?


internal/keyspan/fragmenter.go, line 286 at r2 (raw file):

//
// The span [a,f) does not cover the key f.
func (f *Fragmenter) FlushTo(key []byte) {

Given we're removing this method, is it worth renaming the existing method to FlushTo or just Flush, now that there's no ambiguity?


testdata/manual_compaction, line 447 at r2 (raw file):

----
2:
  000007:[a#4,SET-a#4,SET]

The comment mentions sstable (8) and sstable (9). That didn't seem accurate?

Also, we no longer seem to be adjusting the seqnum of the RANGEDEL.


testdata/manual_compaction, line 579 at r2 (raw file):

2:
  000007:[a#3,SET-c#72057594037927935,RANGEDEL]
  000008:[c#3,RANGEDEL-e#72057594037927935,RANGEDEL]

The test comment above mentions "sstable 8". And are we still tweaking the key kind?


testdata/manual_compaction, line 614 at r2 (raw file):

----
2:
  000006:[a#3,SET-e#72057594037927935,RANGEDEL]

Is the comment block for this test still accurate? We are no longer "cutting", instead outputting a single table.


testdata/manual_compaction_set_with_del, line 95 at r2 (raw file):

----
2:
  000008:[a#3,SETWITHDEL-c#72057594037927935,RANGEDEL]

The copy pasta in this file from the previous one probably means "ditto" for all my previous comments :(

Collaborator Author

@jbowens jbowens left a comment


Reviewable status: 12 of 15 files reviewed, 5 unresolved discussions (waiting on @itsbilal, @jbowens, @nicktrav, and @sumeerbhola)


-- commits, line 2 at r2:

Previously, nicktrav (Nick Travers) wrote…

Can probably link #734 to this too, yeah? Would this close that issue completely?

Good call, done.


docs/range_deletions.md, line 3 at r2 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Is this like a follow-up thing? Could that section be removed here?

Yeah, I think much of the discussion of how range deletions work in this document should be updated. We probably also want to preserve some of the historical discussion of how they used to work, at least for as long as we're supporting stores that may contain them.

Member

@itsbilal itsbilal left a comment


:lgtm: really nice cleanup!

Reviewed 3 of 3 files at r3, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)


compaction.go, line 2477 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Is this possible? When the sublevels return an L0 limit x is x the first key that we don't want to overlap, or the last key of the interval that we may overlap?

If we can rework it such that x is the first key that we don't want our current output to overlap with and enforce it as such, then it's safe to use splitKey = x even when we we've exhausted point keys, because we'll know the previous user key must have been < x.

We should change the L0 limit splitter to also return splitNow/true if key >= limit (currently it's just key > limit). That's actually more in line with how we want L0 limits to work; we want limit to not be in the left-side sstable. We didn't address this well in the past because a lot of this code had the expectation baked in that we only cut the output once you exceed limit.
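A tiny hedged sketch of that comparison (the type below is illustrative, not Pebble's actual L0 limit splitter): with >=, the limit key lands in the next output, i.e. the limit behaves as an exclusive end boundary for the current sstable.

```go
package main

import (
	"bytes"
	"fmt"
)

// limitSplitter cuts an output before any key at or beyond limit, so limit
// acts as an exclusive end boundary of the current sstable.
type limitSplitter struct{ limit []byte }

func (s limitSplitter) shouldSplitBefore(userKey []byte) bool {
	// ">= 0" rather than "> 0": the limit key must not end up in the
	// left-side sstable.
	return s.limit != nil && bytes.Compare(userKey, s.limit) >= 0
}

func main() {
	s := limitSplitter{limit: []byte("m")}
	for _, k := range []string{"l", "m", "n"} {
		fmt.Printf("key=%q splitBefore=%v\n", k, s.shouldSplitBefore([]byte(k)))
	}
	// Prints: l=false, m=true, n=true.
}
```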


compaction.go, line 197 at r3 (raw file):

		// beginning of the first file.
		n := len(g.ve.NewFiles)
		g.limit = g.c.findGrandparentLimit(g.ve.NewFiles[n-1].Meta.Largest.UserKey)

Given the simpler semantics of output splitting now, maybe we can replace this line with this line from below:

g.limit = g.c.findGrandparentLimit(g.c.rangeDelFrag.Start())

Thoughts?


compaction.go, line 2289 at r3 (raw file):

					// equal to the previous table's largest key user key, and
					// the previous table's largest key is not exclusive. This
					// violates the invariant that tables are key-space

Nice!


internal/keyspan/fragmenter.go, line 286 at r2 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Given we're removing this method, is worth rebranding the existing method as FlushTo or just Flush, now that there's no ambiguity?

I think it's still a good idea to keep it TruncateAndFlushTo, just to make it clear that we're also truncating range tombstones.

Collaborator

@sumeerbhola sumeerbhola left a comment


Reviewed 4 of 15 files at r1, 1 of 3 files at r3, all commit messages.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @jbowens and @nicktrav)


compaction.go, line 161 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

We never hit this with L0 outputs because if key == nil && splitL0Outputs, we flush all of the tombstones to the current sstable.

Isn't that a deficiency of our L0 splitting scheme?


compaction.go, line 2477 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

We should change the L0 limit splitter to also return splitNow/true if key >= limit (currently it's just key > limit). That's actually more in line with how we want L0 limits to work; we want limit to not be in the left-side sstable. We didn't address this well in the past because a lot of this code had the expectation baked in that we only cut the output once you exceed limit.

This will unify the two cases, and cause L0 outputs to also split tombstones across many sstables (even if we've run out of point keys), yes? If so, that's what I was hoping we were going to get with this simplification.


compaction.go, line 168 at r3 (raw file):

}

func (g *grandparentLimitSplitter) onNewOutput(key *InternalKey) []byte {

l0LimitSplitter.onNewOutput is simple because it doesn't need to worry about a user key spanning sstables. I was hoping this PR would make grandparentLimitSplitter essentially identical to that, except that it uses findGrandparentLimit. We could even combine the two splitters and parameterize it with a func. Am I missing something?
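A sketch of what such a combined splitter might look like (names and signatures here are hypothetical): a single splitter that recomputes its limit via a caller-supplied function whenever a new output file starts.

```go
package main

import (
	"bytes"
	"fmt"
)

// limitFuncSplitter is an illustrative combined splitter: the limit
// function (e.g. a grandparent limit or an L0 sublevel limit) is evaluated
// whenever a new output starts, and the splitter then asks for a split
// before any key at or beyond that limit.
type limitFuncSplitter struct {
	limitFunc func(startUserKey []byte) []byte
	limit     []byte
}

func (s *limitFuncSplitter) onNewOutput(startUserKey []byte) {
	s.limit = s.limitFunc(startUserKey)
}

func (s *limitFuncSplitter) shouldSplitBefore(userKey []byte) bool {
	return s.limit != nil && bytes.Compare(userKey, s.limit) >= 0
}

func main() {
	// Toy limit function: split two letters past the output's first key.
	s := &limitFuncSplitter{limitFunc: func(start []byte) []byte {
		return []byte{start[0] + 2}
	}}
	s.onNewOutput([]byte("a")) // limit becomes "c"
	for _, k := range []string{"a", "b", "c"} {
		fmt.Printf("key=%q splitBefore=%v\n", k, s.shouldSplitBefore([]byte(k)))
	}
}
```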


compaction.go, line 871 at r3 (raw file):

		// this function for how this is used.
		if overlappedBytes > c.maxOverlapBytes && c.cmp(start, f.Smallest.UserKey) < 0 {
			return f.Smallest.UserKey

why this change to use f.Smallest? Also, we've included this sstable in the overlappedBytes when comparing with the threshold, and now we are not allowing the compaction sstable to overlap with this. That seems wrong.
Maybe this is a consequence of the >= change in shouldSplitBefore but I don't see how.


compaction.go, line 2309 at r3 (raw file):

			// purposes of bounds computation -- the current sstable will end up with a Largest key
			// of c#7,1 so the range tombstone in the current file will be able to delete c#7.
			if d.cmp(writerMeta.LargestRangeDel.UserKey, splitKey) >= 0 {

I think this comment and code are obsolete. But we should verify that d.cmp(writerMeta.LargestRangeDel.UserKey, splitKey) <= 0.


compaction.go, line 2361 at r3 (raw file):

	// L0 splits) L0 limits, before wrapping them in an splitterGroup.
	outputSplitters := []compactionOutputSplitter{
		&fileSizeSplitter{maxFileSize: c.maxOutputFileSize},

we could also wrap only the fileSizeSplitter with the userKeyChangeSplitter, like we used to, yes?
Would that be "cheaper" since we would need to compare with unsafePrevUserKey only when that splitter was proposing a split?


compaction.go, line 2443 at r3 (raw file):

			// The next key either has a greater user key than the previous
			// key, or if not, the previous key must not have had a zero
			// sequence number.

Can we now remove the "or if not ..." part?


compaction.go, line 2448 at r3 (raw file):

			// grandparent's smallest key may be less than the current key.
			// Splitting at the current key will cause this output to overlap
			// a potentially unbounded number of grandparents.

I can't remember the details regarding this TODO -- can you remind me?

Collaborator Author

@jbowens jbowens left a comment


Reviewable status: 13 of 15 files reviewed, 9 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)


compaction.go, line 161 at r2 (raw file):

Previously, sumeerbhola wrote…

Isn't that a deficiency of our L0 splitting scheme?

Yeah, it is.


compaction.go, line 168 at r3 (raw file):

Previously, sumeerbhola wrote…

l0LimitSplitter.onNewOutput is simple because it doesn't need to be worry about a user key spanning sstables. I was hoping this PR would make grandparentLimitSplitter be essentially identical to that except that it uses findGrandparentLimit. We could even combine the two splitters and parameterize it with a func. Am I missing something?

Nice, that's great.


compaction.go, line 871 at r3 (raw file):

Previously, sumeerbhola wrote…

why this change to use f.Smallest? Also, we've included this sstable in the overlappedBytes when comparing with the threshold, and now we are not allowing the compaction sstable to overlap with this. That seems wrong.
Maybe this is a consequence of the >= change in shouldSplitBefore but I don't see how.

Isn't the overlappedBytes behavior consistent with the description above?

// such large overlaps because they translate into large compactions. The
// current heuristic stops output of a table if the addition of another key
// would cause the table to overlap more than 10x the target file size at
// level N. See maxGrandparentOverlapBytes.

If overlapping f would increase overlappedBytes to be > c.maxOverlapBytes, then we don't want to overlap f, and we should return f.Smallest.UserKey.


The change to f.Smallest did fall out of the >= change. When a grandparent's Largest is inclusive (not a RANGEDEL sentinel), the >= would cause us to put a key equal to the Largest in the next sstable output. That means the next output sstable would also overlap the grandparent that set the limit, unnecessarily.

The switch to using f.Smallest avoids all the inclusivity/exclusivity considerations, since f.Smallest is always inclusive, which plays well with >= limit splitting logic.


compaction.go, line 2309 at r3 (raw file):

Previously, sumeerbhola wrote…

I think this comment and code is obsolete. But we should verify that d.cmp(writerMeta.LargestRangeDel.UserKey, splitKey) <= 0.

Nice, done.


compaction.go, line 2361 at r3 (raw file):

Previously, sumeerbhola wrote…

we could also have wrapper only the fileSizeSplitter using the userKeyChangeSplitter, like we used to, yes?
Would that be "cheaper" since we would need to compare with unsafePrevUserKey only when that splitter was proposing a split?

Done


compaction.go, line 2443 at r3 (raw file):

Previously, sumeerbhola wrote…

Can we now remove the "or if not ..." part?

Done.


compaction.go, line 2448 at r3 (raw file):

Previously, sumeerbhola wrote…

I can't remember the details regarding this TODO -- can you remind me?

This is talking about a scenario like: The grandparent limit is b and key is x and there's an existing broad range tombstone in the fragmenter with bounds [a,x). The grandparent splitter requests a split before x. We fall into this case, and use x as the split boundary which creates an sstable that extends well beyond the grandparent limit of b.

I think to fix this, we could use splitKey = min(key.UserKey, splitterSuggestion).
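For illustration, a minimal sketch of that min (the helper name is hypothetical): the split key is the smaller of the next point key's user key and the splitter's suggestion, so an open range tombstone like [a,x) can't drag the output far past a grandparent limit of b.

```go
package main

import (
	"bytes"
	"fmt"
)

// chooseSplitKey returns min(nextUserKey, splitterSuggestion), treating a
// nil suggestion as "no limit" and a nil next key as "point keys exhausted".
func chooseSplitKey(nextUserKey, splitterSuggestion []byte) []byte {
	if splitterSuggestion == nil {
		return nextUserKey
	}
	if nextUserKey == nil || bytes.Compare(splitterSuggestion, nextUserKey) < 0 {
		return splitterSuggestion
	}
	return nextUserKey
}

func main() {
	// Grandparent limit "b", next point key "x", open tombstone [a,x):
	// split at "b" instead of letting the output extend all the way to "x".
	fmt.Printf("%s\n", chooseSplitKey([]byte("x"), []byte("b")))
}
```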


compaction_picker.go, line 381 at r2 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Seems reasonable. Worth opening a ticket that we can track this against?

Yeah, I'll create one once this PR merges.

Contributor

@nicktrav nicktrav left a comment


Did another pass. :lgtm:

Reviewed 3 of 3 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @sumeerbhola)

@nicktrav nicktrav mentioned this pull request Jan 31, 2022
Collaborator Author

@jbowens jbowens left a comment


Reviewable status: 12 of 15 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)


compaction.go, line 2448 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

This is talking about a scenario like: The grandparent limit is b and key is x and there's an existing broad range tombstone in the fragmenter with bounds [a,x). The grandparent splitter requests a split before x. We fall into this case, and use x as the split boundary which creates an sstable that extends well beyond the grandparent limit of b.

I think to fix this, we could use splitKey = min(key.UserKey, splitterSuggestion).

Added a second commit that fixes this and simplifies this code to remove the switch.

Collaborator

@sumeerbhola sumeerbhola left a comment


Reviewed 1 of 3 files at r5.
Reviewable status: 13 of 15 files reviewed, 8 unresolved discussions (waiting on @itsbilal, @jbowens, @nicktrav, and @sumeerbhola)


compaction.go, line 871 at r3 (raw file):

If overlapping f would increase overlappedBytes to be > c.maxOverlapBytes, then we don't want to overlap f, and we should return f.Smallest.UserKey.

If the file sizes are roughly S and the target file size is 10S, then on average we were splitting at 10.5S earlier and now we are splitting at 9.5S. I agree it doesn't matter much what we do.
But what if this file happens to be huge for whatever reason (ingested file; many seqnums for a user key which prevented a compaction from stopping at the size guideline) and is > 10S by itself. Seems safer to include it and use Largest. Also, will using Smallest potentially get us stuck at this smallest key, say k. We return k, so stop writing to the current sstable when we reach key k. Then ask for a grandparent limit with k, which again returns k.

Thanks for clarifying how this interacts with >= comparison elsewhere. Can you add a detailed code comment here -- this is very easy to miss. Unless there is a correctness issue (the "get us stuck" part above), I'm fine with this change.


compaction.go, line 2448 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Added a second commit that fixes this and simplifies this code to remove the switch.

Great. Can you add a code comment with this example to motivate why we are choosing the smaller.


compaction.go, line 179 at r5 (raw file):

		//
		// a.SET.3
		// (L0 limit at b)

nit: L0 limit is stale

If the compaction output splitters suggest a user key less than the next user
key, use the splitters' suggestion as the split key. This avoids arbitrary
overlap with grandparents if the next key is far beyond the grandparent limit.
Collaborator Author

@jbowens jbowens left a comment


Reviewable status: 12 of 15 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)


compaction.go, line 871 at r3 (raw file):

Previously, sumeerbhola wrote…

If overlapping f would increase overlappedBytes to be > c.maxOverlapBytes, then we don't want to overlap f, and we should return f.Smallest.UserKey.

If the file sizes are roughly S and the target file size is 10S, then on average we were splitting at 10.5S earlier and now we are splitting at 9.5S. I agree it doesn't matter much what we do.
But what if this file happens to be huge for whatever reason (ingested file; many seqnums for a user key which prevented a compaction from stopping at the size guideline) and is > 10S by itself. Seems safer to include it and use Largest. Also, will using Smallest potentially get us stuck at this smallest key, say k. We return k, so stop writing to the current sstable when we reach key k. Then ask for a grandparent limit with k, which again returns k.

Thanks for clarifying how this interacts with >= comparison elsewhere. Can you add a detailed code comment here -- this is very easy to miss. Unless there is a correctness issue (the "get us stuck" part above), I'm fine with this change.

I think we're okay, because of the existing requirement that findGrandparentLimit always return a key greater than start. I believe that requirement was added to handle a similar problem of being 'stuck' but with the Largest.

I added a comment explaining the rationale for using Smallest, and also adjusted the logic enforcing that we return a key > start so that we avoid the key comparison once we've seen a file with a Smallest > start.
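A rough sketch of that shape of loop, under assumed types rather than Pebble's real metadata (grandparents sorted by smallest key): once admitting the next grandparent would push the overlap past the budget, return that file's smallest user key as an exclusive limit, and only once some smallest key is strictly greater than start so repeated calls can't get stuck.

```go
package main

import (
	"bytes"
	"fmt"
)

// grandparentFile is an illustrative stand-in for grandparent sstable
// metadata: its smallest user key and its size in bytes.
type grandparentFile struct {
	smallestUserKey []byte
	size            uint64
}

// findGrandparentLimit sketches the heuristic discussed above. The files
// are assumed sorted by smallest key. Once the accumulated overlap exceeds
// maxOverlapBytes, the current file's smallest user key is returned as an
// exclusive split limit, but only after some file's smallest key has been
// seen to be strictly greater than start.
func findGrandparentLimit(start []byte, grandparents []grandparentFile, maxOverlapBytes uint64) []byte {
	var overlapped uint64
	var seenGreater bool
	for _, f := range grandparents {
		overlapped += f.size
		if !seenGreater {
			seenGreater = bytes.Compare(f.smallestUserKey, start) > 0
		}
		if overlapped > maxOverlapBytes && seenGreater {
			return f.smallestUserKey
		}
	}
	return nil // no limit; the output may extend to the end of the compaction
}

func main() {
	gps := []grandparentFile{
		{[]byte("a"), 5}, {[]byte("c"), 5}, {[]byte("e"), 5},
	}
	fmt.Printf("%s\n", findGrandparentLimit([]byte("a"), gps, 8)) // prints c
}
```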


compaction.go, line 2448 at r3 (raw file):

Previously, sumeerbhola wrote…

Great. Can you add a code comment with this example to motivate why we are choosing the smaller.

Done.


compaction.go, line 179 at r5 (raw file):

Previously, sumeerbhola wrote…

nit: L0 limit is stale

Done.

Collaborator

@sumeerbhola sumeerbhola left a comment


:lgtm_strong:

Reviewed 1 of 1 files at r6.
Reviewable status: 13 of 15 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)

Collaborator Author

@jbowens jbowens left a comment


TFTRs!

Reviewable status: 13 of 15 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)

jbowens added a commit to jbowens/pebble that referenced this pull request Feb 25, 2022
Add a new format major version FormatSplitUserKeys that guarantees all sstables
within the table form their own atomic compaction unit. Previous versions of
Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be
split across multiple tables within a level. This produced a set of files known
as an 'atomic compaction unit' that must be compacted together to preserve the
LSM invariant.

This change adds a new format major version that can't be committed until all
such atomic compaction units have been recompacted to ensure that within a
level, all versions of the same user key are within the same sstable. When
ratcheting the format major version to  FormatSplitUserKeys, Pebble will look
for atomic compaction units and schedule compactions, one-at-a-time, to rewrite
these multi-file atomic compaction units into multiple single-file atomic
compaction units.

This migration will allow future versions of Pebble to remove code necessary to
handle these atomic compaction units.

Close cockroachdb#1495.
jbowens added a commit to jbowens/pebble that referenced this pull request Feb 28, 2022
Add a new format major version FormatSplitUserKeys that guarantees all sstables
within the table form their own atomic compaction unit. Previous versions of
Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be
split across multiple tables within a level. This produced a set of files known
as an 'atomic compaction unit' that must be compacted together to preserve the
LSM invariant.

This change adds a new format major version that can't be committed until all
such atomic compaction units have been recompacted to ensure that within levels
1-6, all versions of the same user key are within the same sstable. When
ratcheting the format major version to  FormatSplitUserKeys, Pebble will look
for atomic compaction units and schedule compactions, one-at-a-time, to rewrite
these multi-file atomic compaction units into multiple single-file atomic
compaction units.

This migration will allow future versions of Pebble to remove code necessary to
handle these atomic compaction units.

Close cockroachdb#1495.
jbowens added a commit to jbowens/pebble that referenced this pull request Feb 28, 2022
Add a new format major version FormatSplitUserKeys that guarantees all sstables
within the table form their own atomic compaction unit. Previous versions of
Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be
split across multiple tables within a level. This produced a set of files known
as an 'atomic compaction unit' that must be compacted together to preserve the
LSM invariant.

This change adds a new format major version that can't be committed until all
such atomic compaction units have been recompacted to ensure that within levels
1-6, all versions of the same user key are within the same sstable. When
ratcheting the format major version to  FormatSplitUserKeys, Pebble will look
for atomic compaction units and schedule compactions, one-at-a-time, to rewrite
these multi-file atomic compaction units into multiple single-file atomic
compaction units.

Before this new format major version is committed, Pebble will also
opportunistically schedule compactions to compact these files in the background
when no other compactions may be scheduled. A new field on `pebble.Metrics`
exposes the count of such files, populated at Open and whenever one of these
compactions is attempted.

This migration will allow future versions of Pebble to remove code necessary to
handle these atomic compaction units.

Close cockroachdb#1495.
jbowens added a commit to jbowens/pebble that referenced this pull request Mar 3, 2022
Add two new format major versions FormatSplitUserKeysMarked and
FormatMarkCompacted that together may be used to guarantee that all sstables
within the table form their own atomic compaction unit. Previous versions of
Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be
split across multiple tables within a level. This produced a set of files known
as an 'atomic compaction unit' that must be compacted together to preserve the
LSM invariant.

The first format major version FormatSplitUserKeysMarked may be used to
guarantee that every file that is a member of a multi-file atomic compaction unit
is marked for compaction. The 'marked-for-compaction' file metadata field is
currently unused by Pebble, but is repurposed here to record the intent to
recompact these files with split user keys. If ratcheting to
FormatSplitUserKeysMarked discovers files with split user keys, it marks them
for compaction and then rotates the manifest to ensure that the updated file
metadata is persisted durably.

This commit introduces a new rewrite compaction type. During compaction picking
if no other productive compaction is picked, the compaction picker looks for
files marked for compaction. It uses a manifest.Annotator to avoid a linear
search through the file metadata. If a marked file exists, it picks a
compaction that outputs into the file's existing level, pulling in its atomic
compaction unit if necessary.

The second format major version FormatMarkCompacted is used to guarantee that
no files that are marked for compaction exist. This may be used in a subsequent
CockroachDB release (22.2) to ensure that all files marked for compaction by
the FormatSplitUserKeysMarked format major version have been compacted away.
Ratcheting to this format major version blocks until all the marked files are
compacted.

Together these format major versions will allow us to remove code necessary to
handle these atomic compaction units, when we increase the minimum format major
version supported by Pebble.

Close cockroachdb#1495.
jbowens added a commit that referenced this pull request Mar 10, 2022
Add two new format major versions FormatSplitUserKeysMarked and
FormatMarkCompacted that together may be used to guarantee that all sstables
within the table form their own atomic compaction unit. Previous versions of
Pebble (before #1470) and RocksDB allowed keys with identical user keys to be
split across multiple tables within a level. This produced a set of files known
as an 'atomic compaction unit' that must be compacted together to preserve the
LSM invariant.

The first format major version FormatSplitUserKeysMarked may be used to
guarantee that every file that is a member of a multi-file atomic compaction unit
is marked for compaction. The 'marked-for-compaction' file metadata field is
currently unused by Pebble, but is repurposed here to record the intent to
recompact these files with split user keys. If ratcheting to
FormatSplitUserKeysMarked discovers files with split user keys, it marks them
for compaction and then rotates the manifest to ensure that the updated file
metadata is persisted durably.

This commit introduces a new rewrite compaction type. During compaction picking
if no other productive compaction is picked, the compaction picker looks for
files marked for compaction. It uses a manifest.Annotator to avoid a linear
search through the file metadata. If a marked file exists, it picks a
compaction that outputs into the file's existing level, pulling in its atomic
compaction unit if necessary.

The second format major version FormatMarkCompacted is used to guarantee that
no files that are marked for compaction exist. This may be used in a subsequent
CockroachDB release (22.2) to ensure that all files marked for compaction by
the FormatSplitUserKeysMarked format major version have been compacted away.
Ratcheting to this format major version blocks until all the marked files are
compacted.

Together these format major versions will allow us to remove code necessary to
handle these atomic compaction units, when we increase the minimum format major
version supported by Pebble.

Close #1495.
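For context on how a migration like this is driven, a hedged usage sketch (assuming Pebble's Open and RatchetFormatMajorVersion API; the version constant names are taken from the commit message above and may differ in the released code):

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	// Open the store at whatever format major version it currently uses.
	db, err := pebble.Open("demo-db", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Ratchet first to the version that marks multi-file atomic compaction
	// units for compaction, then to the version that requires them to be
	// gone; the second ratchet blocks until the marked files are compacted.
	// Constant names follow the commit message and are assumptions here.
	for _, v := range []pebble.FormatMajorVersion{
		pebble.FormatSplitUserKeysMarked,
		pebble.FormatMarkCompacted,
	} {
		if err := db.RatchetFormatMajorVersion(v); err != nil {
			log.Fatal(err)
		}
	}
}
```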

Successfully merging this pull request may close these issues.

compaction: Don't split user keys during compactions