db: support range-key => value #1339
Comments
Great writeup! Some quick comments after a first pass.
It'd be worth pointing out that a
This would have the opposite problem of two unrelated range keys suddenly merging if they happened to be written next to each other e.g. following a compaction.
Could we store the original
Thanks for this great writeup! I haven't internalized the handling of sstable boundaries yet. I'll add more comments later this afternoon.
We could use a sstable property indicating whether a table has range keys to allow for coarse sstable-level decisions of whether to allow bloom filter matching or not, right? Not advocating for interleaving though.
If a user (like CockroachDB) wanted to implement multiple kinds of range operations, for example something in addition to MVCC Delete Range, is the intention that they would encode that within the value?
These ranges are truncated to sstable bounds though, right?
De-fragmentation, if it could be done efficiently, would help performance for masking old versions too, right? The iterator would be able to know that it needs to mask versions to the more distant end key.
This sstable contains the point
Should range end boundaries also be allowed to have suffixes? The sstable on the right side would need an extra fragment that only covers the remainder of first sstable: points: a@100, a@30, b@100, b@40 ranges:
Responding to some of the comments here. We can start addressing the rest in the doc PR.
Yes. I think that is what we will need to handle
Yes. Though it would be nice if we didn't suffer a huge performance drop if every sstable had one range key.
It would be the user key part of the key probably. Just like we reserve parts of the key space for MVCC writes and another part for locks over those MVCC writes. The CockroachDB DeleteRange is an MVCC write, so belongs in the MVCC key space of CockroachDB.
Yes, they would be.
I don't know how much it will matter. It is possible that we've read a block in an sstable (at a different level) that is not at the top of the mergingIter heap (the same thing can occur when we do the seek optimization in mergingIter). Within a level the fragment in an sstable should correspond to the sstable bounds, so we should be able to omit blocks within the sstable.
That is a very good observation and I'd been debating whether it mattered before writing this up. In some sense at the MVCC level we have a similar problem to what RANGEDEL has at the user key level. But it may not be quite the same. Seeking to b@30, and not seeing [b, ImmediateSuccessor(b))@50, because the read is at < @50 is ok. And if it is doing a read >= @50, it will see [b, ImmediateSuccessor(b))@50 first. When operating in masking mode we will need to remember the masking ranges from a preceding sstable -- we can't rely on an sstable to be self contained. I haven't thought through all the implications on the iterator implementation.
There are difficulties here. For example, [b@30,c): how does it represent that bb@49 is masked? I think if we wanted to not have to look at the first sstable to decide what is masked in the second sstable, we could include [b,ImmediateSuccessor(b))@50 in the first sstable and also include [b,ImmediateSuccessor(b))@39 in the second sstable. It won't fully work because what if the point key in the second sstable is b@39? There is no version gap between 40 and 39 that we can use for a range key < 40 and > 39.
Range keys (see #1341) will be stored in their own, single block of an sstable. Add a new, optional meta block, indexed as "pebble.range_key" in the metablock index, to the sstable structure. This block is only present when at least one range key has been written to the sstable. Add the ability to add range keys to an sstable via `(*sstable.Writer).Write`. Update existing data-driven tests to support printing of the range key summary. Add additional test coverage demonstrating writing of range keys with an `sstable.Writer`. Add minimal functionality to `sstable.Reader` to support writing the data-driven test cases for the writer. Additional read-oriented functionality will be added in a subsequent patch. Related to #1339.
The `sstable.Layout` struct contains information pertaining to the layout of an sstable. Add the range key block to the layout. Related to #1339.
During a compaction, if the current sstable hits the file size limit, defer finishing the sstable if the next sstable would share a user key. This is the current behavior of flushes, and this change brings parity between the two. This change is motivated by the introduction of range keys (see #1339). This ensures we can always cleanly truncate range keys that span sstable boundaries. This commit also removes `(keyspan.Fragmenter).FlushTo`. Now that we prohibit splitting sstables in the middle of a user key, the Fragmenter's FlushTo function is unnecessary. Compactions and flushes always use the TruncateAndFlushTo variant. This change required a tweak to the way grandparent limits are applied, in order to switch the grandparent splitter's comparison into a >= comparison. This was necessary due to the shift in interpreting `splitterSuggestion`s as exclusive boundaries. Close #734.
As part of the MVCC Bulk Operations project, [Pebble range keys][1] will need to be enabled on _all_ engines of a cluster before nodes can start using the feature to read and write SSTables that contain the range key features (a backward-incompatible change). Adding a cluster version is necessary, but not sufficient, to guarantee that nodes are ready to generate, and importantly _receive_, SSTables with range key support. Specifically, there exists a race condition where nodes are told to update their engines as part of the existing Pebble major format update process, but there is no coordination between the nodes. One node (a sender) may complete its engine upgrade and enable the new SSTable features _before_ another node (the receiver). The latter will panic on receipt of an SSTable with the newer features written by the former. Add a server RPC endpoint that provides a means of waiting on a node to update its store to a version that is compatible with a cluster version. This endpoint is used as part of a system migration to ensure that all nodes in a cluster are running with an engine version that is at least compatible with a given cluster version. Expose the table format major version on `storage.Engine`. This will be used elsewhere in Cockroach (for example, SSTable generation for ingest and backup). Add a `WaitForCompatibleEngineVersion` function on the `storage.Engine` interface that provides a mechanism to block until an engine is running at a format major version that is compatible with a given cluster version. Expose the engine format major version as a `storage.TestingKnob` to allow tests to alter the Pebble format major version. Add a new cluster version to coordinate the upgrade of all engines in a cluster to `pebble.FormatBlockPropertyCollector` (Pebble,v1), and the system migration required for coordinating the engine upgrade to the latest Pebble table format version. This patch also fixes an existing issue where a node may write SSTables with block properties as part of a backup that are then ingested by an older node. This patch provides the infrastructure necessary for making these "cluster-external" SSTable operations engine-aware. Nodes should only use a table format version that other nodes in the cluster understand. Informs cockroachdb/pebble#1339. [1]: cockroachdb/pebble#1339 Release note: None
Introduce a new format major version for range keys, with associated table format `Pebble,v2`. When the DB is opened at this version, it is free to make use of range key features. Add various validations and assertions based on the new format major version. For example, if the DB version does not support range keys, ingesting tables with range keys will fail, etc. Related to #1339.
Thoughts on closing this out as 'feature complete'? I think the remainder of related work is tracked elsewhere.
Let's close it.
Describe design of range keys. Informs #1339 Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
TODO:
- `pebble.Batch` interface (batch: add RangeKey{Set,Unset,Delete} methods #1386)
- `internal/rangedel` package into a genericized `internal/keyspan` package (internal/keyspan: add Value to Span #1382, db: genericize rangeDel{Cache,Frags} into keySpan{Cache,Frags} #1356, internal/keyspan: rename from rangedel #1352)
- `*FileMetadata` and MANIFEST
- `(*sstable.Reader).EstimateDiskUsage` to take into account overlapping range key spans
- `sstable_writer.go` (in Cockroach) to be aware of cluster version (so they can infer pebble format major version, and hence table format version)
- `String()` methods for displaying table bounds
Pebble currently has only one kind of key that is associated with a range: `RANGEDEL [k1, k2)#seq`, where [k1, k2) is supplied by the caller, and is used to efficiently remove a set of point keys.
The detailed design for MVCC compliant bulk operations (high-level description; detailed design draft for DeleteRange in internal doc) is running into increasing complexity by placing range operations above the Pebble layer, such that Pebble sees these as points. The causes of the complexity are various: (a) which key (start or end) to anchor this range on, when represented as a point (there are performance consequences), (b) rewriting on CockroachDB range splits (and concerns about rewrite volume), (c) fragmentation on writes and the complexity thereof (and performance concerns for reads when not fragmenting), (d) inability to efficiently skip older MVCC versions that are masked by a `[k1,k2)@ts` (where ts is the MVCC timestamp).
First-class support for range keys in Pebble would eliminate all these issues. Additionally, it could allow for future extensions like efficient transactional range operations. This issue describes how this feature would work from the perspective of a user of Pebble (like CockroachDB), and sketches some implementation details.
1. Writes
There are two new operations:
- `Set([k1, k2), [optional suffix], <value>)`: This represents the mapping `[k1, k2)@suffix => value`.
- `Del([k1, k2), [optional suffix])`: The delete can use a smaller key range than the original Set, in which case part of the range is deleted. The optional suffix must match the original Set.
For example, consider `Set([a,d), foo)` (i.e., no suffix). If there is a later call `Del([b,c))`, the resulting state seen by a reader is `[a,b) => foo`, `[c,d) => foo`. Note that the value is not modified when the key is fragmented.
Overlapping Sets overwrite each other: consider `Set([a,d), foo)`, followed by `Set([c,e), bar)`. The resulting state is `[a,c) => foo`, `[c,e) => bar`.
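A minimal sketch of how these two operations might surface on a write batch; the `RangeKeySet`/`RangeKeyDelete` names and signatures below are illustrative assumptions, since this issue only pins down the abstract `Set`/`Del` semantics:

```go
// rangeKeyWriter is a hypothetical interface for the two write operations
// described above. start/end bound the range [start, end), suffix is the
// optional MVCC version, and value is opaque to Pebble.
type rangeKeyWriter interface {
	// RangeKeySet records the mapping [start, end)@suffix => value.
	RangeKeySet(start, end, suffix, value []byte) error
	// RangeKeyDelete removes the portion of a previously set range key that
	// overlaps [start, end); suffix must match the original Set.
	RangeKeyDelete(start, end, suffix []byte) error
}

// Example from the text: Set([a,d), foo) followed by Del([b,c)) leaves
// [a,b) => foo and [c,d) => foo visible to readers.
func writeExample(w rangeKeyWriter) error {
	if err := w.RangeKeySet([]byte("a"), []byte("d"), nil, []byte("foo")); err != nil {
		return err
	}
	return w.RangeKeyDelete([]byte("b"), []byte("c"), nil)
}
```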
Readers see the `<value>` for the whole range and iterate until they are past that k2. These range keys are subject to the LSM invariant: `Set([k1,k2))#s2` cannot be at a lower level than `Set(k)#s1` where `k \in [k1,k2)` and `s1 < s2`.
A user iterating over a key interval [k1,k2) can request:
2. Key Ordering, Iteration etc.
2.1 Non-MVCC keys, i.e., no suffix
Consider the following state in terms of user keys:
point keys: a, b, c, d, e, f
range key: [a,e)
A pebble.Iterator, which shows user keys, will output (during forward iteration), the following keys:
(a,[a,e)), (b,[b,e)), (c,[c,e)), (d,[d,e)), e
The notation (b,[b,e)) indicates both these keys and their corresponding values are visible at the same time when the iterator is at that position. This can be handled by having an iterator interface like `Key()` that also exposes the range key overlapping the current position.
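A sketch of what such a combined iterator interface could look like; the method names (`HasRange`, `RangeBounds`, `RangeValue`) are illustrative assumptions, not a settled API:

```go
// combinedIterator sketches an iterator that surfaces a point key together
// with any range key overlapping the current position.
type combinedIterator interface {
	// Key returns the point user key at the current position, e.g. b.
	Key() []byte
	// HasRange reports whether a range key overlaps the current position.
	HasRange() bool
	// RangeBounds returns the [start, end) bounds of that range key,
	// e.g. [b, e) when the iterator is positioned at (b, [b,e)).
	RangeBounds() (start, end []byte)
	// RangeValue returns the value associated with the range key.
	RangeValue() []byte
}
```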
Overwriting a range key is not a problem, since the one with the highest seqnum will win, just like for point keys.
2.2 Split points for sstables
Consider the actual seqnums assigned to these keys and say we have:
point keys: a#50, b#70, b#49, b#48, c#47, d#46, e#45, f#44
range key: [a,e)#60
We have created three versions of b in this example. Pebble currently can split output sstables during a compaction such that the different b versions span more than one sstable. This creates problems for RANGEDELs that span these two sstables, which are discussed in the section on improperly truncated RANGEDELs. We manage to tolerate this for RANGEDELs since their semantics are defined by the system, which is not true for these range keys, where the actual semantics are up to the user.
We will disallow such sstable split points if they can span a range key. In this example, by postponing the sstable split point to the user key c, we can cleanly split the range key into `[a,c)#60` and `[c,e)#60`. The sstable end bound for the first sstable (sstable bounds are inclusive) will be c#inf (where inf is the largest possible seqnum, which is unused except for these cases), and sstable start bound for the second sstable will be c#60.
Addendum: When we discuss keys with a suffix, i.e., MVCC keys, we discover additional difficulties (described below). The solution there is to require an `ImmediateSuccessor` function on the key prefix in order to enable range keys (here the prefix is the same as the user key). The example above is modified in the following way, when the sstables are split after b#49:
- first sstable: [a,b)#60, b#70, [b,ImmediateSuccessor(b))#60, b#49
- second sstable: [ImmediateSuccessor(b),e)#60, c#47, d#46, e#45, f#44
The key [b,ImmediateSuccessor(b)) behaves like a point key at b in terms of bounds, which means the end bound for the first sstable is b#49, which does not overlap with the start of the second sstable at b#48.
So we have two workable alternative solutions here. It is possible we will choose to not let the same key span sstables since it simplifies some existing code in Pebble too.
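A rough sketch of the first rule (never cut an sstable in the middle of a user key); the `prefix` argument is a hypothetical stand-in for the key prefix/Split function, and the real compaction logic is considerably more involved:

```go
import "bytes"

// pickSplitKey returns the index of the first key at or after limitIdx whose
// prefix differs from its predecessor's, so that an sstable is never cut in
// the middle of a user key. keys must be sorted.
func pickSplitKey(keys [][]byte, limitIdx int, prefix func([]byte) []byte) int {
	for i := limitIdx; i < len(keys); i++ {
		if i == 0 || !bytes.Equal(prefix(keys[i-1]), prefix(keys[i])) {
			return i // safe split point: the previous key has a different prefix
		}
	}
	return len(keys) // no safe split point; keep everything in one sstable
}
```

In the example above, with keys a#50, b#70, b#49, b#48, c#47, a size limit reached after b#70 would defer the split until c#47.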
2.3 Determinism of output
Range keys will be split based on boundaries of sstables in an LSM. We typically expect that two different LSMs with different sstable settings that receive the same writes should output the same key-value pairs when iterating. To provide this behavior, the iterator implementation could defragment range keys during iteration time. The defragmentation behavior would be:
The above defragmentation is conceptually simple, but hard to implement efficiently, since it requires stepping ahead from the current position to defragment range keys. This stepping ahead could switch sstables while there are still points to be consumed in a previous sstable. We note that application correctness cannot rely on determinism of fragments, and that this determinism is useful only for testing and verification purposes.
In short, we will not guarantee determinism of output.
TODO: think more about whether there is an efficient way to offer determinism.
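For illustration, a minimal sketch of iteration-time defragmentation, under the assumption that fragments merge only when they touch exactly and carry the same suffix and value:

```go
import "bytes"

// span is a simplified range key fragment [start, end) carrying a suffix and
// value; this is a sketch type, not Pebble's actual representation.
type span struct {
	start, end, suffix, value []byte
}

// defragment merges adjacent fragments that abut and have identical suffix
// and value. Input fragments must be sorted by start key and non-overlapping.
func defragment(frags []span) []span {
	var out []span
	for _, f := range frags {
		if n := len(out); n > 0 &&
			bytes.Equal(out[n-1].end, f.start) &&
			bytes.Equal(out[n-1].suffix, f.suffix) &&
			bytes.Equal(out[n-1].value, f.value) {
			out[n-1].end = f.end // extend the previous fragment
			continue
		}
		out = append(out, f)
	}
	return out
}
```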
2.4 MVCC keys, i.e., may have suffix
As a reminder, we will have range writes of the form `Set([k1, k2), <ver>, <value>)`. The prefix extraction applied to `k1+<ver>` and `k2+<ver>` will give us prefix k1 and k2 respectively (+ is simply concatenation). Point writes are of the form `k+<ver> => <value>` (in some cases `<ver>` is empty for these point writes), and their prefix is obtained from `k+<ver>` in the same way. The key ordering puts `k < k+<ver>` (when `<ver>` is non-empty); CockroachDB has adopted this behavior since it results in the following clean behavior: RANGEDEL over [k1, k2) deletes all versioned keys which have prefixes in the interval [k1, k2). Pebble will now require this behavior for anyone using MVCC keys.
Consider the following state in terms of user keys (the @Number part is the version):
point keys: a@100, a@30, b@100, b@40, b@30, c@40, d@40, e@30
range key: [a,e)@50
If we consider the user key ordering across the union of these keys, where we have used ' and '' to mark the start and end keys for the range key, we get:
a@100, a@50', a@30, b@100, b@40, b@30, c@40, d@40, e@50'', e@30
A pebble.Iterator, which shows user keys, will output (during forward iteration), the following keys:
a@100, [a,e)@50, a@30, b@100, [b,e)@50, b@40, b@30, [c,e)@50, c@40, [d,e)@50, d@40, e@30
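A toy comparison that captures the ordering requirements above (a bare prefix k sorts before any k@ver, and a@100 sorts before a@30); this is purely illustrative and not CockroachDB's actual key encoding:

```go
import "bytes"

// mvccKey is a toy decomposition of a user key into a prefix and an optional
// version, mirroring the Split behavior described above.
type mvccKey struct {
	prefix  []byte
	version uint64 // 0 means "no version", i.e. the bare prefix k
}

// compare orders keys as the design requires: keys are grouped by prefix, the
// bare prefix sorts before any versioned key, and among versioned keys a
// larger version sorts earlier (a@100 < a@50 < a@30).
func compare(a, b mvccKey) int {
	if c := bytes.Compare(a.prefix, b.prefix); c != 0 {
		return c
	}
	switch {
	case a.version == b.version:
		return 0
	case a.version == 0: // bare prefix sorts first
		return -1
	case b.version == 0:
		return 1
	case a.version > b.version: // newer versions sort earlier
		return -1
	default:
		return 1
	}
}
```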
2.5 Splitting MVCC keys across sstables
Splitting of an MVCC range key is always done at the key prefix boundary.
Consider the earlier example and let us allow the same prefix b to span multiple sstables. The two sstables would have the following user keys:
In practice the compaction code would need to know that the point key after b@30 has key prefix c, in order to fragment the range into [a,c)@50 for inclusion in the first sstable. This is messy wrt code, so a prerequisite for using range keys is to define an `ImmediateSuccessor` function on the key prefix. The compaction code can then include [a,ImmediateSuccessor(b))@50 in the first sstable and [ImmediateSuccessor(b),e)@50 in the second sstable. But this raises another problem: the end bound of the first sstable is ImmediateSuccessor(b)@50#inf and the start bound of the second sstable is b@30#s, where s is some seqnum. Such overlapping bounds are not permitted.
- Not have a key prefix span multiple files: This would disallow splitting after b@40, since b@30 remains. Requiring all point versions for an MVCC key to be in one sstable is dangerous given that histories can be maintained for a long time. This would also potentially increase write amplification. So we do not consider this further.
- Use the ImmediateSuccessor to make range keys behave as points, when necessary: Since the prefix b is spanning multiple sstables in this example, the range key would get split into [a,b)@50, [b,ImmediateSuccessor(b))@50, [ImmediateSuccessor(b),e)@50. Hence the end bound for the first sstable is b@40#s1, which does not overlap with the b@30#s2 start of the second sstable. If the second sstable starts on a new point key prefix we would use that prefix for splitting the range keys in the first sstable.
We adopt the second option. However, now it is possible for b@100#s2 to be at a lower LSM level than [a,e)@50#s1 where s1<s2. This is because the LSM invariant only works over user keys, and not over user key prefixes, and we have deliberately chosen this behavior.
Truncation of the range keys using the configured upper bound of the iterator can only be done if the configured upper bound is its own prefix. That is, we will not permit such iterators to have an upper bound like d@200. It has to be a prefix like d. This is to ensure the truncation is sound -- we are not permitting these range keys to have a different suffix for the start and end keys. If the user desires a tighter upper bound than what this rule permits, they can do the bounds checking themselves.
2.6 Strengthening RANGEDEL semantics
The current RANGEDEL behavior is defined over points and is oblivious to the prefix/suffix behavior of MVCC keys. The common case is a RANGEDEL of the form [b,d), i.e., no suffix. Such a RANGEDEL can be cleanly applied to range keys since it is not limited to a subset of versions. However, currently one can also construct RANGEDELs like [b@30,d@90), which are acceptable for bulk deletion of points, but semantically hard to handle when applied against a range key [a,e)@50. To deal with this we will keep the current RANGEDEL to apply to only point keys, and introduce RANGEDEL_PREFIX, which must have start and end keys that are equal to their prefix.
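A small sketch of the bare-prefix restriction that both the iterator upper-bound rule in 2.5 and RANGEDEL_PREFIX rely on; `split` stands in for the user-supplied Split/prefix function and the error text is illustrative:

```go
import (
	"bytes"
	"errors"
)

// checkBarePrefix enforces the rule that a bound used to truncate range keys
// must equal its own prefix (e.g. d is allowed, d@200 is not).
func checkBarePrefix(key []byte, split func([]byte) []byte) error {
	if !bytes.Equal(split(key), key) {
		return errors.New("bound must be a bare prefix with no version suffix")
	}
	return nil
}

// checkRangeDelPrefix applies the same rule to both ends of a RANGEDEL_PREFIX,
// rejecting bounds like [b@30,d@90).
func checkRangeDelPrefix(start, end []byte, split func([]byte) []byte) error {
	if err := checkBarePrefix(start, split); err != nil {
		return err
	}
	return checkBarePrefix(end, split)
}
```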
2.7 Efficient masking of old MVCC versions
Recollect that in the earlier example (before we discussed splitting) the pebble.Iterator would output (during forward iteration), the following keys:
a@100, [a,e)@50, a@30, b@100, [b,e)@50, b@40, b@30, [c,e)@50, c@40, [d,e)@50, d@40, e@30
It would be desirable if the caller could indicate a desire to skip over a@30, b@40, b@30, c@40, d@40, and this could be done efficiently. We will support iteration in this MVCC-masked mode, when specified as an iterator option. This iterator option would also need to specify the version such that the caller wants all values >= version.
To efficiently implement this we cannot rely on the LSM invariant, since b@100 can be at a lower level than [a,e)@50. However, we do have (a) per-block key bounds, and (b) for time-bound iteration purposes we will be maintaining block properties for the timestamp range (in CockroachDB). We will require that, for efficient masking, the user provide the name of the block property to utilize. In this example, when the iterator encounters [a,e)@50 it can skip blocks whose keys are guaranteed to be in [a,e) and whose timestamp interval is < 50.
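A sketch of that block-skipping decision, assuming a hypothetical per-block summary holding the key bounds and the maximum timestamp recorded by such a block property (field and function names are illustrative):

```go
import "bytes"

// blockMeta is a hypothetical per-block summary: the block's key bounds and
// the maximum timestamp reported by a user-defined block property collector.
type blockMeta struct {
	smallestKey, largestKey []byte
	maxTimestamp            uint64
}

// maskedByRangeKey reports whether an entire block can be skipped because all
// of its keys lie inside the masking range key [start, end) and every key in
// the block is older than the range key's timestamp.
func maskedByRangeKey(b blockMeta, start, end []byte, rangeKeyTS uint64) bool {
	inRange := bytes.Compare(b.smallestKey, start) >= 0 &&
		bytes.Compare(b.largestKey, end) < 0
	return inRange && b.maxTimestamp < rangeKeyTS
}
```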