db: support range-key => value #1339
Comments
Great writeup! Some quick comments after a first pass.
It'd be worth pointing out that a
This would have the opposite problem of two unrelated range keys suddenly merging if they happened to be written next to each other e.g. following a compaction.
Could we store the original
Thanks for this great writeup! I haven't internalized the handling of sstable boundaries yet. I'll add more comments later this afternoon.
We could use a sstable property indicating whether a table has range keys to allow for coarse sstable-level decisions of whether to allow bloom filter matching or not, right? Not advocating for interleaving though.
If a user (like CockroachDB) wanted to implement multiple kinds of range operations, for example something in addition to MVCC Delete Range, is the intention that they would encode that within the value?
These ranges are truncated to sstable bounds though, right?
De-fragmentation, if it could be done efficiently, would help performance for masking old versions too, right? The iterator would be able to know that it needs to mask versions to the more distant end key.
This sstable contains the point
Should range end boundaries also be allowed to have suffixes? The sstable on the right side would need an extra fragment that only covers the remainder of first sstable: points: a@100, a@30, b@100, b@40 ranges:
Responding to some of the comments here. We can start addressing the rest in the doc PR.
Yes. I think that is what we will need to handle
Yes. Though it would be nice if we didn't suffer a huge performance drop if every sstable had one range key.
It would be the user key part of the key probably. Just like we reserve parts of the key space for MVCC writes and another part for locks over those MVCC writes. The CockroachDB DeleteRange is an MVCC write, so belongs in the MVCC key space of CockroachDB.
Yes, they would be.
I don't know how much it will matter. It is possible that we've read a block in an sstable (at a different level) that is not at the top of the mergingIter heap (the same thing can occur when we do the seek optimization in mergingIter). Within a level the fragment in an sstable should correspond to the sstable bounds, so we should be able to omit blocks within the sstable.
That is a very good observation and I'd been debating whether it mattered before writing this up. In some sense at the MVCC level we have a similar problem to what RANGEDEL has at the user key level. But it may not be quite the same. Seeking to b@30, and not seeing [b, ImmediateSuccessor(b))@50, because the read is at < @50 is ok. And if it is doing a read >= @50, it will see [b, ImmediateSuccessor(b))@50 first. When operating in masking mode we will need to remember the masking ranges from a preceding sstable -- we can't rely on an sstable to be self contained. I haven't thought through all the implications on the iterator implementation.
There are difficulties here. For example, [b@30,c): how does it represent that bb@49 is masked? I think if we wanted to not have to look at the first sstable to decide what is masked in the second sstable, we could include [b,ImmediateSuccessor(b))@50 in the first sstable and also include [b,ImmediateSuccessor(b))@39 in the second sstable. It won't fully work because what if the point key in the second sstable is b@39? There is no version gap between 40 and 39 that we can use for a range key < 40 and > 39.
Range keys (see #1341) will be stored in their own, single block of an sstable. Add a new, optional meta block, indexed as "pebble.range_key" in the metablock index, to the sstable structure. This block is only present when at least one range key has been written to the sstable. Add the ability to add range keys to an sstable via `(*sstable.Writer).Write`. Update existing data-driven tests to support printing of the range key summary. Add additional test coverage demonstrating writing of range keys with an `sstable.Writer`. Add minimal functionality to `sstable.Reader` to support writing the data-driven test cases for the writer. Additional read-oriented functionality will be added in a subsequent patch. Related to #1339.
The `sstable.Layout` struct contains information pertaining to the layout of an sstable. Add the range key block to the layout. Related to #1339.
During a compaction, if the current sstable hits the file size limit, defer finishing the sstable if the next sstable would share a user key. This is the current behavior of flushes, and this change brings parity between the two. This change is motivated by the introduction of range keys (see #1339). This ensures we can always cleanly truncate range keys that span sstable boundaries. This commit also removes `(keyspan.Fragmenter).FlushTo`. Now that we prohibit splitting sstables in the middle of a user key, the Fragmenter's FlushTo function is unnecessary. Compactions and flushes always use the TruncateAndFlushTo variant. This change required a tweak to the way grandparent limits are applied, in order to switch the grandparent splitter's comparison into a >= comparison. This was necessary due to the shift in interpreting `splitterSuggestion`s as exclusive boundaries. Close #734.
As part of the MVCC Bulk Operations project, [Pebble range keys][1] will need to be enabled on _all_ engines of a cluster before nodes can start using the feature to read and write SSTables that contain the range key features (a backward-incompatible change). Adding a cluster version is necessary, but not sufficient, to guarantee that nodes are ready to generate, and importantly _receive_, SSTables with range key support. Specifically, there exists a race condition where nodes are told to update their engines as part of the existing Pebble major format update process, but there is no coordination between the nodes. One node (a sender) may complete its engine upgrade and enable the new SSTable features _before_ another node (the receiver). The latter will panic on receipt of an SSTable with the newer features written by the former. Add a server RPC endpoint that provides a means of waiting on a node to update its store to a version that is compatible with a cluster version. This endpoint is used as part of a system migration to ensure that all nodes in a cluster are running with an engine version that is at least compatible with a given cluster version. Expose the table format major version on `storage.Engine`. This will be used elsewhere in Cockroach (for example, SSTable generation for ingest and backup). Add a `WaitForCompatibleEngineVersion` function on the `storage.Engine` interface that provides a mechanism to block until an engine is running at a format major version that is compatible with a given cluster version. Expose the engine format major version as a `storage.TestingKnob` to allow tests to alter the Pebble format major version. Add a new cluster version to coordinate the upgrade of all engines in a cluster to `pebble.FormatBlockPropertyCollector` (Pebble,v1), and the system migration required for coordinating the engine upgrade to the latest Pebble table format version. This patch also fixes an existing issue where a node may write SSTables with block properties as part of a backup that are then ingested by an older node. This patch provides the infrastructure necessary for making these "cluster-external" SSTable operations engine-aware. Nodes should only use a table format version that other nodes in the cluster understand. Informs cockroachdb/pebble#1339. [1]: cockroachdb/pebble#1339 Release note: None
Introduce a new format major version for range keys, with associated table format `Pebble,v2`. When the DB is opened at this version, it is free to make use of range key features. Add various validations and assertions based on the new format major version. For example, if the DB version does not support range keys, ingesting tables with range keys will fail, etc. Related to #1339.
Thoughts on closing this out as 'feature complete'? I think the remainder of related work is tracked elsewhere.
Let's close it.
Describe design of range keys. Informs #1339 Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
TODO:
- `pebble.Batch` interface (batch: add RangeKey{Set,Unset,Delete} methods #1386)
- `internal/rangedel` package into a genericized `internal/keyspan` package (internal/keyspan: add Value to Span #1382, db: genericize rangeDel{Cache,Frags} into keySpan{Cache,Frags} #1356, internal/keyspan: rename from rangedel #1352)
- `*FileMetadata` and MANIFEST
- `(*sstable.Reader).EstimateDiskUsage` to take into account overlapping range key spans
- `sstable_writer.go` (in Cockroach) to be aware of cluster version (so they can infer pebble format major version, and hence table format version)
- `String()` methods for displaying table bounds
Pebble currently has only one kind of key that is associated with a range: `RANGEDEL [k1, k2)#seq`, where [k1, k2) is supplied by the caller, and is used to efficiently remove a set of point keys.
The detailed design for MVCC compliant bulk operations (high-level description; detailed design draft for DeleteRange in internal doc) is running into increasing complexity by placing range operations above the Pebble layer, such that Pebble sees these as points. The causes of the complexity are various: (a) which key (start or end) to anchor this range on, when represented as a point (there are performance consequences), (b) rewriting on CockroachDB range splits (and concerns about rewrite volume), (c) fragmentation on writes and the complexity thereof (and performance concerns for reads when not fragmenting), (d) inability to efficiently skip older MVCC versions that are masked by a `[k1,k2)@ts` (where ts is the MVCC timestamp).
First-class support for range keys in Pebble would eliminate all these issues. Additionally, it could allow for future extensions like efficient transactional range operations. This issue describes how this feature would work from the perspective of a user of Pebble (like CockroachDB), and sketches some implementation details.
1. Writes
There are two new operations:
- `Set([k1, k2), [optional suffix], <value>)`: This represents the mapping `[k1, k2)@suffix => value`.
- `Del([k1, k2), [optional suffix])`: The delete can use a smaller key range than the original Set, in which case part of the range is deleted. The optional suffix must match the original Set.
For example, consider `Set([a,d), foo)` (i.e., no suffix). If there is a later call `Del([b,c))`, the resulting state seen by a reader is `[a,b) => foo`, `[c,d) => foo`. Note that the value is not modified when the key is fragmented.
Overlapping Sets overwrite each other: consider `Set([a,d), foo)`, followed by `Set([c,e), bar)`. The resulting state is `[a,c) => foo`, `[c,e) => bar`.
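A minimal sketch of how these two operations might surface on a write batch; the `RangeKeySet`/`RangeKeyDelete` names and signatures below are illustrative assumptions, since this issue only pins down the abstract `Set`/`Del` semantics:

```go
// rangeKeyWriter is a hypothetical interface for the two write operations
// described above. start/end bound the range [start, end), suffix is the
// optional MVCC version, and value is opaque to Pebble.
type rangeKeyWriter interface {
	// RangeKeySet records the mapping [start, end)@suffix => value.
	RangeKeySet(start, end, suffix, value []byte) error
	// RangeKeyDelete removes the portion of a previously set range key that
	// overlaps [start, end); suffix must match the original Set.
	RangeKeyDelete(start, end, suffix []byte) error
}

// Example from the text: Set([a,d), foo) followed by Del([b,c)) leaves
// [a,b) => foo and [c,d) => foo visible to readers.
func writeExample(w rangeKeyWriter) error {
	if err := w.RangeKeySet([]byte("a"), []byte("d"), nil, []byte("foo")); err != nil {
		return err
	}
	return w.RangeKeyDelete([]byte("b"), []byte("c"), nil)
}
```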
Readers see the `<value>` for the whole range and iterate until they are past that k2. These range keys are subject to the LSM invariant: `Set([k1,k2))#s2` cannot be at a lower level than `Set(k)#s1` where `k \in [k1,k2)` and `s1 < s2`.
A user iterating over a key interval [k1,k2) can request:
2. Key Ordering, Iteration etc.
2.1 Non-MVCC keys, i.e., no suffix
Consider the following state in terms of user keys:
point keys: a, b, c, d, e, f
range key: [a,e)
A pebble.Iterator, which shows user keys, will output (during forward iteration), the following keys:
(a,[a,e)), (b,[b,e)), (c,[c,e)), (d,[d,e)), e
The notation (b,[b,e)) indicates both these keys and their corresponding values are visible at the same time when the iterator is at that position. This can be handled by having an iterator interface like `Key()` that also exposes the range key overlapping the current position.
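A sketch of what such a combined iterator interface could look like; the method names (`HasRange`, `RangeBounds`, `RangeValue`) are illustrative assumptions, not a settled API:

```go
// combinedIterator sketches an iterator that surfaces a point key together
// with any range key overlapping the current position.
type combinedIterator interface {
	// Key returns the point user key at the current position, e.g. b.
	Key() []byte
	// HasRange reports whether a range key overlaps the current position.
	HasRange() bool
	// RangeBounds returns the [start, end) bounds of that range key,
	// e.g. [b, e) when the iterator is positioned at (b, [b,e)).
	RangeBounds() (start, end []byte)
	// RangeValue returns the value associated with the range key.
	RangeValue() []byte
}
```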
Overwriting a range key is not a problem, since the one with the highest seqnum will win, just like for point keys.
2.2 Split points for sstables
Consider the actual seqnums assigned to these keys and say we have:
point keys: a#50, b#70, b#49, b#48, c#47, d#46, e#45, f#44
range key: [a,e)#60
We have created three versions of b in this example. Pebble currently can split output sstables during a compaction such that the different b versions span more than one sstable. This creates problems for RANGEDELs that span these two sstables, which are discussed in the section on improperly truncated RANGEDELs. We manage to tolerate this for RANGEDELs since their semantics are defined by the system, which is not true for these range keys, where the actual semantics are up to the user.
We will disallow such sstable split points if they can span a range key. In this example, by postponing the sstable split point to the user key c, we can cleanly split the range key into `[a,c)#60` and `[c,e)#60`. The sstable end bound for the first sstable (sstable bounds are inclusive) will be c#inf (where inf is the largest possible seqnum, which is unused except for these cases), and sstable start bound for the second sstable will be c#60.
Addendum: When we discuss keys with a suffix, i.e., MVCC keys, we discover additional difficulties (described below). The solution there is to require an `ImmediateSuccessor` function on the key prefix in order to enable range keys (here the prefix is the same as the user key). The example above is modified in the following way, when the sstables are split after b#49:
- first sstable: [a,b)#60, b#70, [b,ImmediateSuccessor(b))#60, b#49
- second sstable: [ImmediateSuccessor(b),e)#60, c#47, d#46, e#45, f#44
The key [b,ImmediateSuccessor(b)) behaves like a point key at b in terms of bounds, which means the end bound for the first sstable is b#49, which does not overlap with the start of the second sstable at b#48.
So we have two workable alternative solutions here. It is possible we will choose to not let the same key span sstables since it simplifies some existing code in Pebble too.
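A rough sketch of the first rule (never cut an sstable in the middle of a user key); the `prefix` argument is a hypothetical stand-in for the key prefix/Split function, and the real compaction logic is considerably more involved:

```go
import "bytes"

// pickSplitKey returns the index of the first key at or after limitIdx whose
// prefix differs from its predecessor's, so that an sstable is never cut in
// the middle of a user key. keys must be sorted.
func pickSplitKey(keys [][]byte, limitIdx int, prefix func([]byte) []byte) int {
	for i := limitIdx; i < len(keys); i++ {
		if i == 0 || !bytes.Equal(prefix(keys[i-1]), prefix(keys[i])) {
			return i // safe split point: the previous key has a different prefix
		}
	}
	return len(keys) // no safe split point; keep everything in one sstable
}
```

In the example above, with keys a#50, b#70, b#49, b#48, c#47, a size limit reached after b#70 would defer the split until c#47.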
2.3 Determinism of output
Range keys will be split based on boundaries of sstables in an LSM. We typically expect that two different LSMs with different sstable settings that receive the same writes should output the same key-value pairs when iterating. To provide this behavior, the iterator implementation could defragment range keys during iteration time. The defragmentation behavior would be:
The above defragmentation is conceptually simple, but hard to implement efficiently, since it requires stepping ahead from the current position to defragment range keys. This stepping ahead could switch sstables while there are still points to be consumed in a previous sstable. We note that application correctness cannot rely on determinism of fragments, and that this determinism is useful only for testing and verification purposes.
In short, we will not guarantee determinism of output.
TODO: think more about whether there is an efficient way to offer determinism.
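For illustration, a minimal sketch of iteration-time defragmentation, under the assumption that fragments merge only when they touch exactly and carry the same suffix and value:

```go
import "bytes"

// span is a simplified range key fragment [start, end) carrying a suffix and
// value; this is a sketch type, not Pebble's actual representation.
type span struct {
	start, end, suffix, value []byte
}

// defragment merges adjacent fragments that abut and have identical suffix
// and value. Input fragments must be sorted by start key and non-overlapping.
func defragment(frags []span) []span {
	var out []span
	for _, f := range frags {
		if n := len(out); n > 0 &&
			bytes.Equal(out[n-1].end, f.start) &&
			bytes.Equal(out[n-1].suffix, f.suffix) &&
			bytes.Equal(out[n-1].value, f.value) {
			out[n-1].end = f.end // extend the previous fragment
			continue
		}
		out = append(out, f)
	}
	return out
}
```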
2.4 MVCC keys, i.e., may have suffix
As a reminder, we will have range writes of the form `Set([k1, k2), <ver>, <value>)`. The prefix extraction applied to `k1+<ver>` and `k2+<ver>` will give us prefix k1 and k2 respectively (+ is simply concatenation). Point writes are of the form `k+<ver> => <value>` (in some cases `<ver>` is empty for these point writes), and their prefix is obtained from `k+<ver>` in the same way. The key ordering puts `k < k+<ver>` (when `<ver>` is non-empty); CockroachDB has adopted this behavior since it results in the following clean behavior: RANGEDEL over [k1, k2) deletes all versioned keys which have prefixes in the interval [k1, k2). Pebble will now require this behavior for anyone using MVCC keys.
Consider the following state in terms of user keys (the @Number part is the version):
point keys: a@100, a@30, b@100, b@40, b@30, c@40, d@40, e@30
range key: [a,e)@50
If we consider the user key ordering across the union of these keys, where we have used ' and '' to mark the start and end keys for the range key, we get:
a@100, a@50', a@30, b@100, b@40, b@30, c@40, d@40, e@50'', e@30
A pebble.Iterator, which shows user keys, will output (during forward iteration), the following keys:
a@100, [a,e)@50, a@30, b@100, [b,e)@50, b@40, b@30, [c,e)@50, c@40, [d,e)@50, d@40, e@30
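A toy comparison that captures the ordering requirements above (a bare prefix k sorts before any k@ver, and a@100 sorts before a@30); this is purely illustrative and not CockroachDB's actual key encoding:

```go
import "bytes"

// mvccKey is a toy decomposition of a user key into a prefix and an optional
// version, mirroring the Split behavior described above.
type mvccKey struct {
	prefix  []byte
	version uint64 // 0 means "no version", i.e. the bare prefix k
}

// compare orders keys as the design requires: keys are grouped by prefix, the
// bare prefix sorts before any versioned key, and among versioned keys a
// larger version sorts earlier (a@100 < a@50 < a@30).
func compare(a, b mvccKey) int {
	if c := bytes.Compare(a.prefix, b.prefix); c != 0 {
		return c
	}
	switch {
	case a.version == b.version:
		return 0
	case a.version == 0: // bare prefix sorts first
		return -1
	case b.version == 0:
		return 1
	case a.version > b.version: // newer versions sort earlier
		return -1
	default:
		return 1
	}
}
```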
2.5 Splitting MVCC keys across sstables
Splitting of an MVCC range key is always done at the key prefix boundary.
Consider the earlier example and let us allow the same prefix b to span multiple sstables. The two sstables would have the following user keys:
In practice the compaction code would need to know that the point key after b@30 has key prefix c, in order to fragment the range into [a,c)@50 for inclusion in the first sstable. This is messy wrt code, so a prerequisite for using range keys is to define an `ImmediateSuccessor` function on the key prefix. The compaction code can then include [a,ImmediateSuccessor(b))@50 in the first sstable and [ImmediateSuccessor(b),e)@50 in the second sstable. But this raises another problem: the end bound of the first sstable is ImmediateSuccessor(b)@50#inf and the start bound of the second sstable is b@30#s, where s is some seqnum. Such overlapping bounds are not permitted.
- Not have a key prefix span multiple files: This would disallow splitting after b@40, since b@30 remains. Requiring all point versions for an MVCC key to be in one sstable is dangerous given that histories can be maintained for a long time. This would also potentially increase write amplification. So we do not consider this further.
- Use the ImmediateSuccessor to make range keys behave as points, when necessary: Since the prefix b is spanning multiple sstables in this example, the range key would get split into [a,b)@50, [b,ImmediateSuccessor(b))@50, [ImmediateSuccessor(b),e)@50. Hence the end bound for the first sstable is b@40#s1, which does not overlap with the b@30#s2 start of the second sstable. If the second sstable starts on a new point key prefix we would use that prefix for splitting the range keys in the first sstable.
We adopt the second option. However, now it is possible for b@100#s2 to be at a lower LSM level than [a,e)@50#s1 where s1<s2. This is because the LSM invariant only works over user keys, and not over user key prefixes, and we have deliberately chosen this behavior.
Truncation of the range keys using the configured upper bound of the iterator can only be done if the configured upper bound is its own prefix. That is, we will not permit such iterators to have an upper bound like d@200. It has to be a prefix like d. This is to ensure the truncation is sound -- we are not permitting these range keys to have a different suffix for the start and end keys. If the user desires a tighter upper bound than what this rule permits, they can do the bounds checking themselves.
2.6 Strengthening RANGEDEL semantics
The current RANGEDEL behavior is defined over points and is oblivious to the prefix/suffix behavior of MVCC keys. The common case is a RANGEDEL of the form [b,d), i.e., no suffix. Such a RANGEDEL can be cleanly applied to range keys since it is not limited to a subset of versions. However, currently one can also construct RANGEDELs like [b@30,d@90), which are acceptable for bulk deletion of points, but semantically hard to handle when applied against a range key [a,e)@50. To deal with this we will keep the current RANGEDEL to apply to only point keys, and introduce RANGEDEL_PREFIX, which must have start and end keys that are equal to their prefix.
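A small sketch of the bare-prefix restriction that both the iterator upper-bound rule in 2.5 and RANGEDEL_PREFIX rely on; `split` stands in for the user-supplied Split/prefix function and the error text is illustrative:

```go
import (
	"bytes"
	"errors"
)

// checkBarePrefix enforces the rule that a bound used to truncate range keys
// must equal its own prefix (e.g. d is allowed, d@200 is not).
func checkBarePrefix(key []byte, split func([]byte) []byte) error {
	if !bytes.Equal(split(key), key) {
		return errors.New("bound must be a bare prefix with no version suffix")
	}
	return nil
}

// checkRangeDelPrefix applies the same rule to both ends of a RANGEDEL_PREFIX,
// rejecting bounds like [b@30,d@90).
func checkRangeDelPrefix(start, end []byte, split func([]byte) []byte) error {
	if err := checkBarePrefix(start, split); err != nil {
		return err
	}
	return checkBarePrefix(end, split)
}
```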
2.7 Efficient masking of old MVCC versions
Recollect that in the earlier example (before we discussed splitting) the pebble.Iterator would output (during forward iteration), the following keys:
a@100, [a,e)@50, a@30, b@100, [b,e)@50, b@40, b@30, [c,e)@50, c@40, [d,e)@50, d@40, e@30
It would be desirable if the caller could indicate a desire to skip over a@30, b@40, b@30, c@40, d@40, and this could be done efficiently. We will support iteration in this MVCC-masked mode, when specified as an iterator option. This iterator option would also need to specify the version such that the caller wants all values >= version.
To efficiently implement this we cannot rely on the LSM invariant, since b@100 can be at a lower level than [a,e)@50. However, we do have (a) per-block key bounds, and (b) for time-bound iteration purposes we will be maintaining block properties for the timestamp range (in CockroachDB). We will require that, for efficient masking, the user provide the name of the block property to utilize. In this example, when the iterator encounters [a,e)@50 it can skip blocks whose keys are guaranteed to be in [a,e) and whose timestamp interval is < 50.
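A sketch of that block-skipping decision, assuming a hypothetical per-block summary holding the key bounds and the maximum timestamp recorded by such a block property (field and function names are illustrative):

```go
import "bytes"

// blockMeta is a hypothetical per-block summary: the block's key bounds and
// the maximum timestamp reported by a user-defined block property collector.
type blockMeta struct {
	smallestKey, largestKey []byte
	maxTimestamp            uint64
}

// maskedByRangeKey reports whether an entire block can be skipped because all
// of its keys lie inside the masking range key [start, end) and every key in
// the block is older than the range key's timestamp.
func maskedByRangeKey(b blockMeta, start, end []byte, rangeKeyTS uint64) bool {
	inRange := bytes.Compare(b.smallestKey, start) >= 0 &&
		bytes.Compare(b.largestKey, end) < 0
	return inRange && b.maxTimestamp < rangeKeyTS
}
```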