kv: record non-MVCC operations to support ExportRequests and RangeFeed #62585
cc @dt
Regarding the logic about what data to return in the export request: I'm curious if it would suffice to just do a normal (non-timebound) scan of the data at the StartTime over the span of the SST. If there were deletes in the SST that were not then covered by writes, that would be a problem, but I'm not sure that ever happens.
I think we want to do a time-bound scan from the SST's min time since we ingest wide, covering SSTs all the time in index backfills. Doing a full non-TBI scan could then emit lots of extra keys vs just those in the SST's time span? As long as we're persisting the SST's key bounds, persisting its time bounds should be easy -- we already have them in the table props where we can read them during application in O(1) without iterating. Then we need to pass those as a start_time skyline to the inc iterator.
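A rough sketch of the bookkeeping and catch-up bounds being described here, assuming the SST's time bounds are already available from its table properties at application time; the types and names below are stand-ins, not actual CockroachDB APIs:

```go
package sketch

// Illustrative only: real code would use hlc.Timestamp, roachpb.Span, and the
// SST's table properties rather than these stand-in types.

// addSSTableRecord is the bookkeeping persisted when an AddSSTable is applied.
type addSSTableRecord struct {
	appliedAtNanos int64  // timestamp at which the AddSSTable was evaluated
	startKey       []byte // key bounds of the ingested SST
	endKey         []byte
	minKeyNanos    int64 // smallest MVCC timestamp of any key in the SST
	maxKeyNanos    int64 // largest MVCC timestamp of any key in the SST
}

// recordAddSSTable builds the record from information available in O(1): the
// request's key bounds plus the min/max timestamps already stored in the SST's
// table properties, so no iteration over the SST is needed.
func recordAddSSTable(appliedAtNanos int64, startKey, endKey []byte, minKeyNanos, maxKeyNanos int64) addSSTableRecord {
	return addSSTableRecord{
		appliedAtNanos: appliedAtNanos,
		startKey:       startKey,
		endKey:         endKey,
		minKeyNanos:    minKeyNanos,
		maxKeyNanos:    maxKeyNanos,
	}
}

// catchUpStart is the lower time bound a later time-bound scan over the SST's
// span would use so that it does not miss keys the SST wrote below the normal
// incremental start time.
func catchUpStart(incStartNanos int64, rec addSSTableRecord) int64 {
	if rec.minKeyNanos < incStartNanos {
		return rec.minKeyNanos
	}
	return incStartNanos
}
```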
@dt now this has a fun interaction with #64023. Namely, we want the timestamp of the AddSSTable request to be above the closed timestamp but the read timestamp we're applying in that change is definitely not going to have that property. I think all of this points to wanting another timestamp in these requests such that the AddSSTable is said to be proposed at "now" but somewhere else we stash a timestamp which we ensure is above the GC threshold (or TTL).
64323: changefeedccl: fail changefeeds when tables go offline r=miretskiy a=stevendanna

In 20.2.4, a changefeed would fail if IMPORT INTO was run against one of its target tables. The failure would look like:

```
I210428 10:45:57.982012 2015 jobs/registry.go:1131 ⋮ [n1] CHANGEFEED job 653840730248282113: stepping through state failed with error: ‹relation› ‹"test"› is offline: ‹importing›
(1) Wraps: (2) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/sql/catalog.FilterDescriptorState
  |   /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/descriptor.go:387
  | github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.storage.acquire.func1
  |   /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:219
  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn.func1
  |   /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:758
  | github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec
  |   /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:808
  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn
  |   /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:757
  | github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.storage.acquire
  |   /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:193
  | github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.acquireNodeLease.func1
  |   /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:859
  | github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).doCall
  |   /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:128
  | runtime.goexit
  |   /usr/local/Cellar/go/1.16.3/libexec/src/runtime/asm_amd64.s:1371
Wraps: (3) ‹relation› ‹"test"› is offline: ‹importing›
Error types: (1) *catalog.inactiveDescriptorError (2) *withstack.withStack (3) *errutil.leafError
```

We want the changefeed to fail when IMPORT INTO is run because changes made via the AddSSTable mechanism are not currently reflected in the changefeed, meaning we would fail to emit imported data.

The previous path that raised this failure depended on:

1) The descriptor being offline at the point we attempted to acquire a lease on it: https://github.com/cockroachdb/cockroach/blob/d1962910b58005096ce411bccbaddcd0c1d30cbd/pkg/ccl/changefeedccl/schemafeed/schema_feed.go#L514
2) The lease acquisition filtering out offline descriptors with an error: https://github.com/cockroachdb/cockroach/blob/eda2309728392593162e962a61182eab6ab003ff/pkg/sql/catalog/descriptor.go#L209
3) The failure from the lease acquisition in the schemafeed being treated as a fatal error.

I believe our behaviour here has changed a few times on both the 20.2 branch and master because of changes in each of these 3 behaviours. In this change, rather than relying on the lease acquisition, we specifically check for offline tables in our ValidateTable function. This function is called for every descriptor version we get from the ExportRequest on the Descriptor table. Currently, I believe that checking for the offline descriptors is correct since it appears that only restore and import put tables into an offline state.

Release note (enterprise change): CHANGEFEEDs more reliably fail when IMPORT INTO is run against a targeted table.

Fixes #64276

See also #62585, #43784

Co-authored-by: Steven Danna <danna@cockroachlabs.com>
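A minimal sketch of the kind of check described above, with stand-in types rather than the real catalog descriptor and schemafeed plumbing (the actual ValidateTable signature and error types are not reproduced here):

```go
package sketch

import "fmt"

// Hypothetical stand-ins for the catalog's descriptor states; the real
// schemafeed works with the catalog's table descriptors and structured errors.
type descriptorState int

const (
	statePublic descriptorState = iota
	stateOffline
	stateDropped
)

type tableDescriptor struct {
	name          string
	state         descriptorState
	offlineReason string // e.g. "importing" or "restoring"
}

// validateTable mirrors the approach described above: rather than relying on a
// lease acquisition to reject offline descriptors, each descriptor version the
// schemafeed sees is checked directly, and an offline table fails the changefeed.
func validateTable(desc tableDescriptor) error {
	if desc.state == stateOffline {
		// IMPORT INTO and RESTORE take tables offline; data they ingest via
		// AddSSTable is not emitted by the changefeed, so fail fast.
		return fmt.Errorf("relation %q is offline: %s", desc.name, desc.offlineReason)
	}
	return nil
}
```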
Thanks for writing this up, @ajwerner. There's a lot to unpack in here, but I think this is moving in the right direction. The general idea to store a versioned sequence of bulk operations on a Range allows us to start rationalizing interactions between these bulk operations and other components (rangefeed, closed timestamps, export, etc.) in the KV layer.

Interactions with closed timestamps, in particular, have been concerning me recently, as we have no way to express that a table/index should not be brought online until the closed timestamps in its key range can be trusted and won't be violated by AddSSTable operations (whose writes do not respect the closed timestamp) that are still replicating to some follower replicas. But with this scheme, we could ensure that the timestamp of the bulk operation itself respects the closed timestamp. So we could say that a table/index can only be brought online after the timestamp of its bulk operations. This would prevent follower reads served shortly after a schema change from missing some of the effects of the schema change.

I'm also happy to see that we filed #65128 after our discussion last week. In my mind, these are very related issues. A versioned sequence of bulk operations allows us to start talking about idempotency and about replay protection. For instance, if we assign a timestamp to each bulk operation then a replay can be easily detected and rejected. Similarly, if we assign a timestamp to each bulk operation then we can box out delayed application on a rollback by writing no-op guards at the maximum timestamp that the bulk operation could have been performed at. An approach like this seems safer than relying on clock synchronization (#65128 (comment)), which we are ok with for very specific cases with known failure modes (i.e. stale reads) but probably shouldn't be ok with for situations that could result in more severe failure modes like data corruption.
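A toy sketch of the timestamp-based replay protection idea sketched in that last paragraph — an in-memory log standing in for the replicated range state, with hypothetical names:

```go
package sketch

import (
	"errors"
	"sort"
)

// bulkOpRecord is a toy record of a bulk operation, ordered by timestamp.
// Real records would live in the replicated range state and use hlc.Timestamp.
type bulkOpRecord struct {
	ts   int64  // timestamp assigned to the bulk operation
	opID string // which operation wrote this record ("" for a no-op guard)
	noop bool   // a guard written on rollback to box out late replays
}

type bulkOpLog struct {
	records []bulkOpRecord // kept sorted by ts
}

var errReplay = errors.New("bulk operation replayed or arrived after a rollback guard")

// apply admits an operation only if nothing at or above its timestamp has
// already been recorded; otherwise it is a replay, or it lost a race with a
// rollback's no-op guard, and is rejected.
func (l *bulkOpLog) apply(opID string, ts int64) error {
	i := sort.Search(len(l.records), func(i int) bool { return l.records[i].ts >= ts })
	if i < len(l.records) {
		return errReplay
	}
	l.records = append(l.records, bulkOpRecord{ts: ts, opID: opID})
	return nil
}

// rollback writes a no-op guard at the maximum timestamp the operation could
// have been performed at, so a delayed application cannot slip in afterwards.
func (l *bulkOpLog) rollback(maxTS int64) {
	l.records = append(l.records, bulkOpRecord{ts: maxTS, noop: true})
	sort.Slice(l.records, func(i, j int) bool { return l.records[i].ts < l.records[j].ts })
}
```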
I assume these records will be stored in the range-ID local keyspace. In other words, since they aren't bound to a specific key, they need to be associated with the entire range. This also means that they need to be copied onto both sides of a Range split.
The use of range-level metadata to dictate the behavior of an MVCC scan reminds me of #16294.
I don't know about that. While they aren't associated with a key, they are associated with a span. My sense is that thinking through split and merge behavior will be handy. In my straw-man vision for this, I'd suspect that on split we'd scan the entire key span recording these events and copy out metadata that covers the right-hand side to the right-hand side. I don't think I feel the same way about merging state back together (which smells).
Range-ID local keys don't necessarily disappear during a range merge. For instance, we copy over the abort span from the RHS of a merge to the LHS. So we could encode the desired splitting/merging logic regardless of whether we associate these operations with keys (range local) or not (range-ID local). If we did associate these operations with keys, I guess we'd store them at their start key and encode their end key in their value.

The other consideration is whether these operations should be indexed by their timestamp. This would be important if we want some kind of monotonicity property for writes to this log. Indexing by timestamp could be done by either including a timestamp in the keys manually or by using MVCC keys. I'm not sure whether there's precedent for MVCC range-ID local keys, but I don't think there's anything fundamentally wrong with them.
As a first stab, I'm thinking we'll use range local keys keyed by timestamp and start key (in that order), with the end key in the metadata value. Range local seems to make sense since they're associated with a key (span). Keying primarily by timestamp should be more efficient because MVCC reads have to check for non-MVCC operations in their future and most reads happen in the recent past -- this way, we can skip more keys than if we'd index by start key first and scan all historical non-MVCC ops that intersect the read span.
While MVCC keys have the same structure (key and timestamp), we'd want to encode them differently (timestamp first). I also think using MVCC keys to represent non-MVCC operations would just get confusing since their semantics are so different.
Another point in favor of timestamp-first indexing: many non-MVCC ops have very wide spans (in many ClearRange cases it is the range's entire span), so span intersection might not be very selective, if at all, compared to time-bounds.
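A small sketch of the read-path benefit of timestamp-first ordering, using toy types (the real records would be range-local keys and the lookup would be a storage seek rather than a slice search):

```go
package sketch

import "sort"

// nonMVCCOp is a toy record of a non-MVCC operation, kept in timestamp order.
type nonMVCCOp struct {
	ts               int64  // timestamp at which the operation was applied
	startKey, endKey string // span the operation covered
}

// opsAffectingRead returns the operations a read at readTS must know about:
// only those in the read's future. Because records are ordered timestamp-first,
// a single seek to readTS skips all older history, and span intersection is
// only checked on the (usually small) suffix of recent records.
func opsAffectingRead(ops []nonMVCCOp, readTS int64, startKey, endKey string) []nonMVCCOp {
	i := sort.Search(len(ops), func(i int) bool { return ops[i].ts > readTS })
	var out []nonMVCCOp
	for _, op := range ops[i:] {
		if op.startKey < endKey && startKey < op.endKey { // spans overlap
			out = append(out, op)
		}
	}
	return out
}
```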
Replaced by #70429.
Is your feature request related to a problem? Please describe.
The KV layer has two non-MVCC operations which can affect range data: `AddSSTable` and `ClearRange`. The SQL layer, in general, goes through great pains to ensure that readers do not read under these operations. These operations, however, have a bad interaction with physical incremental backups and newer streaming asynchronous replication.

The problem for incremental backups, in particular, is `AddSSTable`. The problem with `AddSSTable` is that it can rewrite history. In particular, it is possible for an `AddSSTable` to write keys at timestamps which fall before the end time of the previous incremental backup. These keys will not be picked up by the next incremental. This turns out to be a big correctness problem for indexes which are backed up while they are being built (#62564). To mitigate this correctness problem in the short term, we are going to stop including indexes which are non-public in incremental backups. This will be problematic for use cases where the incremental window in use is short relative to the duration of index backfills.

Streaming replication, based on `RangeFeed`, has problems with both of these requests. In particular, when we do an `AddSSTable`, we end up restarting the `RangeFeed`, but the catch-up scan will be from the last resolved timestamp. Any writes in the SST from before that timestamp will not be seen. `ClearRange` is effectively invisible to the `RangeFeed`.

Given the importance both of streaming replication and of incremental backups, we need to do something here.
Describe the solution you'd like
The thrust of this proposal is that we should do some bookkeeping in the replicated range state to note when these non-MVCC operations have occurred. We'll use this information to drive correct behavior during `ExportRequest` with time bounds and during `RangeFeed` catch-up scans.

This issue proposes that we add a new range-local, replicated prefix under which we record, keyed on the timestamp at which non-MVCC operations occurred, the relevant spans they covered, the operation, and other metadata.
The sketch of the keyspace layout will be something like:
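As a rough stand-in (the prefix name, encoding, and value fields below are assumptions, not the actual layout):

```go
package sketch

// Illustrative layout only:
//
//   <range-local replicated prefix>/NonMVCCOp/<timestamp>
//       -> { span, kind (AddSSTable | ClearRange), min SST key timestamp, ... }

// encodeNonMVCCOpKey appends a big-endian timestamp to the prefix so that the
// records sort (and can be sought) by time.
func encodeNonMVCCOpKey(prefix []byte, tsWallNanos uint64) []byte {
	key := append([]byte(nil), prefix...)
	for shift := 56; shift >= 0; shift -= 8 {
		key = append(key, byte(tsWallNanos>>uint(shift)))
	}
	return key
}

// nonMVCCOpValue is the metadata stored under that key.
type nonMVCCOpValue struct {
	startKey, endKey   []byte // span the operation covered
	kind               string // "AddSSTable" or "ClearRange"
	minSSTKeyWallNanos uint64 // for AddSSTable: smallest key timestamp in the SST
}
```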
The keys which will be used in this keyspace will have the above prefix and a timestamp. This timestamp represents the timestamp at which these operations occurred. I believe this is safe because the server assigns the timestamp and these operations require a single request in the batch [1].
Inside the values we'll store metadata. For a `ClearRange` we just need the span. For an `AddSSTable`, we'll want both the span and, at least, the minimum timestamp of any key in the SST. We'll use this information to decide what to make visible to a time-bound iteration. Namely, in addition to the normal keys we'd expose to a time-bound iteration, we should expose all keys in time bounds and spans that may have been covered by `AddSSTable` events which occurred within those time bounds. I'm not totally clear on the interaction of `ClearRange` with incremental backups, but I do think that we'll need to at least know about them for streaming replication, and they naturally fit in the same model. To account for streaming replication, we will probably want to add some new `RangeFeed` message.

We can GC these records by their timestamp. Today we require for both catch-up scans and export requests that the entire time interval fall above the GC threshold. A related note is that when reading the values in some span and time interval due to an `AddSSTable`, if the oldest key in the SST is older than the GC threshold, it should be safe to perform the read at the GC threshold. In fact, limiting the lower bound on the timestamp is, in a sense, an optimization.
All of this, however, does sort of redefine what these lower bounds mean on these various requests. In order to clarify the behavior, all of this interaction with non-MVCC operations should be opt-in. In fact, normal `RangeFeed`s which do not opt into this behavior should instead fail with an error if such a non-MVCC operation is encountered.
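A sketch of how a reader (an `ExportRequest` or a `RangeFeed` catch-up scan) might act on these records, combining the visibility rule, the GC-threshold clamp, and the opt-in behavior described above; all names and types are hypothetical:

```go
package sketch

import "errors"

// nonMVCCOpRecord is the persisted record described above (hypothetical shape).
type nonMVCCOpRecord struct {
	ts                 int64  // when the non-MVCC operation occurred
	startKey, endKey   string // span it covered
	isAddSSTable       bool
	minSSTKeyTimestamp int64 // for AddSSTable: oldest key written by the SST
}

var errNonMVCCOp = errors.New("non-MVCC operation within the requested time bounds")

// extraScan describes an additional (span, start-time) the reader must visit
// beyond its normal time-bound scan.
type extraScan struct {
	startKey, endKey string
	fromTS           int64
}

// planTimeBoundRead inspects records whose timestamps fall in (startTS, endTS]
// and overlap [startKey, endKey). Readers that have not opted in fail with an
// error; opted-in readers get back the extra spans they must re-read, with the
// lower bound clamped to the GC threshold when the SST's oldest key is below it.
func planTimeBoundRead(
	recs []nonMVCCOpRecord, startKey, endKey string, startTS, endTS, gcThreshold int64, optedIn bool,
) ([]extraScan, error) {
	var extra []extraScan
	for _, r := range recs {
		if r.ts <= startTS || r.ts > endTS {
			continue // outside the requested time bounds
		}
		if !(r.startKey < endKey && startKey < r.endKey) {
			continue // does not overlap the requested span
		}
		if !optedIn {
			return nil, errNonMVCCOp
		}
		from := startTS
		if r.isAddSSTable && r.minSSTKeyTimestamp < from {
			from = r.minSSTKeyTimestamp
		}
		if from < gcThreshold {
			from = gcThreshold // safe to read at the GC threshold instead
		}
		extra = append(extra, extraScan{startKey: r.startKey, endKey: r.endKey, fromTS: from})
	}
	return extra, nil
}
```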
Describe alternatives you've considered

There was a proposal to try to force the SST timestamps to fall before the closed timestamp but that's a tough sell for a variety of reasons.
Additional context
This tracking may also enable mitigation for #31563.
Update: [1] We used to never assign a timestamp to the batch header for AddSSTable but we've very recently (#64023) started to set that to a timestamp at which the relevant read occurred. This is important to ensure that a delete tombstone is not lost to GC, erroneously revealing deleted data.
Epic: CRDB-2624