kv: record non-MVCC operations to support ExportRequests and RangeFeed #62585
cc @dt
Regarding the logic about what data to return in the export request: I'm curious if it would suffice to just do a normal (non-timebound) scan of the data at the StartTime over the span of the SST. If there were deletes in the SST that were not then covered by writes, that would be a problem, but I'm not sure that ever happens.
I think we want to do a time-bound scan from the SST's min time since we ingest wide, covering SSTs all the time in index backfills. Doing a full non-TBI scan could then emit lots of extra keys vs just those in the SST's time span? As long as we're persisting the SST's key bounds, persisting its time bounds should be easy -- we already have them in the table props where we can read them during application in O(1) without iterating. Then we need to pass those as a start_time skyline to the inc iterator.
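A rough sketch of the bookkeeping and catch-up bounds being described here, assuming the SST's time bounds are already available from its table properties at application time; the types and names below are stand-ins, not actual CockroachDB APIs:

```go
package sketch

// Illustrative only: real code would use hlc.Timestamp, roachpb.Span, and the
// SST's table properties rather than these stand-in types.

// addSSTableRecord is the bookkeeping persisted when an AddSSTable is applied.
type addSSTableRecord struct {
	appliedAtNanos int64  // timestamp at which the AddSSTable was evaluated
	startKey       []byte // key bounds of the ingested SST
	endKey         []byte
	minKeyNanos    int64 // smallest MVCC timestamp of any key in the SST
	maxKeyNanos    int64 // largest MVCC timestamp of any key in the SST
}

// recordAddSSTable builds the record from information available in O(1): the
// request's key bounds plus the min/max timestamps already stored in the SST's
// table properties, so no iteration over the SST is needed.
func recordAddSSTable(appliedAtNanos int64, startKey, endKey []byte, minKeyNanos, maxKeyNanos int64) addSSTableRecord {
	return addSSTableRecord{
		appliedAtNanos: appliedAtNanos,
		startKey:       startKey,
		endKey:         endKey,
		minKeyNanos:    minKeyNanos,
		maxKeyNanos:    maxKeyNanos,
	}
}

// catchUpStart is the lower time bound a later time-bound scan over the SST's
// span would use so that it does not miss keys the SST wrote below the normal
// incremental start time.
func catchUpStart(incStartNanos int64, rec addSSTableRecord) int64 {
	if rec.minKeyNanos < incStartNanos {
		return rec.minKeyNanos
	}
	return incStartNanos
}
```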
@dt now this has a fun interaction with #64023. Namely, we want the timestamp of the AddSSTable request to be above the closed timestamp but the read timestamp we're applying in that change is definitely not going to have that property. I think all of this points to wanting another timestamp in these requests such that the AddSSTable is said to be proposed at "now" but somewhere else we stash a timestamp which we ensure is above the GC threshold (or TTL).
64323: changefeedccl: fail changefeeds when tables go offline r=miretskiy a=stevendanna

In 20.2.4, a changefeed would fail if IMPORT INTO was run against one of its target tables. The failure would look like:

```
I210428 10:45:57.982012 2015 jobs/registry.go:1131 ⋮ [n1] CHANGEFEED job 653840730248282113: stepping through state failed with error: ‹relation› ‹"test"› is offline: ‹importing›
(1) Wraps: (2) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/sql/catalog.FilterDescriptorState
  |   /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/descriptor.go:387
  | github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.storage.acquire.func1
  |   /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:219
  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn.func1
  |   /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:758
  | github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec
  |   /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:808
  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn
  |   /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:757
  | github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.storage.acquire
  |   /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:193
  | github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.acquireNodeLease.func1
  |   /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:859
  | github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight.(*Group).doCall
  |   /Users/ssd/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/singleflight/singleflight.go:128
  | runtime.goexit
  |   /usr/local/Cellar/go/1.16.3/libexec/src/runtime/asm_amd64.s:1371
Wraps: (3) ‹relation› ‹"test"› is offline: ‹importing›
Error types: (1) *catalog.inactiveDescriptorError (2) *withstack.withStack (3) *errutil.leafError
```

We want the changefeed to fail when IMPORT INTO is run because changes made via the AddSSTable mechanism are not currently reflected in the changefeed, meaning we would fail to emit imported data.

The previous path that raised this failure depended on:

1) The descriptor being offline at the point we attempted to acquire a lease on it: https://github.com/cockroachdb/cockroach/blob/d1962910b58005096ce411bccbaddcd0c1d30cbd/pkg/ccl/changefeedccl/schemafeed/schema_feed.go#L514
2) The lease acquisition filtering out offline descriptors with an error: https://github.com/cockroachdb/cockroach/blob/eda2309728392593162e962a61182eab6ab003ff/pkg/sql/catalog/descriptor.go#L209
3) The failure from the lease acquisition in the schemafeed being treated as a fatal error.

I believe our behaviour here has changed a few times on both the 20.2 branch and master because of changes in each of these 3 behaviours. In this change, rather than relying on the lease acquisition, we specifically check for offline tables in our ValidateTable function. This function is called for every descriptor version we get from the ExportRequest on the Descriptor table. Currently, I believe that checking for the offline descriptors is correct since it appears that only restore and import put tables into an offline state.

Release note (enterprise change): CHANGEFEEDs more reliably fail when IMPORT INTO is run against a targeted table.

Fixes #64276

See also #62585, #43784

Co-authored-by: Steven Danna <danna@cockroachlabs.com>
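A minimal sketch of the kind of check described above, with stand-in types rather than the real catalog descriptor and schemafeed plumbing (the actual ValidateTable signature and error types are not reproduced here):

```go
package sketch

import "fmt"

// Hypothetical stand-ins for the catalog's descriptor states; the real
// schemafeed works with the catalog's table descriptors and structured errors.
type descriptorState int

const (
	statePublic descriptorState = iota
	stateOffline
	stateDropped
)

type tableDescriptor struct {
	name          string
	state         descriptorState
	offlineReason string // e.g. "importing" or "restoring"
}

// validateTable mirrors the approach described above: rather than relying on a
// lease acquisition to reject offline descriptors, each descriptor version the
// schemafeed sees is checked directly, and an offline table fails the changefeed.
func validateTable(desc tableDescriptor) error {
	if desc.state == stateOffline {
		// IMPORT INTO and RESTORE take tables offline; data they ingest via
		// AddSSTable is not emitted by the changefeed, so fail fast.
		return fmt.Errorf("relation %q is offline: %s", desc.name, desc.offlineReason)
	}
	return nil
}
```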
Thanks for writing this up, @ajwerner. There's a lot to unpack in here, but I think this is moving in the right direction. The general idea to store a versioned sequence of bulk operations on a Range allows us to start rationalizing interactions between these bulk operations and other components (rangefeed, closed timestamps, export, etc.) in the KV layer.

Interactions with closed timestamps, in particular, have been concerning me recently, as we have no way to express that a table/index should not be brought online until the closed timestamps in its key range can be trusted and won't be violated by AddSSTable operations (whose writes do not respect the closed timestamp) that are still replicating to some follower replicas. But with this scheme, we could ensure that the timestamp of the bulk operation itself respects the closed timestamp. So we could say that a table/index can only be brought online after the timestamp of its bulk operations. This would prevent follower reads served shortly after a schema change from missing some of the effects of the schema change.

I'm also happy to see that we filed #65128 after our discussion last week. In my mind, these are very related issues. A versioned sequence of bulk operations allows us to start talking about idempotency and about replay protection. For instance, if we assign a timestamp to each bulk operation then a replay can be easily detected and rejected. Similarly, if we assign a timestamp to each bulk operation then we can box out delayed application on a rollback by writing no-op guards at the maximum timestamp that the bulk operation could have been performed at. An approach like this seems safer than relying on clock synchronization (#65128 (comment)), which we are ok with for very specific cases with known failure modes (i.e. stale reads) but probably shouldn't be ok with for situations that could result in more severe failure modes like data corruption.
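A toy sketch of the timestamp-based replay protection idea sketched in that last paragraph — an in-memory log standing in for the replicated range state, with hypothetical names:

```go
package sketch

import (
	"errors"
	"sort"
)

// bulkOpRecord is a toy record of a bulk operation, ordered by timestamp.
// Real records would live in the replicated range state and use hlc.Timestamp.
type bulkOpRecord struct {
	ts   int64  // timestamp assigned to the bulk operation
	opID string // which operation wrote this record ("" for a no-op guard)
	noop bool   // a guard written on rollback to box out late replays
}

type bulkOpLog struct {
	records []bulkOpRecord // kept sorted by ts
}

var errReplay = errors.New("bulk operation replayed or arrived after a rollback guard")

// apply admits an operation only if nothing at or above its timestamp has
// already been recorded; otherwise it is a replay, or it lost a race with a
// rollback's no-op guard, and is rejected.
func (l *bulkOpLog) apply(opID string, ts int64) error {
	i := sort.Search(len(l.records), func(i int) bool { return l.records[i].ts >= ts })
	if i < len(l.records) {
		return errReplay
	}
	l.records = append(l.records, bulkOpRecord{ts: ts, opID: opID})
	return nil
}

// rollback writes a no-op guard at the maximum timestamp the operation could
// have been performed at, so a delayed application cannot slip in afterwards.
func (l *bulkOpLog) rollback(maxTS int64) {
	l.records = append(l.records, bulkOpRecord{ts: maxTS, noop: true})
	sort.Slice(l.records, func(i, j int) bool { return l.records[i].ts < l.records[j].ts })
}
```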
I assume these records will be stored in the range-ID local keyspace. In other words, since they aren't bound to a specific key, they need to be associated with the entire range. This also means that they need to be copied onto both sides of a Range split.
The use of range-level metadata to dictate the behavior of an MVCC scan reminds me of #16294.
I don't know about that. While they aren't associated with a key, they are associated with a span. My sense is that thinking through split and merge behavior will be handy. In my straw-man vision for this, I'd suspect that on split we'd scan the entire key span recording these events and copy out metadata that covers the right-hand side to the right-hand side. I don't think I feel the same way about merging state back together (which smells).
Range-ID local keys don't necessarily disappear during a range merge. For instance, we copy over the abort span from the RHS of a merge to the LHS. So we could encode the desired splitting/merging logic regardless of whether we associate these operations with keys (range local) or not (range-ID local). If we did associate these operations with keys, I guess we'd store them at their start key and encode their end key in their value.

The other consideration is whether these operations should be indexed by their timestamp. This would be important if we want some kind of monotonicity property for writes to this log. Indexing by timestamp could be done by either including a timestamp in the keys manually or by using MVCC keys. I'm not sure whether there's precedent for MVCC range-ID local keys, but I don't think there's anything fundamentally wrong with them.
As a first stab, I'm thinking we'll use range local keys keyed by timestamp and start key (in that order), with the end key in the metadata value. Range local seems to make sense since they're associated with a key (span). Keying primarily by timestamp should be more efficient because MVCC reads have to check for non-MVCC operations in their future and most reads happen in the recent past -- this way, we can skip more keys than if we'd index by start key first and scan all historical non-MVCC ops that intersect the read span.
While MVCC keys have the same structure (key and timestamp), we'd want to encode them differently (timestamp first). I also think using MVCC keys to represent non-MVCC operations would just get confusing since their semantics are so different.
Another point in favor of timestamp-first indexing: many non-MVCC ops have very wide spans (in many ClearRange cases it is the range's entire span), so span intersection might not be very selective, if at all, compared to time-bounds.
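A small sketch of the read-path benefit of timestamp-first ordering, using toy types (the real records would be range-local keys and the lookup would be a storage seek rather than a slice search):

```go
package sketch

import "sort"

// nonMVCCOp is a toy record of a non-MVCC operation, kept in timestamp order.
type nonMVCCOp struct {
	ts               int64  // timestamp at which the operation was applied
	startKey, endKey string // span the operation covered
}

// opsAffectingRead returns the operations a read at readTS must know about:
// only those in the read's future. Because records are ordered timestamp-first,
// a single seek to readTS skips all older history, and span intersection is
// only checked on the (usually small) suffix of recent records.
func opsAffectingRead(ops []nonMVCCOp, readTS int64, startKey, endKey string) []nonMVCCOp {
	i := sort.Search(len(ops), func(i int) bool { return ops[i].ts > readTS })
	var out []nonMVCCOp
	for _, op := range ops[i:] {
		if op.startKey < endKey && startKey < op.endKey { // spans overlap
			out = append(out, op)
		}
	}
	return out
}
```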
Replaced by #70429.
Is your feature request related to a problem? Please describe.
The KV layer has two non-MVCC operations which can affect range data: `AddSSTable` and `ClearRange`. The SQL layer, in general, goes through great pains to ensure that readers do not read under these operations. These operations, however, have a bad interaction with physical incremental backups and newer streaming asynchronous replication.

The problem for incremental backups, in particular, is `AddSSTable`. The problem with `AddSSTable` is that it can rewrite history. In particular, it is possible for an `AddSSTable` to write keys at timestamps which fall before the end time of the previous incremental backup. These keys will not be picked up by the next incremental. This turns out to be a big correctness problem for indexes which are backed up while they are being built (#62564). To mitigate this correctness problem in the short term, we are going to stop including indexes which are non-public in incremental backups. This will be problematic for use cases where the incremental window in use is short relative to the duration of index backfills.

Streaming replication, based on `RangeFeed`, has problems with both of these requests. In particular, when we do an `AddSSTable`, we end up restarting the `RangeFeed`, but the catch-up scan will be from the last resolved timestamp. Any writes in the SST from before that timestamp will not be seen. `ClearRange` is effectively invisible to the `RangeFeed`.

Given the importance both of streaming replication and of incremental backups, we need to do something here.
Describe the solution you'd like
The thrust of this proposal is that we should do some bookkeeping in the replicated range state to note when these non-MVCC operations have occurred. We'll use this information to drive correct behavior during `ExportRequest` with time bounds and during `RangeFeed` catch-up scans.

This issue proposes that we add a new range-local, replicated prefix under which we record, keyed on the timestamp at which non-MVCC operations occurred, the relevant spans they covered, the operation, and other metadata.
The sketch of the keyspace layout will be something like:
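As a rough stand-in (the prefix name, encoding, and value fields below are assumptions, not the actual layout):

```go
package sketch

// Illustrative layout only:
//
//   <range-local replicated prefix>/NonMVCCOp/<timestamp>
//       -> { span, kind (AddSSTable | ClearRange), min SST key timestamp, ... }

// encodeNonMVCCOpKey appends a big-endian timestamp to the prefix so that the
// records sort (and can be sought) by time.
func encodeNonMVCCOpKey(prefix []byte, tsWallNanos uint64) []byte {
	key := append([]byte(nil), prefix...)
	for shift := 56; shift >= 0; shift -= 8 {
		key = append(key, byte(tsWallNanos>>uint(shift)))
	}
	return key
}

// nonMVCCOpValue is the metadata stored under that key.
type nonMVCCOpValue struct {
	startKey, endKey   []byte // span the operation covered
	kind               string // "AddSSTable" or "ClearRange"
	minSSTKeyWallNanos uint64 // for AddSSTable: smallest key timestamp in the SST
}
```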
The keys which will be used in this keyspace will have the above prefix and a timestamp. This timestamp represents the timestamp at which these operations occurred. I believe this is safe because the server assigns the timestamp and these operations require a single request in the batch [1].
Inside the values we'll store metadata. For a `ClearRange` we just need the span. For an `AddSSTable`, we'll want both the span and, at least, the minimum timestamp of any key in the SST. We'll use this information to decide what to make visible to a time-bound iteration. Namely, in addition to the normal keys we'd expose to a time-bound iteration, we should expose all keys in time bounds and spans that may have been covered by `AddSSTable` events which occurred within those time bounds. I'm not totally clear on the interaction of `ClearRange` with incremental backups, but I do think that we'll need to at least know about them for streaming replication, and they naturally fit in the same model. To account for streaming replication, we will probably want to add some new `RangeFeed` message.

We can GC these records by their timestamp. Today we require for both catch-up scans and export requests that the entire time interval fall above the GC threshold. A related note is that when reading the values in some span and time interval due to an `AddSSTable`, if the oldest key in the SST is older than the GC threshold, it should be safe to perform the read at the GC threshold. In fact, limiting the lower bound on the timestamp is, in a sense, an optimization.
All of this, however, does sort of redefine what these lower bounds mean on these various requests. In order to clarify the behavior, all of this interaction with non-MVCC operations should be opt-in. In fact, normal `RangeFeed`s which do not opt into this behavior should instead fail with an error if such a non-MVCC operation is encountered.
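A sketch of how a reader (an `ExportRequest` or a `RangeFeed` catch-up scan) might act on these records, combining the visibility rule, the GC-threshold clamp, and the opt-in behavior described above; all names and types are hypothetical:

```go
package sketch

import "errors"

// nonMVCCOpRecord is the persisted record described above (hypothetical shape).
type nonMVCCOpRecord struct {
	ts                 int64  // when the non-MVCC operation occurred
	startKey, endKey   string // span it covered
	isAddSSTable       bool
	minSSTKeyTimestamp int64 // for AddSSTable: oldest key written by the SST
}

var errNonMVCCOp = errors.New("non-MVCC operation within the requested time bounds")

// extraScan describes an additional (span, start-time) the reader must visit
// beyond its normal time-bound scan.
type extraScan struct {
	startKey, endKey string
	fromTS           int64
}

// planTimeBoundRead inspects records whose timestamps fall in (startTS, endTS]
// and overlap [startKey, endKey). Readers that have not opted in fail with an
// error; opted-in readers get back the extra spans they must re-read, with the
// lower bound clamped to the GC threshold when the SST's oldest key is below it.
func planTimeBoundRead(
	recs []nonMVCCOpRecord, startKey, endKey string, startTS, endTS, gcThreshold int64, optedIn bool,
) ([]extraScan, error) {
	var extra []extraScan
	for _, r := range recs {
		if r.ts <= startTS || r.ts > endTS {
			continue // outside the requested time bounds
		}
		if !(r.startKey < endKey && startKey < r.endKey) {
			continue // does not overlap the requested span
		}
		if !optedIn {
			return nil, errNonMVCCOp
		}
		from := startTS
		if r.isAddSSTable && r.minSSTKeyTimestamp < from {
			from = r.minSSTKeyTimestamp
		}
		if from < gcThreshold {
			from = gcThreshold // safe to read at the GC threshold instead
		}
		extra = append(extra, extraScan{startKey: r.startKey, endKey: r.endKey, fromTS: from})
	}
	return extra, nil
}
```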
Describe alternatives you've considered

There was a proposal to try to force the SST timestamps to fall before the closed timestamp but that's a tough sell for a variety of reasons.
Additional context
This tracking may also enable mitigation for #31563.
Update: [1] We used to never assign a timestamp to the batch header for AddSSTable but we've very recently (#64023) started to set that to a timestamp at which the relevant read occurred. This is important to ensure that a delete tombstone is not lost to GC, erroneously revealing deleted data.
Epic: CRDB-2624