
rfc: Revert Command #16294

Closed

Conversation

spencerkimball
Member

No description provided.


@spencerkimball spencerkimball force-pushed the spencerkimball/delete-range-rfc branch from a512ba8 to 25b4274 on June 2, 2017 20:46
@bdarnell
Contributor

bdarnell commented Jun 2, 2017

Review status: 0 of 2 files reviewed at latest revision, 17 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 19 at r1 (raw file):

The primary goal for an enhanced `DeleteRange` command is efficiency.
A secondary goal is to lay the foundation for building a point-in-time
recovery mechanism, capable of reversing mutations older than a

s/older/newer/


docs/RFCS/fast_delete_range.md, line 23 at r1 (raw file):

It is important to disambiguate the word "range" in this document. In
the context of the `DeleteRange` command, it refers to any range of

We have generally switched to the word "span" for this; it may help to replace lowercase "range" with "span" in this doc.

The DeleteRange command is one of the last places where we use the word "range" in this sense. I think we haven't renamed it because of fears about RPC compatibility, but I don't think that's actually a concern and we could rename it to DeleteSpan.

This paragraph should probably go up in the summary or in a new "Background" section.


docs/RFCS/fast_delete_range.md, line 32 at r1 (raw file):

defined in `/pkg/storage/range.go`).

The current implementations of `DROP (TABLE|INDEX)` and `TRUNCATE

This paragraph is unclear - it says these two (three?) statements are similar (to each other, apparently) then immediately starts listing differences without ever saying what they have in common. I think it's enough for the purposes of the motivation section to say that DROP and TRUNCATE use the DeleteRange command, which makes them extremely slow.

Maybe also mention that DeleteRange has gotten much slower as a consequence of propEvalKV, since it must write all affected keys into the raft log.


docs/RFCS/fast_delete_range.md, line 68 at r1 (raw file):

feature, point-in-time recovery. Point-in-time recovery is similar to
queries at a historical timestamp ("time travel") via the `SELECT
... AS OF DATABASE TIME <X>` syntax. The difference is that

s/DATABASE/SYSTEM/


docs/RFCS/fast_delete_range.md, line 93 at r1 (raw file):

This fundamental mechanism will remain in place. The **key performance
optimization** is enabled *only* if the `DeleteRange` command covers
the entire key range of a Range. If it only covers `99.9%` of the

There are some potential edge cases here: we automatically split at table boundaries, but most SQL range operations run over index spans. That is, we have a Range with StartKey /Table/50 and EndKey /Table/51, but a full scan of that table by primary key uses StartKey /Table/50/1 and EndKey /Table/50/2. We may need to make some adjustments to make scans and table boundaries align, or to teach the system that nothing can exist between /Table/50 and /Table/50/1.


docs/RFCS/fast_delete_range.md, line 111 at r1 (raw file):

type Tombstone struct {
  Timestamp    hlc.Timestamp // Range values before this timestamp considered deleted
  EndTimestamp hlc.Timestamp // Non-zero for deletion up to historical timestamp

What sort of overlaps are allowed between Tombstones? Full deletions (which always start with EndTimestamp zero, I assume?) must be allowed to overlap previous point-in-time recoveries. But you must not be allowed to recover into a timestamp range that was affected by another recovery or deletion.

It was not obvious to me on my first read of this RFC that Timestamp > EndTimestamp. I assumed from the names that EndTimestamp would be the higher one. I think I've adjusted my comments now that I've understood this, but if something looks backwards that's probably why. Maybe EndTimestamp should be renamed StartTimestamp?


docs/RFCS/fast_delete_range.md, line 115 at r1 (raw file):

FullRangeTombstone is a replicated key-range-local key so that it

So it's a single versioned key, which unlike all other keys operates as if its value is the union of all its non-GC'd versions?


docs/RFCS/fast_delete_range.md, line 119 at r1 (raw file):

accept intent resolutions. Note that the `EndTimestamp` for a
point-in-time recovery tombstone must be more recent than the Range's
`GCThreshold`.

We should probably require Timestamp to be the timestamp of the transaction that wrote this value. Otherwise I can imagine various weird things (a Timestamp in the future would leave a "black hole" that couldn't be written into until the time expires; a Timestamp in the past would leave an inconsistent mix of previously-written values in place).
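
A minimal sketch of that constraint as an evaluation-time guard; the function name is hypothetical and the type is a simplified stand-in for `hlc.Timestamp`:

```go
package sketch

import "fmt"

// Timestamp is a simplified stand-in for hlc.Timestamp.
type Timestamp struct {
	Wall    int64
	Logical int32
}

// checkTombstoneTimestamp (hypothetical) rejects a full-Range tombstone
// whose Timestamp field differs from the writing transaction's timestamp,
// ruling out both the future "black hole" and the inconsistent-past cases.
func checkTombstoneTimestamp(txnTS, tombTS Timestamp) error {
	if tombTS != txnTS {
		return fmt.Errorf("tombstone timestamp %v must equal txn timestamp %v", tombTS, txnTS)
	}
	return nil
}
```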


docs/RFCS/fast_delete_range.md, line 137 at r1 (raw file):

  // history is a sorted slice, cached from materialized versioned
  // values of the Range's FullRangeTombstone key.
  history []Tombstone // Sorted by Tombstone.Timestamp

I'm afraid of the memory footprint here. If a user does a lot of DELETEs on a poorly-indexed table, we may end up with a growing list of tombstones to maintain. We should be able to remove these values from memory and reload them from disk when necessary, probably via a store-level cache instead of a range-level one (like the raftEntryCache and the timestampCache).
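
For illustration, a rough sketch of the store-level cache suggested here, by analogy with the raftEntryCache; all names, and the naive FIFO eviction, are assumptions rather than a design:

```go
package sketch

import "sync"

type RangeID int64

// Tombstone mirrors the struct proposed in the RFC (timestamps simplified).
type Tombstone struct {
	Timestamp    int64
	EndTimestamp int64
}

// tombstoneCache sketches a store-level cache: like the raftEntryCache,
// entries may be dropped at any time and reloaded from the replicated
// FullRangeTombstone key on a miss.
type tombstoneCache struct {
	mu         sync.Mutex
	byRange    map[RangeID][]Tombstone
	order      []RangeID // naive FIFO eviction order
	maxEntries int
}

func newTombstoneCache(maxEntries int) *tombstoneCache {
	return &tombstoneCache{byRange: map[RangeID][]Tombstone{}, maxEntries: maxEntries}
}

func (c *tombstoneCache) add(id RangeID, ts []Tombstone) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.byRange[id]; !ok {
		c.order = append(c.order, id)
	}
	c.byRange[id] = ts
	for len(c.order) > c.maxEntries { // evict the oldest insertion
		delete(c.byRange, c.order[0])
		c.order = c.order[1:]
	}
}

// get returns the cached history; a miss means the caller reloads from disk.
func (c *tombstoneCache) get(id RangeID) ([]Tombstone, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	ts, ok := c.byRange[id]
	return ts, ok
}
```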


docs/RFCS/fast_delete_range.md, line 181 at r1 (raw file):

- Consult the most recent `FullRangeTombstone` value; if an intent,
return `WriteIntentError`. If the mutation timestamp is older than

Reads must also return WriteIntentError if there is a FullRangeTombstone intent, right?


docs/RFCS/fast_delete_range.md, line 189 at r1 (raw file):

intents in case this is a busy Range that could otherwise never become
sufficiently quiescent to accept a full-Range tombstone.
- [**For full-Range tombstones ONLY**]: if the `MVCCStats.LastUpdated`

I don't think this can happen; a DeleteRange must block all other writes in the command queue so it can only run when it is going to be the most recent update to the range.


docs/RFCS/fast_delete_range.md, line 204 at r1 (raw file):

history for readers.

Blind puts should be disabled in the event there are any full-Range

So after point-in-time recovery has been used, writes will be slower by X% for the next day because of the loss of this optimization? That's unfortunate (but not critical).
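
A sketch of the extra gate this implies on the put path; everything here is hypothetical and schematic:

```go
package sketch

// put sketches the implied check: blind puts skip the "does an older
// version exist?" read, which is unsound once a full-Range tombstone may
// be masking existing versions of the key.
func put(key, value string, blindRequested, rangeHasFullRangeTombstones bool) {
	blind := blindRequested && !rangeHasFullRangeTombstones
	if blind {
		writeVersion(key, value) // no read-before-write
		return
	}
	_ = existingVersion(key) // consult versions and tombstones first
	writeVersion(key, value)
}

func writeVersion(key, value string)    {}
func existingVersion(key string) string { return "" }
```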


docs/RFCS/fast_delete_range.md, line 208 at r1 (raw file):

possible key in the range.

## Range Garbage Collection

s/Range/MVCC/? "Range GC" sounds too much like "replica GC".


docs/RFCS/fast_delete_range.md, line 228 at r1 (raw file):

avoid any per-key manipulation over the entire lifecycle of `DROP
TABLE`. Implementing this will require a range-based extension to the
`GC` KV command.

It will also require some tricky synchronization to ensure that it doesn't race with any other writes.


docs/RFCS/fast_delete_range.md, line 238 at r1 (raw file):

to the range. On merges, the greater of the two values is chosen.
On splits, the value is simply duplicated; it does not require a
scan to update exactly. This is a minimal change.

MVCCStats is serialized below raft, which means making changes to it in an upgrade-safe way is challenging. This will need a migration plan of its own, unless we can get rid of this field. It looks like it's used to prevent full scans in two places, one of which I think is redundant with the command queue. We might be able to come up with a way to prevent the scan in the GC pipeline without introducing this intrusive change.
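
Whatever the migration story, the merge/split rule quoted above is mechanically tiny. A sketch, assuming the `LastUpdated` field proposed by the RFC (which doesn't exist yet) and a simplified stand-in for `hlc.Timestamp`:

```go
package sketch

// Timestamp is a simplified stand-in for hlc.Timestamp.
type Timestamp struct {
	Wall    int64
	Logical int32
}

func (t Timestamp) Less(o Timestamp) bool {
	return t.Wall < o.Wall || (t.Wall == o.Wall && t.Logical < o.Logical)
}

// mergeLastUpdated: on a merge, the combined Range's high-water write
// timestamp is the newer of the two sides' values.
func mergeLastUpdated(lhs, rhs Timestamp) Timestamp {
	if lhs.Less(rhs) {
		return rhs
	}
	return lhs
}

// splitLastUpdated: on a split, both halves inherit the parent's value
// unchanged. This may overestimate for one half, which is safe (it only
// makes the fast-path check more conservative) and avoids a scan.
func splitLastUpdated(parent Timestamp) (lhs, rhs Timestamp) {
	return parent, parent
}
```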


docs/RFCS/fast_delete_range.md, line 254 at r1 (raw file):

temporarily out-of-date stats after a point-in-time recovery. If the
definition of "temporary" is stretched initially to mean "until the
next split, maybe never", that's an OK start. Expect the queue to be

I think I'm OK with leaving these stats estimated for an indefinite period of time. We don't do anything important with the "live" stats.


docs/RFCS/fast_delete_range.md, line 276 at r1 (raw file):

In these cases, if merge pressure is high, the full-Range tombstones
could themselves be merged into the underlying data, creating
individual tombstones. This will be left as a suggested TODO.

It's fine to leave merge-related work to the (distant) future.



@bdarnell
Contributor

bdarnell commented Jun 3, 2017

Review status: 0 of 2 files reviewed at latest revision, 18 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 180 at r1 (raw file):

just when writing data to the underlying Range.

- Consult the most recent `FullRangeTombstone` value; if an intent,

PropEvalKV rules require every operation to declare a read dependency on this key. We might be able to special-case it because this value is only written when there's a full-range write, but there are subtleties to work out. Range-local data (like the range descriptor) would not be covered by the full-range write span (but maybe that's OK because they shouldn't consult full-range tombstones anyway?)


docs/RFCS/fast_delete_range.md, line 181 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Reads must also return WriteIntentError if there is a FullRangeTombstone intent, right?

Also, intents written by the current transaction must be processed instead of returning WriteIntentError. Since this key behaves differently from other keys in that new versions are supposed to be merged with older versions, some care must be taken when a transaction contains multiple DeleteRange or recovery commands.



@petermattis
Collaborator

Review status: 0 of 2 files reviewed at latest revision, 20 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 89 at r1 (raw file):

only for two phase distributed transactions, although this is the
bedrock expectation for a `DROP` or `TRUNCATE` of anything but a
trivially small table.

I recall seeing there were a number of conditions which prevented the use of the DeleteRange fast path when dropping/truncating a table. Interleaved tables was one condition. I think foreign keys were another. It would be worthwhile to enumerate when this optimization would work and under what conditions it wouldn't. Cc @dt.


docs/RFCS/fast_delete_range.md, line 111 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What sort of overlaps are allowed between Tombstones? Full deletions (which always start with EndTimestamp zero, I assume?) must be allowed to overlap previous point-in-time recoveries. But you must not be allowed to recover into a timestamp range that was affected by another recovery or deletion.

It was not obvious to me on my first read of this RFC that Timestamp > EndTimestamp. I assumed from the names that EndTimestamp would be the higher one. I think I've adjusted my comments now that I've understood this, but if something looks backwards that's probably why. Maybe EndTimestamp should be renamed StartTimestamp?

I found the timestamp definitions here confusing as well. I think this would be simpler if the tombstone always defined a closed span of time:

```go
type Tombstone struct {
  StartTimestamp hlc.Timestamp
  EndTimestamp   hlc.Timestamp
}
```

If you want to delete a range you add a tombstone from [0, now]. If you want to revert a range to a previous time t you add a tombstone from [t, now].
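
A sketch of this closed-interval proposal with simplified scalar timestamps; whether the lower bound should really be inclusive for reverts (masking the version written at exactly t) is a boundary detail that would need to be pinned down:

```go
package sketch

// Timestamp is a simplified scalar stand-in for hlc.Timestamp.
type Timestamp int64

// Tombstone is the proposed closed-interval form: any version written
// within [Start, End] is considered deleted.
type Tombstone struct {
	Start, End Timestamp
}

// deleteAll masks all history: a tombstone over [0, now].
func deleteAll(now Timestamp) Tombstone { return Tombstone{Start: 0, End: now} }

// revertTo masks everything written since t: a tombstone over [t, now].
// (Whether the lower bound should be exclusive, so the version written at
// exactly t survives, is a detail to pin down.)
func revertTo(t, now Timestamp) Tombstone { return Tombstone{Start: t, End: now} }

// masks reports whether a version written at ts is hidden by tomb.
func masks(tomb Tombstone, ts Timestamp) bool {
	return tomb.Start <= ts && ts <= tomb.End
}
```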


docs/RFCS/fast_delete_range.md, line 295 at r1 (raw file):

However, this mechanism wouldn't account for the case of `TRUNCATE
TABLE` or point-in-time recovery. It also introduces a new workflow

@dt or someone else said that we could possibly implement TRUNCATE TABLE by allocating a new table ID and propagating to dependent tables. Backup/restore has to do something like this already.

Independent of this RFC, it would be nice to figure out a way to unify DROP TABLE and TRUNCATE TABLE. That the latter takes place in a single transaction is a severe limitation.



@tbg
Member

tbg commented Jun 5, 2017

Reviewed 2 of 2 files at r1.
Review status: all files reviewed at latest revision, 23 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 111 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I found the timestamp definitions here confusing as well. I think this would be simpler if the tombstone always defined a closed span of time:

```go
type Tombstone struct {
  StartTimestamp hlc.Timestamp
  EndTimestamp   hlc.Timestamp
}
```

If you want to delete a range you add a tombstone from [0, now]. If you want to revert a range to a previous time t you add a tombstone from [t, now].

Same confusion here.


docs/RFCS/fast_delete_range.md, line 184 at r1 (raw file):

the most recent full-Range tombstone, return `WriteTooOldError`.
- [**For full-Range tombstones ONLY**]: if `MVCCStats.IntentCount` is
non-zero, revert to a traditional `DeleteRange` MVCC operation which

Should point out that even if the MVCCStats have ContainsEstimates set, the IntentCount should always be accurate (which should be documented better).
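
A sketch of the gate this implies, with a stand-in for the relevant `MVCCStats` fields; the point is that `IntentCount` alone decides, regardless of `ContainsEstimates`:

```go
package sketch

// statsLite stands in for the relevant MVCCStats fields.
type statsLite struct {
	IntentCount       int64
	ContainsEstimates bool
}

// useFastDeletePath sketches the gate: IntentCount stays exact even when
// ContainsEstimates is set, so it alone decides between writing a single
// full-Range tombstone and falling back to per-key MVCC deletion.
func useFastDeletePath(ms statsLite) bool {
	return ms.IntentCount == 0 // estimates elsewhere don't matter
}
```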


docs/RFCS/fast_delete_range.md, line 189 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I don't think this can happen; a DeleteRange must block all other writes in the command queue so it can only run when it is going to be the most recent update to the range.

Most recent, yes, but not newest timestamp.


docs/RFCS/fast_delete_range.md, line 233 at r1 (raw file):

In the section above on mutations with full-Range tombstones, the
`MVCCStats.LastUpdated` field is referenced. This field doesn't yet

But we have MVCCStats.LastUpdateNanos; doesn't that work (assuming the guarantees about when the value is updated are strong enough)?


docs/RFCS/fast_delete_range.md, line 254 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think I'm OK with leaving these stats estimated for an indefinite period of time. We don't do anything important with the "live" stats.

It's been a hassle to deal with out-of-date stats. Any operation that relies on them must make assumptions about how they're out of date. So maybe it's OK for the live stats, but good to avoid where we can.


docs/RFCS/fast_delete_range.md, line 280 at r1 (raw file):

# Drawbacks

As always, complexity is a cause for concern. This RFC will introduce

What happens if you drop point-in-time recovery? The driving issue here is dealing with large truncations. Point-in-time seems fashionable to add, but as far as I can tell it hasn't been requested or prioritized by anyone, plus you can already jerry-rig it with AS OF SYSTEM TIME if you're willing to make a copy. While it's good to plan for as a potential extension, I think we should discuss tabling it for now to avoid the additional complexity.



@petermattis
Collaborator

Review status: all files reviewed at latest revision, 23 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 280 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

What happens if you drop point-in-time-recovery? The driving issue here is dealing with large truncations. Point-in-time seems fashionable to add, but as far as I can tell hasn't been requested or prioritized by anyone, plus you can already jerry-rig it with AS OF SYSTEM TIME if you're willing to make a copy. While good to plan for as a potential extension, I think we should discuss tabling that for now and to avoid the additional complexity.

Point-in-time recovery has been requested by a number of folks: #16161.



@danhhz
Contributor

danhhz commented Jun 5, 2017

Review status: all files reviewed at latest revision, 23 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 280 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Point-in-time recovery has been requested by a number of folks: #16161.

Point-in-time recovery also addresses a HUGE hole in our current backup/restore story. I don't consider it optional.



@danhhz
Contributor

danhhz commented Jun 5, 2017

Review status: all files reviewed at latest revision, 26 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 8 at r1 (raw file):

- Cockroach Issue: [#2003](https://github.com/cockroachdb/cockroach/issues/2003) [#14120](https://github.com/cockroachdb/cockroach/issues/14120) [#14279](https://github.com/cockroachdb/cockroach/issues/14279) [#15723](https://github.com/cockroachdb/cockroach/issues/15723) [#16161](https://github.com/cockroachdb/cockroach/issues/16161)

# Summary

Dunno if this is helpful, but here's the summary I wrote when I first started thinking about this rfc: "This RFC describes a new kv operation for transactional bulk deletes across spans of keys and time."


docs/RFCS/fast_delete_range.md, line 60 at r1 (raw file):

additional modifications to the SQL layer. One small but salubrious
exception is we'll remove the "chunking" that's currently done in the
case of `DROP TABLE`.

Another important motivation is that if a user is DROPing or TRUNCATEing a table because they're running out of disk and want a quick win, we currently do that by writing more to disk (a rocksdb tombstone per key). This doesn't immediately solve that problem but certainly paves the way.


docs/RFCS/fast_delete_range.md, line 66 at r1 (raw file):

As mentioned in the summary, the fast `DeleteRange` modifications
introduced in this RFC are meant to also enable a new enterprise-tier
feature, point-in-time recovery. Point-in-time recovery is similar to

I've been trying to standardize on calling this piece ROLLBACK. A second piece we can build is the ability to RESTORE to system times in a backup other than the end time (subject to certain limitations), which I've been calling "point-in-time RESTORE". Our customers probably don't care about the distinctions and "point-in-time recovery" sounds to me like a fine name for the collective.

Sorry if this is too much of a bikeshed


docs/RFCS/fast_delete_range.md, line 89 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I recall seeing there were a number of conditions which prevented the use of the DeleteRange fast path when dropping/truncating a table. Interleaved tables was one condition. I think foreign keys were another. It would be worthwhile to enumerate when this optimization would work and under what conditions it wouldn't. Cc @dt.

The code that makes the fast path decision is https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/tablewriter.go#L547-L568


docs/RFCS/fast_delete_range.md, line 93 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

There are some potential edge cases here: we automatically split at table boundaries, but most sql range operations run over index spans. That is, we have a Range with StartKey /Table/50 and EndKey /Table/51, but a full scan of that table by primary key uses StartKey /Table/50/1 and EndKey /Table/50/2. We may need to make some adjustments to make scans and table boundaries align, or to teach the system that nothing can exist between /Table/50 and /Table/50/1

For tables not involved in any interleave hierarchy, I'd imagine we'd always use [/Table/50, /Table/50.PrefixEnd()) as the bounds of DeleteSpan (this is independent of DROP, TRUNCATE, or ROLLBACK). If an entire interleave hierarchy is being acted on, it will also use this range. If only part of an interleave hierarchy is being acted on, then we can't use this command.

tl;dr Not sure there's anything to do here, unless I'm missing your point.


docs/RFCS/fast_delete_range.md, line 111 at r1 (raw file):

But you must not be allowed to recover into a timestamp range that was affected by another recovery or deletion.

Why is this? The semantics of ROLLBACK to me are "make the table look exactly like it did at this time". The later rollback or deletion should be irrelevant. I could see the following situation:

Write committed at t1
Write committed at t2
Rollback to t1 (committed at t3)
Rollback to t2 (committed at t4)

In which case the rollback at t4 should make the system ignore the tombstone at t3


docs/RFCS/fast_delete_range.md, line 115 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

So it's a single versioned key, which unlike all other keys operates as if its value is the union of all its non-GC'd versions?

Seems like we should flesh out the alternative then? Specifically: only the latest tombstone is consulted, and it's a merging of all the tombstone timespans for that range.
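
A sketch of what that alternative might look like, assuming a hypothetical merged representation in which the newest tombstone version carries the union of all deleted timespans:

```go
package sketch

import "sort"

// TimeSpan is a half-open span of MVCC time, [Start, End).
type TimeSpan struct{ Start, End int64 }

// MergedTombstone sketches the alternative: the newest FullRangeTombstone
// version carries the union of every still-relevant deleted timespan, so a
// reader consults only the latest version instead of walking history.
type MergedTombstone struct {
	Deleted []TimeSpan // disjoint, sorted by Start
}

// addSpan folds a new deletion or revert span into the set, coalescing
// any spans it overlaps.
func (m *MergedTombstone) addSpan(s TimeSpan) {
	merged := make([]TimeSpan, 0, len(m.Deleted)+1)
	for _, d := range m.Deleted {
		if d.End < s.Start || s.End < d.Start {
			merged = append(merged, d) // disjoint: keep as-is
			continue
		}
		// Overlap: grow s to swallow d.
		if d.Start < s.Start {
			s.Start = d.Start
		}
		if d.End > s.End {
			s.End = d.End
		}
	}
	merged = append(merged, s)
	sort.Slice(merged, func(i, j int) bool { return merged[i].Start < merged[j].Start })
	m.Deleted = merged
}
```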


docs/RFCS/fast_delete_range.md, line 295 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

@dt or someone else said that we could possibly implement TRUNCATE TABLE by allocating a new table ID and propagating to dependent tables. Backup/restore has to do something like this already.

Independent of this RFC, it would be nice to figure out a way to unify DROP TABLE and TRUNCATE TABLE. That the latter takes place in a single transaction is a severe limitation.

Yeah, I've thought about this some and I don't see why we couldn't implement TRUNCATE by giving the table a new ID.



@bdarnell
Contributor

bdarnell commented Jun 5, 2017

Review status: all files reviewed at latest revision, 26 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 66 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

I've been trying to standardize on calling this piece ROLLBACK. A second piece we can build is the ability to RESTORE to system times in a backup other than the end time (subject to certain limitations), which I've been calling "point-in-time RESTORE". Our customers probably don't care about the distinctions and "point-in-time recovery" sounds to me like a fine name for the collective.

Sorry if this is too much of a bikeshed

ROLLBACK is already established in SQL for transactions; I'd rather not overload the term. How about REVERT?


docs/RFCS/fast_delete_range.md, line 93 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

For tables not involved in any interleave hierarchy, I'd imagine we'd always use [/Table/50, /Table/50.PrefixEnd()) as the bounds of DeleteSpan (this is independent of DROP, TRUNCATE, or ROLLBACK). If an entire interleave hierarchy is being acted on, it will also use this range. If only part of an interleave hierarchy is being acted on, then we can't use this command

tl;dr Not sure there's anything to do here, unless I'm missing your point

As long as we use the table span (in contrast to BACKUP, which always uses index-level spans), there's nothing else to do.


docs/RFCS/fast_delete_range.md, line 111 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

But you must not be allowed to recover into a timestamp range that was affected by another recovery or deletion.

Why is this? The semantics of ROLLBACK to me are "make the table look exactly like it did at this time". The later rollback or deletion should be irrelevant. I could see the following situation:

Write committed at t1
Write committed at t2
Rollback to t1 (committed at t3)
Rollback to t2 (committed at t4)

In which case the rollback at t4 should make the system ignore the tombstone at t3

I was thinking that it would be weird if deleted data suddenly got undeleted, but on reflection it should definitely be allowed. You have to be able to undo a bad point-in-time recovery.


docs/RFCS/fast_delete_range.md, line 115 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

Seems like we should flesh out the alternative then? Specifically: only the latest tombstone is consulted, and it's a merging of all the tombstone timespans for that range.

Yeah, let's think that through.


docs/RFCS/fast_delete_range.md, line 189 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Most recent, yes, but not newest timestamp.

Hmm? Writes always block each other regardless of timestamp. There's no way a DeleteRange could execute when some other operation has occurred on the range at a higher timestamp.


docs/RFCS/fast_delete_range.md, line 233 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

But we have MVCCStats.LastUpdateNanos, doesn't that work (assuming the guarantees about the value being updated being strong enough).

LastUpdateNanos is not the last update to the stats, it's the last update to the "age" fields. (and it's not precise enough; it's not a full HLC timestamp)


docs/RFCS/fast_delete_range.md, line 280 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

Point in time recovery also addresses a HUGE hole in our current backup/restore story. I don't consider it optional

The feature may be required, but we don't necessarily have to get it in this way. The comparable feature in many other databases is powered by a continuous backup of the WAL or another feed of changes (which we intend to build anyway at some point).

Getting the feature this way is better than WAL mirroring, but the complexity this introduces on the critical path is still a concern.



@spencerkimball
Member Author

Review status: all files reviewed at latest revision, 26 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 8 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

Dunno if this is helpful, but here's the summary I wrote when I first started thinking about this rfc: "This RFC describes a new kv operation for transactional bulk deletes across
spans of keys and time."

Incorporated.


docs/RFCS/fast_delete_range.md, line 19 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/older/newer/

Done.


docs/RFCS/fast_delete_range.md, line 23 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We have generally switched to the word "span" for this; it may help to replace lowercase "range" with "span" in this doc.

The DeleteRange command is one of the last places where we use the word "range" in this sense. I think we haven't renamed it because of fears about RPC compatibility, but I don't think that's actually a concern and we could rename it to DeleteSpan.

This paragraph should probably go up in the summary or in a new "Background" section.

Introduced use of "span" throughout and created a #Background section.


docs/RFCS/fast_delete_range.md, line 32 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This paragraph is unclear - it says these two (three?) statements are similar (to each other, apparently) then immediately starts listing differences without ever saying what they have in common. I think it's enough for the purposes of the motivation section to say that DROP and TRUNCATE use the DeleteRange command, which makes them extremely slow.

Maybe also mention that DeleteRange has gotten much slower as a consequence of propEvalKV, since it must write all affected keys into the raft log.

Done.


docs/RFCS/fast_delete_range.md, line 60 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

Another important motivation is that if a user is DROPing or TRUNCATEing a table because they're running out of disk and want a quick win, we currently do that by writing more to disk (a rocksdb tombstone per key). This doesn't immediately solve that problem but certainly paves the way

That's an interesting point. I added it as an aside.


docs/RFCS/fast_delete_range.md, line 66 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

ROLLBACK is already established in SQL for transactions; I'd rather not overload the term. How about REVERT?

I'm happy enough with calling the actual SQL command REVERT, but I think "point-in-time recovery" sounds a little better from a marketing perspective than "point-in-time revert", so I'm going to leave the phrase as is in this RFC. Note that this RFC isn't covering any details of the actual point-in-time recovery; just the underlying mechanism which will make it possible.


docs/RFCS/fast_delete_range.md, line 68 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/DATABASE/SYSTEM/

Done.


docs/RFCS/fast_delete_range.md, line 89 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

The code that makes the fast path decision is https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/tablewriter.go#L547-L568

Done.


docs/RFCS/fast_delete_range.md, line 93 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

As long as we use the table span (in contrast to BACKUP, which always uses index-level spans), there's nothing else to do.

Will definitely be covering this in testing.


docs/RFCS/fast_delete_range.md, line 111 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I was thinking that it would be weird if deleted data suddenly got undeleted, but on reflection it should definitely be allowed. You have to be able to undo a bad point-in-time recovery.

Done.


docs/RFCS/fast_delete_range.md, line 115 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Yeah, let's think that through.

No, it IS an MVCC key, as mentioned above ("an additional MVCC Range-local key..."). If a read comes in for time t, then you read the value of the full-Range tombstone that existed at that time and use that to merge. As mentioned further below in the document, if the full-Range tombstone was a revert to an earlier time, this might require additional consultations of the full-Range tombstone and of the underlying KV data.

In other words, you read the "correct" tombstone that applies to your read the same way you read the underlying KV data: by looking up the closest MVCC version that is <= your transaction timestamp.

The only reason it's materialized with all versions in memory is to avoid endlessly reading the same value.

I adjusted the paragraph to make this clearer.


docs/RFCS/fast_delete_range.md, line 119 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We should probably require Timestamp to be the timestamp of the transaction that wrote this value. Otherwise I can imagine various weird things (a Timestamp in the future would leave a "black hole" that couldn't be written into until the time expires; a Timestamp in the past would leave an inconsistent mix of previously-written values in place.

Yes, it's the transaction timestamp. I've made that more explicit.


docs/RFCS/fast_delete_range.md, line 137 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'm afraid of the memory footprint here. If a user does a lot of DELETEs on a poorly-indexed table, we may end up with a growing list of tombstones to maintain. We should be able to remove these values from memory and reload them from disk when necessary, probably via a store-level cache instead of a range-level one (like the raftEntryCache and the timestampCache)

Gotcha. After some thought about possibly holding only the most recent value, I think it'll be best to keep reading everything and put a TODO in the code to watch out for blowup here. The reason is reverts: those require reading older values again, and the ideas that occurred to me for reading more than just the first value made things complicated. Feels better to punt on the problem of too many TRUNCATEs for the time being.


docs/RFCS/fast_delete_range.md, line 180 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

PropEvalKV rules require every operation to declare a read dependency on this key. We might be able to special-case it because this value is only written when there's a full-range write, but there are subtleties to work out. Range-local data (like the range descriptor) would not be covered by the full-range write span (but maybe that's OK because they shouldn't consult full-range tombstones anyway?)

Non-MVCC reads definitely don't consult the full-Range tombstones, and all other range-local data is non-MVCC. We need to be very careful about this though.


docs/RFCS/fast_delete_range.md, line 181 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Also, intents written by the current transaction must be processed instead of returning WriteIntentError. Since this key behaves differently from other keys in that new versions are supposed to be merged with older versions, some care must be taken when a transaction contains multiple DeleteRange or recovery commands.

Yes, reads must return WriteIntentError, same as writes. I added a sentence about this.

Added a comment about handling an existing full-Range tombstone intent appropriately.


docs/RFCS/fast_delete_range.md, line 184 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Should point out that even if the MVCCStats have ContainsEstimates set, the IntentCount should always be accurate (which should be documented better).

Yes, that was my assumption. Is there anything you want me to change in this RFC, or just a point you were reiterating?


docs/RFCS/fast_delete_range.md, line 189 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Hmm? Writes always block each other regardless of timestamp. There's no way a DeleteRange could execute when some other operation has occurred on the range at a higher timestamp.

We do still need this high water mark. It's not so much about concurrency, but about whether we can guarantee that when a full-Range tombstone is applied at time t, there is no existing data in the range with timestamp > t. Obviously writes come into a range with all manner of unordered timestamps (i.e. long-running txns with old timestamps, commands generated on nodes with faster or slower clocks, etc.).
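
A sketch of the high-water-mark check this implies, assuming the `LastUpdated` field proposed by the RFC and a simplified stand-in for `hlc.Timestamp`:

```go
package sketch

// Timestamp is a simplified stand-in for hlc.Timestamp.
type Timestamp struct {
	Wall    int64
	Logical int32
}

func (t Timestamp) Less(o Timestamp) bool {
	return t.Wall < o.Wall || (t.Wall == o.Wall && t.Logical < o.Logical)
}

// fullRangeTombstoneSafe: a full-Range tombstone at ts is only valid if no
// version in the Range is newer, i.e. the proposed high-water mark
// (LastUpdated in the RFC) is at or below ts; otherwise the command must
// fall back to per-key deletion or be retried at a higher timestamp.
func fullRangeTombstoneSafe(lastUpdated, ts Timestamp) bool {
	return !ts.Less(lastUpdated) // lastUpdated <= ts
}
```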


docs/RFCS/fast_delete_range.md, line 204 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

So after point-in-time recovery has been used, writes will be slower by X% for the next day because of the loss of this optimization? That's unfortunate (but not critical)

Yes that's correct. Rewriting a big table with batch writes after a TRUNCATE will be slower in that the blind put optimization will be unavailable.


docs/RFCS/fast_delete_range.md, line 208 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/Range/MVCC/? "Range GC" sounds too much like "replica GC".

Done.


docs/RFCS/fast_delete_range.md, line 228 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It will also require some tricky synchronization to ensure that it doesn't race with any other writes.

Yes; I'm considering this somewhat TBD in terms of design still. Added to the Unresolved questions section.


docs/RFCS/fast_delete_range.md, line 233 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

LastUpdateNanos is not the last update to the stats, it's the last update to the "age" fields. (and it's not precise enough; it's not a full HLC timestamp)

Right. We can't use it... We really need the high-water HLC timestamp for data written to the range.


docs/RFCS/fast_delete_range.md, line 238 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

MVCCStats is serialized below raft, which means making changes to it in an upgrade-safe way is challenging. This will need a migration plan of its own, unless we can get rid of this field. It looks like it's used to prevent full scans in two places, one of which I think is redundant with the command queue. We might be able to come up with a way to prevent the scan in the GC pipeline without introducing this intrusive change.

I don't think we can get rid of it. I'll add this to the unresolved questions section.


docs/RFCS/fast_delete_range.md, line 254 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

It's been a hassle to deal with out-of-date stats. Any operation that relies on them must make assumptions on how they're out-of-date. So, maybe ok, for the Live stats, but good to avoid where we can.

We should clean them up...shouldn't be too hard.


docs/RFCS/fast_delete_range.md, line 276 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It's fine to leave merge-related work to the (distant) future.

Done.


docs/RFCS/fast_delete_range.md, line 280 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The feature may be required, but we don't necessarily have to get it in this way. The comparable feature in many other databases is powered by a continuous backup of the WAL or another feed of changes (which we intend to build anyway at some point).

Getting the feature this way is better than WAL mirroring, but the complexity this introduces on the critical path is still a concern.

I added this as a drawback, but I'm comfortable with the additional complexity for point-in-time recovery. If that's all we were getting from this mechanism, I might be less sanguine. But because we're already adding these full-Range tombstones for deletion / truncation, I think the extra bit for revert is easily worth the cost to do now and solve them all in one go.


docs/RFCS/fast_delete_range.md, line 295 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

Yeah, I've thought about this some and I don't see why we couldn't implement TRUNCATE by giving the table a new id

Doing a new ID for truncation would be great, because it would re-enable the blind put optimization. However, I don't understand the potential problems you'll experience when doing time travel queries.



@spencerkimball spencerkimball force-pushed the spencerkimball/delete-range-rfc branch from 25b4274 to 37c6f96 on June 5, 2017 23:06
@a-robinson
Contributor

Reviewed 1 of 2 files at r1, 1 of 1 files at r2.
Review status: all files reviewed at latest revision, 23 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 115 at r2 (raw file):

active only in situations where there are:
- No interleaved sub-tables
- Active indexes on the table (only applies to `TRUNCATE`)

For an active index, couldn't we just also send DeleteRange requests to the index ranges?


docs/RFCS/fast_delete_range.md, line 116 at r2 (raw file):

  • Active indexes on the table (only applies to TRUNCATE)
  • Foreign keys referencing the table

You either need to start these two phrases with the word "no" or move the word "no" from the first bullet into the preceding sentence.



@bdarnell
Contributor

bdarnell commented Jun 6, 2017

Review status: all files reviewed at latest revision, 19 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 66 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I'm happy enough with calling the actual SQL command REVERT, but I think "point-in-time recovery" sounds a little better from a marketing perspective than "point-in-time revert", so I'm going to leave the phrase as is in this RFC. Note that this RFC isn't covering any details of the actual point-in-time recovery; just the underlying mechanism which will make it possible.

+1 to "point-in-time recovery" as the feature name but REVERT as the command. (I'd be OK with RECOVER as the command too if we want the terminology to be more consistent, but REVERT sounds better to me)


docs/RFCS/fast_delete_range.md, line 115 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

No, it IS an MVCC key, as mentioned above ("an additional MVCC Range-local key..."). If a read comes in for time t, then you read the value of the full-Range tombstone that existed at that time and use that to merge. As mentioned further below in the document, if the full-Range tombstone was a revert to an earlier time, this might require additional consultations from the full-Range tombstone and from the underlying KV data.

In other words, you read the "correct" tombstone that applies to your read the same way your read the underlying KV data: by looking up the closest MVCC version that is <= your transaction timestamp.

The only reason it's materialized with all versions in memory is to avoid endlessly reading the same value.

I adjusted the paragraph to make this clearer.

Is it true that only the closest MVCC version <= your transaction timestamp can affect your read? I don't think so:

  • Write regular values to some key at t1, t2, t3
  • At t4, issue a point-in-time revert to t2, writing a range tombstone from t2 to t4
  • At t6, issue a point-in-time revert to t5, writing a range tombstone from t5 to t6
  • Read at t6. The most recent tombstone is t5 to t6, which would allow us to incorrectly read the value at t3. We have to combine this tombstone with the previous one.

docs/RFCS/fast_delete_range.md, line 180 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Non-MVCC reads definitely don't consult the full-Range tombstones, which is what all other range-local data is. We need to be very careful about this though.

The range descriptor is an MVCC range-local key. They're the only ones currently, but they're not categorically prohibited.


docs/RFCS/fast_delete_range.md, line 189 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

We do still need this high water mark. It's not so much about concurrency, but about whether we can guarantee that when a full-Range tombstone is applied at time t, there is no existing data in the range with timestamp > t. Obviously writes come in to a range with all manner of unordered timestamps (i.e. long running txns with old timestamps, commands generated on nodes with faster or slower clocks, etc.)

OK, I get it now.


docs/RFCS/fast_delete_range.md, line 295 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Doing a new ID for truncation would be great, because it would re-enable the blind put optimization. However, I don't understand the potential problems you'll experience when doing time travel queries.

Time travel queries do a time-travel read of the descriptors, so they should see the old ID.


docs/RFCS/fast_delete_range.md, line 112 at r2 (raw file):

key in the span and adding individual tombstones.

Note that the fast path for `DROP` and `TRUNCATE` SQL commands is

These are pretty severe restrictions. It's outside the scope of this RFC but we definitely want to be able to quickly drop tables with secondary indexes and entire subtrees of interleaved tables in the future.



@spencerkimball spencerkimball force-pushed the spencerkimball/delete-range-rfc branch from 37c6f96 to 8ef8879 on June 6, 2017 20:51
@spencerkimball
Member Author

Review status: 1 of 2 files reviewed at latest revision, 19 unresolved discussions.


docs/RFCS/fast_delete_range.md, line 115 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Is it true that only closest MVCC version <= your transaction timestamp can affect your read? I don't think so:

  • Write regular values to some key at t1, t2, t3
  • At t4, issue a point-in-time revert to t2, writing a range tombstone from t2 to t4
  • At t6, issue a point-in-time revert to t5, writing a range tombstone from t5 to t6
  • Read at t6. The most recent tombstone is t5 to t6, which would allow us to incorrectly read the value at t3. We have to combine this tombstone with the previous one.

What I meant was that you use the closest MVCC version <= cmd/txn timestamp on the first read. BUT, each revert tombstone that is merged with the result read from MVCC requires that another read be done (for both MVCC and full-Range tombstone) at the revert tombstone's start time.

So in your example, when reading at t6, we'd be affected by the tombstone to t5, then read at t5, which would be affected by the tombstone to t2, then read at t2.

This is described earlier in the document.
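
To make the walk concrete, here is a self-contained sketch of that read path, with simplified stand-ins for `hlc.Timestamp` and for the version and tombstone histories (all helper names are hypothetical). Run on Ben's chained-revert example, it returns the t2 value rather than the t3 one:

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for hlc.Timestamp.
type Timestamp struct {
	Wall    int64
	Logical int32
}

func (t Timestamp) Less(o Timestamp) bool {
	return t.Wall < o.Wall || (t.Wall == o.Wall && t.Logical < o.Logical)
}

func (t Timestamp) IsZero() bool { return t == Timestamp{} }

// Tombstone mirrors the RFC's struct: versions older than Timestamp are
// deleted, except those at or below a non-zero EndTimestamp (a revert).
type Tombstone struct {
	Timestamp    Timestamp
	EndTimestamp Timestamp
}

type version struct {
	ts  Timestamp
	val string
}

// newestAtOrBelow returns the newest version at or below ts
// (versions sorted oldest to newest).
func newestAtOrBelow(vs []version, ts Timestamp) (version, bool) {
	for i := len(vs) - 1; i >= 0; i-- {
		if !ts.Less(vs[i].ts) {
			return vs[i], true
		}
	}
	return version{}, false
}

// tombNewestAtOrBelow does the same over the tombstone history; each
// tombstone's MVCC version is its write timestamp, which per the
// discussion above equals its Timestamp field.
func tombNewestAtOrBelow(tombs []Tombstone, at Timestamp) (Tombstone, bool) {
	for i := len(tombs) - 1; i >= 0; i-- {
		if !at.Less(tombs[i].Timestamp) {
			return tombs[i], true
		}
	}
	return Tombstone{}, false
}

// resolveRead is the iterative walk described above: read the newest value
// and newest tombstone at or below ts; if the tombstone masks the value,
// either report a full deletion or retry both reads at the revert target.
func resolveRead(vs []version, tombs []Tombstone, ts Timestamp) (string, bool) {
	for {
		v, okV := newestAtOrBelow(vs, ts)
		t, okT := tombNewestAtOrBelow(tombs, ts)
		if !okT || (okV && !v.ts.Less(t.Timestamp)) {
			return v.val, okV // no tombstone masks the value
		}
		if t.EndTimestamp.IsZero() {
			return "", false // full-span deletion: nothing survives
		}
		ts = t.EndTimestamp // re-read as of the revert target
	}
}

func main() {
	at := func(w int64) Timestamp { return Timestamp{Wall: w} }
	// Ben's example: writes at t1..t3, revert to t2 at t4, revert to t5 at t6.
	vs := []version{{at(1), "v1"}, {at(2), "v2"}, {at(3), "v3"}}
	tombs := []Tombstone{
		{Timestamp: at(4), EndTimestamp: at(2)},
		{Timestamp: at(6), EndTimestamp: at(5)},
	}
	fmt.Println(resolveRead(vs, tombs, at(6))) // prints "v2 true", not v3
}
```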


docs/RFCS/fast_delete_range.md, line 180 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The range descriptor is an MVCC range-local key. They're the only ones currently, but they're not categorically prohibited.

Correct. It's going to require a comprehensive comment in the code.


docs/RFCS/fast_delete_range.md, line 295 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Time travel queries do a time-travel read of the descriptors, so they should see the old ID.

Doc updated.


docs/RFCS/fast_delete_range.md, line 112 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

These are pretty severe restrictions. It's outside the scope of this RFC but we definitely want to be able to quickly drop tables with secondary indexes and entire subtrees of interleaved tables in the future.

In reality, for both DROP and TRUNCATE, I expect all interleaved tables and secondary indexes to be dropped in tandem as a result of ON DELETE CASCADE. Why keep the interleaved tables? Makes no sense. I'm pretty sure, though not positive, that that's assumed for Google's interleaved tables with Spanner. Secondary indexes are the same thing.

For Oracle, dropping a table will automatically and without exception drop its secondary indexes.

So I think in practice only global, non-interleaved tables with foreign key references will not follow the fast path, but those will probably be rare.


docs/RFCS/fast_delete_range.md, line 115 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

For an active index, couldn't we just also send DeleteRange requests to the index ranges?

Yes.


docs/RFCS/fast_delete_range.md, line 116 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…
  • Active indexes on the table (only applies to TRUNCATE)
  • Foreign keys referencing the table

You either need to start these two phrases with the word "no" or move the word "no" from the first bullet into the preceding sentence.

Done.



@bdarnell
Contributor

bdarnell commented Jun 6, 2017

Review status: 1 of 2 files reviewed at latest revision, 19 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 115 at r1 (raw file):

BUT, each revert tombstone that is merged with the result read from MVCC requires that another read be done

Yes, that's exactly what I mean about this key being treated differently from all other MVCC keys. I see now that this is described earlier, but it took several passes for me to connect the dots.



@spencerkimball
Member Author

This PR is entering the final comments period.

@tbg
Member

tbg commented Jun 7, 2017

Review status: 1 of 2 files reviewed at latest revision, 21 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 111 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Done.

Can you explicitly walk through some time travel queries in your main example? In particular, it seems that without proper precautions, it could happen that you run a time travel query that aggregates over two ranges. When it reads the first range, the fast DeleteRange hasn't run yet. Before time travel touches the second, the DeleteRange runs on both ranges, and so on the second range the time travel query doesn't see data that's consistent with what it read on the first range.

I also wonder how you're undoing a PITR as Ben pointed at above. You're dealing with tombstones only, but it seems that you'd also want an Un-Tombstone, though that's sure to make things more complex. What was your answer to him there?


docs/RFCS/fast_delete_range.md, line 184 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Yes, that was my assumption. Is there anything you want me to change in this RFC, or just a point you were reiterating?

We should document this better at some point, but no specific action required here.


docs/RFCS/fast_delete_range.md, line 233 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Right. We can't use it...We really need the high water hlc timestamp for data written to the range.

I see. So you're only missing the logical component - I think LastUpdateNanos is the high water mark wall time component because we always AgeTo on every write. I haven't double checked that though, but those are definitely the semantics this field should have.


docs/RFCS/fast_delete_range.md, line 280 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I added this as a drawback, but I'm comfortable with the additional complexity for point-in-time recovery. If that's all we were getting from this mechanism, I might be less sanguine. But because we're already adding these full-Range tombstones for deletion / truncation, I think the extra bit for revert is easily worth the cost to do now and solve them all in one go.

I'm on board with point-in-time recovery being a useful addition to the system and I see the allure of two birds with one stone. I have no concerns about RFC'ing the complete design. But I'm worried about making this a one-man show trying to implement it all in one fell swoop for 1.1, as there is a lot of test code to write. If I picture the worst case, then I see you rushing this feature in just before 1.1 closes (perhaps with CEO persuasion), and later others will have to go in and fix any bugs that crop up in production after painfully long investigation and a limited window for a fix. In my ideal case we make this two-stage: you loop someone in who's not a member of the old wise men core team and work with them to get a solid solution for large drops into 1.1, and then continue with PITR for 1.2 (or 1.1, if things are much faster than anticipated). Of course there's also the OK case in which you do all the work and it sorta works out OK, perhaps great. But even then I'd love for someone new to grok this, too (similar to how Dan got thrown into the storage bowels via backup-restore through Ben and myself). Curious to hear what others think.


docs/RFCS/fast_delete_range.md, line 51 at r3 (raw file):

additional modifications to the SQL layer. One small but salubrious
exception is we'll remove the "chunking" that's currently done in the
case of `DROP TABLE`.

Should make clear somewhere that the optimization only applies when there is no limit and no deleted keys are returned (mostly so that it's not forgotten about during implementation). Also, anything special about the first range? It contains a bunch of non-userspace keys. I assume we simply won't do this optimization for the first range, ever. (That should be OK since we split out the table anyway.)


docs/RFCS/fast_delete_range.md, line 227 at r3 (raw file):

- Consult the most recent `FullRangeTombstone` value; if an intent,
return `WriteIntentError`. Note that the same rules which apply to

This might tickle some new code paths because I don't think there's any other case in which you can get a WriteIntentError which isn't one of the keys you're writing. It should work though.


docs/RFCS/fast_delete_range.md, line 280 at r3 (raw file):

avoid any per-key manipulation over the entire lifecycle of `DROP
TABLE`. Implementing this will require a span-based extension to the
`GC` KV command.

Seems good to ping #7880.


docs/RFCS/fast_delete_range.md, line 336 at r3 (raw file):

tombstones with the underlying Range MVCC values.

# Alternatives #

Spurious #



@spencerkimball
Member Author

Review status: 1 of 2 files reviewed at latest revision, 21 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 111 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Can you explicitly walk through some time travel queries in your main example? In particular, it seems that without proper precautions, it could happen that you run a time travel query that aggregates over two ranges. When it reads the first range, the fast DeleteRange hasn't run yet. Before time travel touches the second, the DeleteRange runs on both ranges, and so on the second range the time travel query doesn't see data that's consistent with what it read on the first range.

I also wonder how you're undoing a PITR as Ben pointed at above. You're dealing with tombstones only, but it seems that you'd also want an Un-Tombstone, though that's sure to make things more complex. What was your answer to him there?

Your time travel query question is not a problem. To see why, consider the following example: Let's say the time travel is at time t5. At your first range, the query to any key in that range will ensure that no full-Range write can possibly be written at time <= t5 (because of the timestamp cache). When the fast DeleteRange now runs before the second range is queried at t5, it must have a tombstone timestamp > t5. That means that when the second range is queried at t5, the fast DeleteRange cannot possibly affect the time travel read.

Dealing with "undoing" a PITR was not contemplated in this document. However, it's fairly simple to see how they would work. If a revert is done from t6 to t4, doing a revert from t6,1 to t6,-1 would effectively undo the revert from t6. Consider the case of writes at t5:

  • revert from t6 to t4
  • revert from t6,1 to t6,-1
  • read at t7 will get MVCC t5, & most recent revert tombstone <= t7, which is the revert from t6,1; since there are no tombstones with timestamp < t6,-1, return t5

docs/RFCS/fast_delete_range.md, line 184 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

We should document this better at some point, but no specific action required here.

I will add to the .proto comments.


docs/RFCS/fast_delete_range.md, line 233 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I see. So you're only missing the logical component - I think LastUpdateNanos is the high water mark wall time component because we always AgeTo on every write. I haven't double checked that though, but those are definitely the semantics this field should have.

Well yes, I suppose that's accurate. The LastUpdateNanos should be the high water mark wall clock time. Although I'm not sure because a quick survey of the code is making it very difficult to determine where that value is being set!

In any case, I think it will be easier and less confusing to add another field and leave LastUpdateNanos independent and meant to apply only to aging the gc-able bytes.


docs/RFCS/fast_delete_range.md, line 280 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I'm on board with point-in-time recovery being a useful addition to the system and I see the allure of two birds with one stone. I have no concerns about RFC'ing the complete design. But I'm worried about making this a one-man show trying to implement it all in a fell swoop for 1.1 as there is a lot of test code to write. If I picture the worst case, then I see you rushing this feature in just before 1.1 closes (perhaps with CEO persuasion), and later others will have to go in and fix any bugs that crop up in production after painfully long investigation and limited window for a fix. In my ideal case we make this two-stage, you loop someone in who's not a member of the old wise men core team and work with them to get a solid solution for large drops into 1.1, and then continue with PITR for 1.2 (or 1.1, if things are much faster than anticipated). Of course there's also the OK case in which you do all the work and it sorta works out OK, perhaps great. But even then I'd love for someone new to grok this, too (similar how Dan got thrown into the storage bowels via backup-restore through Ben and myself). Curious to hear what others think.

These are valid concerns, but outside the scope of this RFC PR. The plan is to get the fast delete path working first. Then the PITR will need to be added as an enterprise feature with new SQL syntax and probably another RFC. These concerns will be amply addressed at that later date.


docs/RFCS/fast_delete_range.md, line 51 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Should make clear somewhere that the optimization only applies when there is no limit and no deleted keys are returned (mostly so that it's not forgotten about during implementation). Also, anything special about the first range? It contains a bunch of non-userspace keys. I assume we simply won't do this optimization for the first range, ever. (that should be OK since we split out the table anyway).

Added explicit mention of limit and returned keys exceptions.

I don't believe there is any reason to make an exception for the first range. The first range isn't the only one that contains non-userspace keys. All the time series data, all the meta2 records, node liveness, etc. They're all potentially located on ranges after the first. I don't think we should do anything different for system keyspace ranges. This faster version for full-Range DeleteRange should work the same way that the existing DeleteRange works now.


docs/RFCS/fast_delete_range.md, line 227 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

This might tickle some new code paths because I don't think there's any other case in which you can get a WriteIntentError which isn't one of the keys you're writing. It should work though.

It will return a WriteIntentError for the key you're trying to write. As you say, I don't think this is a problem.


docs/RFCS/fast_delete_range.md, line 280 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Seems good to ping #7880.

Looks like this comment accomplished that!


docs/RFCS/fast_delete_range.md, line 336 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Spurious #

Done.


Comments from Reviewable

@spencerkimball spencerkimball force-pushed the spencerkimball/delete-range-rfc branch from 8ef8879 to e840381 Compare June 7, 2017 19:21
@tbg
Member

tbg commented Jun 7, 2017

Reviewed 1 of 1 files at r4.
Review status: all files reviewed at latest revision, 19 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 111 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Your time travel query question is not a problem. To see why, consider the following example: Let's say the time travel is at time t5. At your first range, the query to any key in that range will ensure that no full-Range write can possibly be written at time <= t5 (because of the timestamp cache). When the fast DeleteRange now runs before the second range is queried at t5, it must have a tombstone timestamp > t5. That means that when the second range is queried at t5, the fast DeleteRange cannot possibly affect the time travel read.

Dealing with "undoing" a PITR was not contemplated in this document. However, it's fairly simple to see how it would work. If a revert is done from t6 to t4, doing a revert from t6,1 to t6,-1 would effectively undo the revert from t6. Consider the case of writes at t5:

  • revert from t6 to t4
  • revert from t6,1 to t6,-1
  • a read at t7 sees the MVCC value at t5 and the most recent revert tombstone <= t7, which is the revert from t6,1; since there are no tombstones with timestamp < t6,-1, return t5

Ah, gotcha. I didn't take into account that only the tombstones which are visible from your timestamp are active.


docs/RFCS/fast_delete_range.md, line 280 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

These are valid concerns, but outside the scope of this RFC PR. The plan is to get the fast delete path working first. Then the PITR will need to be added as an enterprise feature with new SQL syntax and probably another RFC. These concerns will be amply addressed at that later date.

SGTM


docs/RFCS/fast_delete_range.md, line 51 at r3 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Added explicit mention of limit and returned keys exceptions.

I don't believe there is any reason to make an exception for the first range. The first range isn't the only one that contains non-userspace keys. All the time series data, all the meta2 records, node liveness, etc. They're all potentially located on ranges after the first. I don't think we should do anything different for system keyspace ranges. This faster version for full-Range DeleteRange should work the same way that the existing DeleteRange works now.

For example, we are going through some trouble to make sure there aren't ever intents on the node liveness table (right?). An intent on the DeleteRange key would behave like an intent on any of the liveness table keys. It does seem that, at the very least, this exposes us to a lot of new situations.


docs/RFCS/fast_delete_range.md, line 163 at r4 (raw file):

transaction or non-transactional command's timestamp. Also,
`StartTimestamp` for a point-in-time recovery tombstone must be more
recent than the Range's `GCThreshold`.

I think this warrants an additional couple of words. Say you're anchored with your transaction far away from the range deletions, and when you've done those, you come back for the commit. How are you going to make any guarantees about what the GCThreshold is on these ranges? It's the GC Queue's duty to make sure things work out in that case. So it's probably easiest to discuss it there and mention here only that at the time of writing the intent you fail immediately if you're already violating the GC Threshold (or TxnSpanGCThreshold, as usual).


docs/RFCS/fast_delete_range.md, line 266 at r4 (raw file):

Because MVCC garbage collection scans through keys for GC'able keys
using the `MVCCIterate` command, it benefits from the work done at the
MVCC level to merge in the full-Range tombstones. However, it makes

Is that really true? The tombstone layer makes it tricky to follow what happens. The GC Queue currently iterates over actual on-disk KV pairs but with tombstones, wouldn't you skip some? For example, if you have gcable kvs at t=1, 5, 10 and you have a tombstone for 3-6, wouldn't it look like only t=1,10 had kv pairs and you'd never delete t=5? Perhaps an example can clarify.

Also worth discussing how a tombstone itself is GC'ed and how it influences what other values can be GC'ed. For example, a write at t=1, a deletion of that write at t=10, and a range tombstone 8-11. So t=1 is visible forever, but will we ever delete the deletion (which could be a lot of other operations on top of each other as well) at t=10? I guess that again boils down to how the iteration over KV pairs works then. You could remove a tombstone if no KV pairs are within its enclosed time interval. And you could remove any KV pairs that are shadowed by a tombstone (at least until un-deletion is a thing). Hope the MVCC work really solves all of these!


Comments from Reviewable

@bdarnell
Contributor

bdarnell commented Jun 7, 2017

Review status: all files reviewed at latest revision, 17 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 233 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Well yes, I suppose that's accurate. The LastUpdateNanos should be the high water mark wall clock time. Although I'm not sure because a quick survey of the code is making it very difficult to determine where that value is being set!

In any case, I think it will be easier and less confusing to add another field and leave LastUpdateNanos independent and meant to apply only to aging the gc-able bytes.

It will be cleaner to add a new field, but much trickier due to the migration requirements.

Do we AgeTo on every write? That wasn't clear to me. As long as we do, I think we can work around the loss of precision by returning WriteTooOldError when a DeleteRange is proposed with the same HLC walltime component as an existing write. That may be a false positive and lead to spurious retries, but I think we can live with that.
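
A minimal sketch of that workaround, with invented names (only the comparison matters): since `MVCCStats.LastUpdateNanos` records just a wall-time high-water mark, the check has to be conservative about logical-component ties.

```go
package sketch

import "fmt"

// checkFullRangeDelete is a hypothetical pre-check for the fast path:
// because MVCCStats.LastUpdateNanos records only a wall-time high-water
// mark, reject any full-Range tombstone whose wall time is not strictly
// newer. A tie may be a false positive (the existing write could differ
// only in the HLC logical component), causing a spurious retry.
func checkFullRangeDelete(proposedWallNanos, lastUpdateNanos int64) error {
	if proposedWallNanos <= lastUpdateNanos {
		return fmt.Errorf("WriteTooOldError: proposed %d <= last update %d; retry above it",
			proposedWallNanos, lastUpdateNanos)
	}
	return nil
}
```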


docs/RFCS/fast_delete_range.md, line 51 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

For example, we are going through some trouble to make sure there aren't ever intents on the node liveness table (right?). An intent on the DeleteRange key would behave like an intent on any of the liveness table keys. It does seem that, at the very least, this exposes us to a lot of new situations.

Yeah, but a DeleteRange that covers the node liveness keys would write intents to them with or without this optimization. Not doing a DeleteRange over the liveness keys is just part of what it means to "go through some trouble to make sure there aren't ever intents on the node liveness table". I don't think this needs a special case.


docs/RFCS/fast_delete_range.md, line 163 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I think this warrants an additional couple of words. Say you're anchored with your transaction far away from the range deletions, and when you've done those, you come back for the commit. How are you going to make any guarantees about what the GCThreshold is on these ranges? It's the GC Queue's duty to make sure things work out in that case. So it's probably easiest to discuss it there and mention here only that at the time of writing the intent you fail immediately if you're already violating the GC Threshold (or TxnSpanGCThreshold, as usual).

This is kind of scary: a GC queue run during a REVERT transaction could lead to inconsistent data. I think this warrants a check in the GC queue, to not do any MVCC GC while an intent exists for a full-range tombstone (or potentially to look into the intent and incorporate it into the decisions about what is GC'able, but it's simpler and safer to just wait for the intent to resolve)


docs/RFCS/fast_delete_range.md, line 362 at r4 (raw file):

the scope of this RFC.

- Because `MVCCStats` is serialized below Raft, it will require an

I think the migration story belongs in the RFC; I wouldn't want to approve it without one. (I think it's aggressive to even send it to the final comment period without this).


Comments from Reviewable

@tbg
Member

tbg commented Jun 7, 2017

Review status: all files reviewed at latest revision, 17 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 51 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Yeah, but a DeleteRange that covers the node liveness keys would write intents to them with or without this optimization. Not doing a DeleteRange over the liveness keys is just part of what it means to "go through some trouble to make sure there aren't ever intents on the node liveness table". I don't think this needs a special case.

Step 1: make cluster
Step 2: make table
Step 3: for some reason, things don't split the way they're supposed to
Step 4: DROP TABLE foo;
Step 5: synthesized intents everywhere.

I just think it'd be nice to a) point out why it's a problem and b) make sure it's not becoming a problem. Sure, we own the KV store, but at the end who's to know what's allowed and what isn't? It's already anyone's guess today unless you're willing to think hard thoughts. Remember how we introduced node liveness under the assumption that there could never be intents on it, but then found that in practice intents could still be on it? This could be similar; I just want to make sure we make this thorough, especially since it should make sense to anyone not deep in the weeds of local/range-local/system keyspaces. (I find it confusing enough already).


docs/RFCS/fast_delete_range.md, line 362 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think the migration story belongs in the RFC; I wouldn't want to approve it without one. (I think it's aggressive to even send it to the final comment period without this).

I second the concern about this already being in final comment period. That's a good hammer to get folks to actually look at it (sure worked for me), but at least I personally don't feel that I have grokked the suggestions here well enough to stamp them.


Comments from Reviewable

@bdarnell
Contributor

bdarnell commented Jun 7, 2017

Review status: all files reviewed at latest revision, 17 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 51 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Step 1: make cluster
Step 2: make table
Step 3: for some reason, things don't split the way they're supposed to
Step 4: DROP TABLE foo;
Step 5: synthesized intents everywhere.

I just think it'd be nice to a) point out why it's a problem and b) make sure it's not becoming a problem. Sure, we own the KV store, but at the end who's to know what's allowed and what isn't? It's already anyone's guess today unless you're willing to think hard thoughts. Remember how we introduced node liveness under the assumption that there could never be intents on it, but then found that in practice intents could still be on it? This could be similar; I just want to make sure we make this thorough, especially since it should make sense to anyone not deep in the weeds of local/range-local/system keyspaces. (I find it confusing enough already).

DROP TABLE foo doesn't say DeleteRange(range 1), it says DeleteRange(/Table/50, /Table/51). It's the storage layer that checks the keys passed in, validates them against the range boundaries, and decides whether to write a full-range tombstone or use the iteration-based DeleteRange and write per-key tombstones.


Comments from Reviewable

@tbg
Member

tbg commented Jun 7, 2017

Review status: all files reviewed at latest revision, 17 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 51 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

DROP TABLE foo doesn't say DeleteRange(range 1), it says DeleteRange(/Table/50, /Table/51). It's the storage layer that checks the keys passed in, validates them against the range boundaries, and decides whether to write a full-range tombstone or use the iteration-based DeleteRange and write per-key tombstones.

You don't have to convince me it's correct for that particular use case we have in mind right now, but we've done remarkably little for other team members to come around and to know what's acceptable and what's not (after all, other team members == set of people who use KV). What I'm trying to do here is to get a little more detail in somewhere. Perhaps this isn't the right place, but we should document internals whenever we can. For example, why not use this operation to later purge time series ranges? Because those use merges. I don't know how it works for you, but every time I have to interact with local keys in any way there is considerable effort in paging in the messy details. I would love to be able to come back to this RFC at least find some information or pointers on how fast DeleteRange works outside of user-land if that ever becomes relevant.


Comments from Reviewable

@spencerkimball spencerkimball force-pushed the spencerkimball/delete-range-rfc branch from e840381 to 19cdbf0 Compare June 7, 2017 22:15
@spencerkimball
Member Author

Review status: 1 of 2 files reviewed at latest revision, 17 unresolved discussions.


docs/RFCS/fast_delete_range.md, line 233 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It will be cleaner to add a new field, but much trickier due to the migration requirements.

Do we AgeTo on every write? That wasn't clear to me. As long as we do, I think we can work around the loss of precision by returning WriteTooOldError when a DeleteRange is proposed with the same HLC walltime component as an existing write. That may be a false positive and lead to spurious retries, but I think we can live with that.

OK, I'll switch to using this suggestion. Will accompany it with a significant comment in the mvcc.proto file around LastUpdateNanos.


docs/RFCS/fast_delete_range.md, line 51 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

You don't have to convince me it's correct for that particular use case we have in mind right now, but we've done remarkably little for other team members to come around and to know what's acceptable and what's not (after all, other team members == set of people who use KV). What I'm trying to do here is to get a little more detail in somewhere. Perhaps this isn't the right place, but we should document internals whenever we can. For example, why not use this operation to later purge time series ranges? Because those use merges. I don't know how it works for you, but every time I have to interact with local keys in any way there is considerable effort in paging in the messy details. I would love to be able to come back to this RFC at least find some information or pointers on how fast DeleteRange works outside of user-land if that ever becomes relevant.

I agree in principle, but this concern seems misplaced. Why would we ever expect to receive a DeleteRange for a system keyspace range? Whether or not we have the fast path implementation described here, that would be the end of your database. This doesn't feel like an appropriate place to put a check in for that kind of insanity.


docs/RFCS/fast_delete_range.md, line 163 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This is kind of scary: a GC queue run during a REVERT transaction could lead to inconsistent data. I think this warrants a check in the GC queue, to not do any MVCC GC while an intent exists for a full-range tombstone (or potentially to look into the intent and incorporate it into the decisions about what is GC'able, but it's simpler and safer to just wait for the intent to resolve)

The GC Queue cannot proceed with the GC until it has "pushed" the transaction responsible for any intent on the full-Range tombstone key. That means that the GCThreshold for the range cannot change until after the transaction which created the full-Range tombstone has completed.

I've added a little more verbiage about an immediate failure if the addition of the revert tombstone would revert to an earlier timestamp than the GCThreshold.


docs/RFCS/fast_delete_range.md, line 266 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Is that really true? The tombstone layer makes it tricky to follow what happens. The GC Queue currently iterates over actual on-disk KV pairs but with tombstones, wouldn't you skip some? For example, if you have gcable kvs at t=1, 5, 10 and you have a tombstone for 3-6, wouldn't it look like only t=1,10 had kv pairs and you'd never delete t=5? Perhaps an example can clarify.

Also worth discussing how a tombstone itself is GC'ed and how it influences what other values can be GC'ed. For example, a write at t=1, a deletion of that write at t=10, and a range tombstone 8-11. So t=1 is visible forever, but will we ever delete the deletion (which could be a lot of other operations on top of each other as well) at t=10? I guess that again boils down to how the iteration over KV pairs works then. You could remove a tombstone if no KV pairs are within its enclosed time interval. And you could remove any KV pairs that are shadowed by a tombstone (at least until un-deletion is a thing). Hope the MVCC work really solves all of these!

You're right this is more complicated than just using MVCCIterate. I looked again at the code in gc_queue.go and we only use MVCCIterate for the local key range – so I've removed the original paragraph from the RFC and replaced it with a better thought out plan.


docs/RFCS/fast_delete_range.md, line 362 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I second the concern about this already being in final comment period. That's a good hammer to get folks to actually look at it (sure worked for me), but at least I personally don't feel that I have grokked the suggestions here well enough to stamp them.

Removed.


Comments from Reviewable

@bdarnell
Contributor

Review status: all files reviewed at latest revision, 24 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 280 at r1 (raw file):

I'm comfortable with the additional complexity for point-in-time recovery

I'm not. We've uncovered a lot of subtle interactions (mainly with GC) while discussing this, and I don't think we're done yet. If concerns are about REVERT are outside the scope of this RFC, then let's strip the RFC down to just the DeleteRange portions.


docs/RFCS/fast_delete_range.md, line 163 at r4 (raw file):

The GC Queue cannot proceed with the GC until it has "pushed" the transaction responsible for any intent on the full-Range tombstone key.

That's not how it works today. The GC queue considers intents and old MVCC versions separately; it can GC old versions and advance the threshold even when intents exist that are not yet resolved. It's the command queue interactions on the GC threshold key that make this safe.

But in any case, my concern was misplaced above - I misremembered how the MVCC GC worked and was afraid we would allow the first version past the threshold to be GC'd incorrectly. I'm no longer sure that we need to block GC when a range tombstone intent exists.


docs/RFCS/fast_delete_range.md, line 362 at r4 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Removed.

Um, removing the "unresolved question" about the migration story doesn't resolve it. We've avoided part of the problem by not adding a new field to MVCCStats, but since this is changing the behavior of all the existing KV commands this will still be tricky to roll out. During a rolling upgrade, a DeleteRange issued on a node running a new version will write tombstones that can't be interpreted by nodes running older versions. How will we gate this so that the new tombstones aren't used until the upgrade is complete (or optionally longer, if the admin wants to retain the ability to downgrade)?

This is what I was talking about last week when I said a command like cockroach upgrade 1.1 (to indicate the completion of a rolling upgrade) was all but inevitable.


docs/RFCS/fast_delete_range.md, line 239 at r5 (raw file):

timestamp is older than the most recent full-Range tombstone, return
`WriteTooOldError`.
- [**For full-Range tombstones ONLY**]: if `MVCCStats.IntentCount` is

How do REVERTs interact with existing intents? (They can't fall back to a traditional DeleteRange)


docs/RFCS/fast_delete_range.md, line 275 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

The GCQueue does not do its evaluation within Raft, so what are "the versions of the tombstone"? A new one might be written while you're doing this work and it could invalidate your assumptions. I.e. you operate on the assumption that there's only a rollback from 10 to 5, but in fact there is a new one from 10 to 0 after you started, but you go and GC things that are naked only at t<0 but not t<5. I think we need more here.

Yeah, this seems insufficient since the GCQueue is neither transactional nor synchronized with raft. As I said above I'm not sure we need this, but if we do, we need more synchronization than is described here.


docs/RFCS/fast_delete_range.md, line 277 at r5 (raw file):

versioned values to appropriately garbage collect versions older than
the threshold, while taking care to preserve version(s) older than
the GC threshold that are still visible as a result of revert tombstones.

We need to keep versions that are hidden by revert tombstones in addition to those that are kept visible by them (thanks to the ability to revert a revert, any covered versions may be uncovered, even if they are older than the GC threshold).
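
A toy sketch of the resulting keep rule, using plain integer timestamps and invented names (a sketch under these assumptions, not the RFC's algorithm): a version must survive GC if any live reversion either currently exposes it or hides it, since a later revert-of-revert could expose it again.

```go
package sketch

// reversion is a hypothetical full-Range revert record: reads at or above
// writtenAt resolve against the MVCC state at revertTo (plain wall nanos
// here for simplicity).
type reversion struct{ writtenAt, revertTo int64 }

// mustKeep reports whether the version at time v may still be needed.
// newestAtOrBelow returns the newest version timestamp of the same key at
// or below a given time; the caller supplies it.
func mustKeep(v int64, revs []reversion, newestAtOrBelow func(int64) int64) bool {
	for _, r := range revs {
		if v > r.revertTo && v <= r.writtenAt {
			// Hidden by r today, but a later revert of r could expose it.
			return true
		}
		if v == newestAtOrBelow(r.revertTo) {
			// The version a read through r actually resolves to.
			return true
		}
	}
	return false
}
```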


docs/RFCS/fast_delete_range.md, line 279 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Can you specify what "older than" means? I take it you mean the timestamp at which the respective tombstone was written. I don't think it's so easy. If you have a tombstone written at t=10 which reverts from t=10 to t=5 then you want that to stay in effect forever. Or are you talking about a tombstone that has been deleted? Then perhaps, but that should be mentioned here (you also don't talk about deleting tombstones anywhere else, so it deserves some discussion - does anything happen there?)

And we clearly have to GC data at some point when there is a live tombstone. We can't have our GC Queue forever stuck at t=5 because there's a live tombstone that reverts from t=10 to t=5. It does want to raise the GCThreshold above the tombstone but that makes semantics more complicated - you want to keep some version below t=5 (unless there's a deletion not affected by the tombstone higher up for that key), but then you can remove things between 5 and 10, and above 10 things are back to normal again.

Would really be good to go through some of the examples here like you do for the reads, arguably this section has way more complications than reads do and so should be fleshed out more.

We always keep one version beyond the GC threshold. We don't have to hold the GC threshold back because there are old range tombstones, but we do have to keep old tombstones around as long as they cover any extant versions.


Comments from Reviewable

@danhhz
Contributor

danhhz commented Jun 15, 2017

Review status: all files reviewed at latest revision, 24 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 280 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'm comfortable with the additional complexity for point-in-time recovery

I'm not. We've uncovered a lot of subtle interactions (mainly with GC) while discussing this, and I don't think we're done yet. If concerns are about REVERT are outside the scope of this RFC, then let's strip the RFC down to just the DeleteRange portions.

The point in time REVERT is scheduled for 1.1 (and fills a pretty embarrassing hole in our disaster recovery story). If we decide that it's outside the scope of this rfc, then we're likely cutting it from 1.1. Are we okay with doing that?

At the very least, I propose that we include the details we need to support it in this rfc and implement them alongside the fast DROP and TRUNCATE. Then, if we're concerned about the ruggedization of REVERT when August/September rolls around, we can turn it off for the release


Comments from Reviewable

@bdarnell
Contributor

Review status: all files reviewed at latest revision, 24 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 280 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

The point in time REVERT is scheduled for 1.1 (and fills a pretty embarrassing hole in our disaster recovery story). If we decide that it's outside the scope of this rfc, then we're likely cutting it from 1.1. Are we okay with doing that?

At the very least, I propose that we include the details we need to support it in this rfc and implement them alongside the fast DROP and TRUNCATE. Then, if we're concerned about the ruggedization of REVERT when August/September rolls around, we can turn it off for the release

I'm not sure I agree that this is a "pretty embarrassing hole" - many other databases don't have an MVCC-based rollback like this; they generally solve the point-in-time recovery problem by streaming transaction logs to a backup location, which requires an expensive restore process to actually use. Our SELECT AS OF SYSTEM TIME alternative isn't much worse in practice than what you get with other databases.


Comments from Reviewable

@danhhz
Contributor

danhhz commented Jun 15, 2017

Review status: all files reviewed at latest revision, 24 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 280 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'm not sure I agree that this is a "pretty embarrassing hole" - many other databases don't have an MVCC-based rollback like this; they generally solve the point-in-time recovery problem by streaming transaction logs to a backup location, which requires an expensive restore process to actually use. Our SELECT AS OF SYSTEM TIME alternative isn't much worse in practice than what you get with other databases.

Except we don't have a streaming transaction log yet, so there's no way in a data loss scenario to restore to a point after the last backup. The data loss story is what I meant by embarrassing hole, not so much the lack of an mvcc rollback. We've encountered a decent amount of concern over this from potential customers


Comments from Reviewable

@bdarnell
Contributor

Review status: all files reviewed at latest revision, 24 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 280 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

Except we don't have a streaming transaction log yet, so there's no way in a data loss scenario to restore to a point after the last backup. The data loss story is what I meant by embarrassing hole, not so much the lack of an mvcc rollback. We've encountered a decent amount of concern over this from potential customers

Yes, but data loss from what failure mode? We have synchronous replication to protect against single-node failure and time travel queries to protect against application bugs or administrative mistakes. This greatly reduces the need for a streaming backup/point-in-time-recovery mechanism. There are still cases where you'd need this (multiple hardware failures or bugs in cockroachdb itself), but I'm not sure this is severe enough to make the proposed change worth the complexity and risk.


Comments from Reviewable

@danhhz
Contributor

danhhz commented Jun 15, 2017

Review status: all files reviewed at latest revision, 24 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 280 at r1 (raw file):

application bugs or administrative mistakes

The time it takes us to recover from these using a time travel query is pretty bad right now. And I've seen them happen often enough in the past that this matters.

complexity and risk.

Anyway, it sounds like you feel more strongly about this than I realized. Probably time to take a step back.

I can't say I have a good enough sense of core to understand why this is so risky. But if we're going to bump point in time revert from 1.1, then we should do it sooner than later. From what I understand product+sales have been pitching this as a 1.1 feature (and I'll also schedule my time differently if this is not on my plate for 1.1). Not saying that decision has to be made in this comment thread, but the stakeholders should probably talk this over


Comments from Reviewable

@spencerkimball spencerkimball force-pushed the spencerkimball/delete-range-rfc branch from 19cdbf0 to d163fe3 Compare June 21, 2017 22:24
@spencerkimball
Member Author

Review status: all files reviewed at latest revision, 24 unresolved discussions, all commit checks successful.


docs/RFCS/fast_delete_range.md, line 280 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

application bugs or administrative mistakes

The time it takes us to recover from these using a time travel query is pretty bad right now. And I've seen them happen often enough in the past that this matters.

complexity and risk.

Anyway, it sounds like you feel more strongly about this than I realized. Probably time to take a step back.

I can't say I have a good enough sense of core to understand why this is so risky. But if we're going to bump point in time revert from 1.1, then we should do it sooner than later. From what I understand product+sales have been pitching this as a 1.1 feature (and I'll also schedule my time differently if this is not on my plate for 1.1). Not saying that decision has to be made in this comment thread, but the stakeholders should probably talk this over

I am starting to think of this feature as specifically about reversions, and have updated and renamed the RFC to reflect that. While we will be able to use it to dramatically improve our TRUNCATE and DROP performance for 1.1, I think a better long-term solution for those will be to move such tables into a graveyard for eventual GC and, in the case of TRUNCATE, start with a fresh table ID. This is attractive because it doesn't require per-range activity and because it gives us virgin keyspace for successive writes. This immediately re-enables the BlindPut optimization, for example.


docs/RFCS/fast_delete_range.md, line 163 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The GC Queue cannot proceed with the GC until it has "pushed" the transaction responsible for any intent on the full-Range tombstone key.

That's not how it works today. The GC queue considers intents and old MVCC versions separately; it can GC old versions and advance the threshold even when intents exist that are not yet resolved. It's the command queue interactions on the GC threshold key that make this safe.

But in any case, my concern was misplaced above - I misremembered how the MVCC GC worked and was afraid we would allow the first version past the threshold to be GC'd incorrectly. I'm no longer sure that we need to block GC when a range tombstone intent exists.

This is explicitly spelled out now. We still have to push any intent on the full-Range tombstone key in order to ensure that changes to the GCThreshold are accounted for when accommodating reverts.


docs/RFCS/fast_delete_range.md, line 362 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Um, removing the "unresolved question" about the migration story doesn't resolve it. We've avoided part of the problem by not adding a new field to MVCCStats, but since this is changing the behavior of all the existing KV commands this will still be tricky to roll out. During a rolling upgrade, a DeleteRange issued on a node running a new version will write tombstones that can't be interpreted by nodes running older versions. How will we gate this so that the new tombstones aren't used until the upgrade is complete (or optionally longer, if the admin wants to retain the ability to downgrade)?

This is what I was talking about last week when I said a command like cockroach upgrade 1.1 (to indicate the completion of a rolling upgrade) was all but inevitable.

This is no longer a problem since I've introduced a new Revert KV command. The rolling upgrade to this functionality will not require any specialized migration.


docs/RFCS/fast_delete_range.md, line 133 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I think it would be better to rename Timestamp as well because there is already the timestamp at which this tombstone was written, and it's inviting confusion. Talking about "the tombstone's timestamp" should be its MVCC timestamp, not Tombstone.Timestamp.

Optional bike-shed suggestion: Tombstone.Revert{From,To}, where From <= To

Done.


docs/RFCS/fast_delete_range.md, line 239 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

How do REVERTs interact with existing intents? (They can't fall back to a traditional DeleteRange)

Good point. On second thought, we should probably not revert to normal DeleteRange – better to instead do a scan to clear the intents and retry the full-Range delete/revert.

Aside from the fact that per-key revert tombstones would be an unacceptable complication, we expect these full-Range actions on quiescent or, at the very least, mostly-quiescent tables, so I think this approach is better than gumming the works up with per-key writes.

I've changed the RFC to instead introduce a new KV command: Revert, which works only on full Ranges and returns an IllegalRevertError in the event of a partial Range invocation. The Revert command works for truncates and drops by setting RevertTo = 0.
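
A rough sketch of the proposed command's guard, with invented Go names (the actual proto and evaluation code would differ): the structural requirement is that the request span exactly matches the Range's bounds, with `RevertTo == 0` expressing a full delete for drops and truncates.

```go
package sketch

import (
	"bytes"
	"errors"
)

// RevertRequest is a hypothetical stand-in for the proposed Revert KV
// command: revert the Range's visible state back to time RevertTo.
type RevertRequest struct {
	StartKey, EndKey []byte
	RevertTo         int64 // 0 means delete everything (DROP/TRUNCATE)
}

// evalRevert refuses partial-Range invocations, mirroring the proposed
// IllegalRevertError, and would otherwise write the full-Range reversion.
func evalRevert(req RevertRequest, rangeStart, rangeEnd []byte) error {
	if !bytes.Equal(req.StartKey, rangeStart) || !bytes.Equal(req.EndKey, rangeEnd) {
		return errors.New("IllegalRevertError: Revert must cover the entire Range")
	}
	// ...write the full-Range reversion at a timestamp above all existing
	// values; if intents exist, return a "range busy" error instead.
	return nil
}
```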


docs/RFCS/fast_delete_range.md, line 267 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Can you make it clearer that this is a proposed change? As written, it gives the impression that this is already how the GCQueue handles intents (not true).

Done.


docs/RFCS/fast_delete_range.md, line 269 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

The way GCThreshold currently works is that it there may well be GC'able keys that are below the threshold - the threshold is simply set to now-TTL. Trying to be authoritative here is hard because it means you can't ever batch these GC Requests (or rather, you have to batch them by artificially lowering the target GCThreshold until you're comfortable you can do it in a single batch).

So what would happen is that the GCThreshold gets set higher, data you're exposing again gets (at least partially) deleted, and you do in fact return incorrect results.

With batching, we still set the GCThreshold = now-TTL on the first KV GC batch. So think of GCThreshold as a high water mark on GC done to any part of the Range. In conjunction with the check described below, where we verify that the FullRangeReversions value hasn't changed when evaluating the GC KV command, I don't see how we would return incorrect results.

Could you give a more detailed example, or else maybe stop by my desk to discuss?


docs/RFCS/fast_delete_range.md, line 275 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Yeah, this seems insufficient since the GCQueue is neither transactional nor synchronized with raft. As I said above I'm not sure we need this, but if we do, we need more synchronization than is described here.

The "versions of the tombstone" are the ones in the snapshot of the Range used in the GC queue's deliberations.

But your point is valid. The reality is we can't GC anything if a revert might be added (between GC queue and GC command) which takes us back to any arbitrary GC'able timestamp. The solution to this is to pass the most recent timestamp of FullRangeReversions and then verify it hasn't changed when we evaluate the GC command.
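
Sketched below with invented names, that guard amounts to an optimistic concurrency check: the GC queue records the newest reversion timestamp while planning, and the GC command re-verifies it at evaluation time, bailing out if a new reversion landed in between.

```go
package sketch

import "errors"

// evalGC is a hypothetical evaluation-time guard for the GC command:
// plannedNewest is the newest FullRangeReversions timestamp the GC queue
// observed while planning; currentNewest is what the Range holds when the
// command evaluates. Any difference means a reversion raced in, so the
// planned deletions may no longer be safe and are abandoned.
func evalGC(plannedNewest, currentNewest int64) error {
	if plannedNewest != currentNewest {
		return errors.New("full-Range reversions changed since GC was planned; retry")
	}
	// ...apply the planned garbage collection...
	return nil
}
```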


docs/RFCS/fast_delete_range.md, line 277 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We need to keep versions that are hidden by revert tombstones in addition to those that are kept visible by them (thanks to the ability to revert a revert, any covered versions may be uncovered, even if they are older than the GC threshold).

Yes good point. Added a section on the GC algorithm for clarity.


docs/RFCS/fast_delete_range.md, line 298 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Seems a bit misleading - if we're doing something slightly tricky with this timestamp as Ben suggested, we shouldn't turn it into an HLC timestamp in the first place (which makes it look like it's actually updated with HLC high water marks). Either way, I'm missing Ben's sidenote above:

I think we can work around the loss of precision by returning WriteTooOldError when a DeleteRange is proposed with the same HLC walltime component as an existing write. That may be a false positive and lead to spurious retries, but I think we can live with that.

OK, cleaned this up.


Comments from Reviewable

@spencerkimball spencerkimball force-pushed the spencerkimball/delete-range-rfc branch from d163fe3 to 0277de1 Compare June 21, 2017 22:44
@bdarnell
Contributor

Don't forget to change the PR title. (renaming the file mid-review makes the comments harder to follow; it would have been better to leave it and rename just before merging)

I'm still unsatisfied with this RFC. The justification given is still focused on speeding up DeleteRange, even though you say that the long-term solution for DeleteRange should involve replacing the table ID. If that's the goal, let's have an RFC for that approach first, and then we can consider if this might be a more achievable short-term solution. (I think it might be if we limit the mechanism to delete/revert-to-zero. I think the full reversion functionality still has some edge cases to work out and is too complicated to justify by the need for faster deletion, especially if it's going to be replaced by a different way to speed up deletes).

If the goal is fast point in time reversions, then the approach of this RFC is (probably) the right one. But in that case at least the "motivation" section needs to be rewritten to justify this feature on its own terms instead of dwelling on the limitations of DeleteRange. I am not convinced that this feature is high enough priority to warrant the high-risk work it involves, and the opportunity costs of the core team's time.


Review status: 0 of 2 files reviewed at latest revision, 28 unresolved discussions, all commit checks successful.


docs/RFCS/revert.md, line 13 at r6 (raw file):

transactional truncations and point-in-time recoveries of full
Ranges. This supports fast, transactional `DROP TABLE`, `DROP INDEX`,
and `TRUNCATE TABLE` SQL statements, as well as an as-yet-unspecified

Now that this is all about revert, what is left unspecified? Just the SQL syntax?


docs/RFCS/revert.md, line 30 at r6 (raw file):

# Motivation

The primary goal for the `Revert` command is more efficient table

Is this still the primary goal? I think most of this background about DeleteRange is obsolete, and we need new motivation focused on the "revert" use case.


docs/RFCS/revert.md, line 151 at r6 (raw file):

To revert the previous revert at time now, the full-Range reversion is:

`now` is confusing here since it's referring to a time in the past. Previously, you said that RevertFrom would always be the transaction timestamp, but I don't think that would be possible here. Are you now allowing reversions to be inserted in the past? That doesn't seem safe.

This will also revert anything else with the same timestamp. These timestamp collisions are possible thanks to HLC ratchets (they were the cause of one of our Jepsen failures last fall).


docs/RFCS/revert.md, line 163 at r6 (raw file):

and accept intent resolutions. When reading, the value of the
`FullRangeReversions` is read at the transaction / command time, just
like the underlying KV data. Note that the `Timestamp` is set to the

I got confused again by "just like the underlying KV data". What's tripping me up is that I'm interpreting that to mean "just like the underlying KV data is read today" while you're meaning that it will be read by recursively applying the "reading with full-range reversions" below.


docs/RFCS/fast_delete_range.md, line 362 at r4 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

This is no longer a problem since I've introduced a new Revert KV command. The rolling upgrade to this functionality will not require any specialized migration.

DROP/TRUNCATE TABLE cannot issue Revert commands as long as there are nodes in the cluster that do not understand that command, so a migration strategy is still necessary.


docs/RFCS/fast_delete_range.md, line 133 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Done.

Tobi's suggestion was From <= To; as of r6 this RFC is using To <= From.


docs/RFCS/fast_delete_range.md, line 239 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Good point. On second thought, we should probably not revert to normal DeleteRange – better to instead do a scan to clear the intents and retry the full-Range delete/revert.

Aside from the fact that per-key revert tombstones would be an unacceptable complication, we expect these full-Range actions on quiescent or, at the very least, mostly-quiescent tables, so I think this approach is better than gumming the works up with per-key writes.

I've changed the RFC to instead introduce a new KV command: Revert, which works only on full Ranges and returns an IllegalRevertError in the event of a partial Range invocation. The Revert command works for truncates and drops by setting RevertTo = 0.

Where would that scan and retry occur?


Comments from Reviewable

@tbg
Member

tbg commented Jun 23, 2017

Hmm, shame the rename made this basically a new review. I can't say much other than that we could probably make all of this work somehow, but like Ben I think this is a pretty drastic change and I can't quite see why we would embark on it at this stage. I second Ben's suggestion to flesh out the table ID RFC because that seems to address an urgent need and is perhaps more straightforward.


Reviewed 4 of 4 files at r6.
Review status: all files reviewed at latest revision, 29 unresolved discussions, all commit checks successful.


docs/RFCS/revert.md, line 128 at r6 (raw file):

type Reversion struct {
    RevertFrom hlc.Timestamp // Delete values before this timestamp
    RevertTo   hlc.Timestamp // Revert to values at or before this timestamp
}

I'm still confused by this terminology. "From timestamp1 to timestamp2" means typically that timestamp1 < timestamp2. Here it's the other way around.


docs/RFCS/revert.md, line 265 at r6 (raw file):

possible key in the Range.

## MVCC Garbage Collection

Is this section aware of the fact that reversions can be reverted?


docs/RFCS/revert.md, line 405 at r6 (raw file):

Handling Range splits is straightforward: just duplicate the
full-Range reversion versions. It's easy to see how this is correct:

and if there's an intent? You don't want to duplicate that. Do you not allow splits while there's an intent?


Comments from Reviewable

@spencerkimball spencerkimball changed the title rfc: Fast DeleteRange Command rfc: Revert Command Jul 5, 2017
@spencerkimball spencerkimball force-pushed the spencerkimball/delete-range-rfc branch from 0277de1 to f0452b8 Compare July 6, 2017 23:11
@spencerkimball
Member Author

I've improved the motivation section to make it clear that the primary goal of this RFC is point in time recovery.

The reason we're embarking on this is because it's a missing piece in our backup / restore story. It's been requested repeatedly by customers with enterprise production deployments in mind. PIT recovery is an important component of most disaster recovery plans.


Review status: 1 of 2 files reviewed at latest revision, 28 unresolved discussions.


docs/RFCS/revert.md, line 13 at r6 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Now that this is all about revert, what is left unspecified? Just the SQL syntax?

Removed this verbiage in the cleanup effort. I added a small section in the detailed design for proposed syntax.


docs/RFCS/revert.md, line 30 at r6 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Is this still the primary goal? I think most of this background about DeleteRange is obsolete, and we need new motivation focused on the "revert" use case.

Fixed.


docs/RFCS/revert.md, line 128 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I'm still confused by this terminology. "From timestamp1 to timestamp2" means typically that timestamp1 < timestamp2. Here it's the other way around.

Yes, these are the other way around. It makes more sense to me that RevertFrom > RevertTo for the case of "revert". You'd be mangling the English language to make RevertTo refer to the later timestamp. If you changed the sign outside your office from "Tobias" to "T. Schottdorf", and then reconsidered, it would be correct to say, 'I'm going to revert my office sign from "T. Schottdorf" to "Tobias".'

In any case, I've removed the RevertFrom timestamp, as that was just redundant with the timestamp at which the MVCC value for the most recent reversion was written.


docs/RFCS/revert.md, line 151 at r6 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

now is confusing here since it's referring to a time in the past. Previously, you said that RevertFrom would always be the transaction timestamp, but I don't think that would be possible here. Are you now allowing reversions to be inserted in the past? That doesn't seem safe.

This will also revert anything else with the same timestamp. These timestamp collisions are possible thanks to HLC ratchets (they were the cause of one of our jepsen failures last fall).

Removed the use of `now` in favor of `t1` and `t2`.

No, you cannot insert reversions in the past (where "past" is defined as anything older than the most recent write to a range). Explicitly including RevertFrom is unnecessary; I've removed it. Its value is just the MVCC timestamp at which the full-Range reversion was committed.

No value in the Range may have the same timestamp as the full-Range reversion. To see why this is true, consider: the full-Range reversion key is written using a span from start to end of the full Range in the command queue. It must choose a timestamp greater than the most recently written value (it uses the MVCCStats.LastUpdateNanos value). The next write, after the revert clears the command queue, will fail unless its timestamp is greater (anything <= will get a WriteTooOldError).
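
A compact sketch of that argument, with invented names and plain wall nanos: the reversion picks its timestamp strictly above the Range's high-water mark, and the WriteTooOldError rule then keeps every subsequent write strictly above the reversion.

```go
package sketch

import "fmt"

// reversionTimestamp picks the reversion's write timestamp strictly above
// every existing value, using the Range's wall-time high-water mark
// (MVCCStats.LastUpdateNanos in the RFC's terms).
func reversionTimestamp(lastUpdateNanos int64) int64 {
	return lastUpdateNanos + 1
}

// checkLaterWrite models the rule for writes once the reversion commits
// and clears the command queue: anything at or below the reversion's
// timestamp fails with WriteTooOldError and retries above it, so no MVCC
// value can ever share the reversion's timestamp.
func checkLaterWrite(writeNanos, reversionNanos int64) error {
	if writeNanos <= reversionNanos {
		return fmt.Errorf("WriteTooOldError: %d <= reversion at %d", writeNanos, reversionNanos)
	}
	return nil
}
```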


docs/RFCS/revert.md, line 163 at r6 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I got confused again by "just like the underlying KV data". What's tripping me up is that I'm interpreting that to mean "just like the underlying KV data is read today" while you're meaning that it will be read by recursively applying the "reading with full-range reversions" below.

OK, I've cleaned up the wording here.


docs/RFCS/revert.md, line 265 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Is this section aware of the fact that reversions can be reverted?

Yes, I've simplified the algorithm dramatically so it's easier to reason about correctness.


docs/RFCS/revert.md, line 405 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

and if there's an intent? You don't want to duplicate that. Do you not allow splits while there's an intent?

Added a note to that effect.


docs/RFCS/fast_delete_range.md, line 362 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

DROP/TRUNCATE TABLE cannot issue Revert commands as long as there are nodes in the cluster that do not understand that command, so a migration strategy is still necessary.

I see your point. Added a bullet in "Unresolved Questions" section. Will have to think more about the migration. Seems like we need either an upgrade command or something not yet dreamed up.


docs/RFCS/fast_delete_range.md, line 133 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Tobi's suggestion was From <= To; as of r6 this RFC is using To <= From.

To <= From makes more sense; see earlier comment.


docs/RFCS/fast_delete_range.md, line 239 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Where would that scan and retry occur?

Actually I think it would be preferable to return an error indicating the range is busy. Updated to reflect this.


Comments from Reviewable

@andreimatei
Contributor

Review status: 1 of 2 files reviewed at latest revision, 29 unresolved discussions, some commit checks failed.


docs/RFCS/revert.md, line 10 at r7 (raw file):

# Summary

This RFC describes a new `Revert` KV operation for efficient,

Both use cases in this RFC (restore and truncate) assume that you're restoring to a timestamp that's within the data's TTL, right?
As I was discussing in the thread below, I think we should try to get out of the business of large TTLs.
https://forum.cockroachlabs.com/t/should-we-significantly-reduce-the-default-ttl/794/6


Comments from Reviewable

@bdarnell
Contributor

PIT recovery is an important component of most disaster recovery plans.

Some form of PIT recovery may be a necessary feature, but in many (most?) databases this is implemented by restoring from a backup and replaying logs, instead of some sort of O(1) MVCC magic (Oracle appears to be an exception with their Flashback feature). O(1) PIT recovery is a cool feature, but there's an alternative design that builds on change feeds and backup/restore (thereby overlapping with work that we're doing anyway).


Review status: 1 of 2 files reviewed at latest revision, 29 unresolved discussions, some commit checks failed.


docs/RFCS/revert.md, line 128 at r6 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Yes, these are the other way around. It makes more sense to me that RevertFrom > RevertTo for the case of "revert". You'd be mangling the English language to make RevertTo refer to the later timestamp. If you changed the sign outside your office from "Tobias" to "T. Schottdorf", and then reconsidered, it would be correct to say, 'I'm going to revert my office sign from "T. Schottdorf" to "Tobias".'

In any case, I've removed the RevertFrom timestamp, as that was just redundant with the timestamp at which the MVCC value for the most recent reversion was written.

FWIW my linguistic intuition matches Tobi's. In the context of this protobuf, the English usage "from x to y" implying "x < y" is stronger than the association with "revert from x to y".


docs/RFCS/revert.md, line 49 at r7 (raw file):

consuming and require the database to be offline.

Point in time recovery in CockroachDB cannot easily be implemented

It's not easy to implement a log that can be played back, but it's work that we're already planning to do for change data capture (which has value beyond PIT recovery)


docs/RFCS/revert.md, line 302 at r7 (raw file):

  • If there are any reversions more recent than the threshold, skip GC

So regular use of reversion could delay GC indefinitely?


docs/RFCS/revert.md, line 439 at r7 (raw file):

Switching IDs in the schema won't work for point-in-time recovery.

# Unresolved questions

Another question which needs to be resolved: how does this interact with schema changes? What happens if you revert a table into an earlier phase of the schema change process?


docs/RFCS/revert.md, line 441 at r7 (raw file):

# Unresolved questions

- This change will require a migration step because nodes running the

I think we need at least an outline of the migration process before this RFC can be approved.


Comments from Reviewable

@spencerkimball spencerkimball force-pushed the spencerkimball/delete-range-rfc branch 2 times, most recently from 4d346ed to f046610 Compare July 17, 2017 21:26
@spencerkimball
Member Author

As stated in the RFC, most PIT recovery solutions require a quiesced database, a restore from backup, and a replay of logs up to the desired timestamp. While this will be possible with the change feeds feature, not being able to do it in situ, and quickly, makes it a sub-optimal solution. That log-replay approach, I'd argue, should be the "core" solution to PIT recovery; what's presented in this RFC is an enterprise feature.


Review status: 1 of 2 files reviewed at latest revision, 29 unresolved discussions, some commit checks failed.


docs/RFCS/revert.md, line 128 at r6 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

FWIW my linguistic intuition matches Tobi's. In the context of this protobuf the english usage "from x to y" implying "x < y" is stronger than the association with "revert from x to y".

http://sentence.yourdictionary.com/revert

In the first five I bothered to read, the use of the verb revert indicates moving to an earlier state in time.

I don't see how an English speaker, seeing this struct is a reversion, would associate "to" with the most recent timestamp which is being reverted.


docs/RFCS/revert.md, line 10 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Both use cases in this RFC (restore and truncate) assume that you're restoring to a timestamp that's within in the data's TTL, right?
As I was discussing in the thread below, I think we should try to get out of the business of large TTLs.
https://forum.cockroachlabs.com/t/should-we-significantly-reduce-the-default-ttl/794/6

We might set the default TTL lower than 1d, but I think the main reasons suggested in the forum thread you've linked here are better addressed via other means. For example, recommending that high-churn tables manually specify a lower TTL or even providing new GC options, such as a maximum number of versions allowed, so the cost of GC is amortized and proportional to write volume.


docs/RFCS/revert.md, line 49 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It's not easy to implement a log that can be played back, but it's work that we're already planning to do for change data capture (which has value beyond PIT recovery)

I just removed the sentence. It's not necessary to support the point that a fast, in-situ solution is possible via MVCC data we're already keeping.


docs/RFCS/revert.md, line 302 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

So regular use of reversion could delay GC indefinitely?

Yes. There are ways to get fancier, but this seemed like a reasonable starting point that is easy to reason about.


docs/RFCS/revert.md, line 439 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Another question which needs to be resolved: how does this interact with schema changes? What happens if you revert a table into an earlier phase of the schema change process?

I added this as an unresolved question.


docs/RFCS/revert.md, line 441 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think we need at least an outline of the migration process before this RFC can be approved.

See #16977 for a migration process. I moved this from the Unresolved section into Detailed Design.


Comments from Reviewable

@bdarnell
Contributor

The fact that most PIT recovery solutions are very disruptive indicates that while some form of PIT recovery may be a requirement, fast PIT recovery is just nice-to-have. I still believe that this will be an expensive and destabilizing change that is not a good investment at this time.


Review status: 1 of 2 files reviewed at latest revision, 27 unresolved discussions, some commit checks failed.


docs/RFCS/revert.md, line 128 at r6 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

http://sentence.yourdictionary.com/revert

In the first five I bothered to read, the use of the verb revert indicates moving to an earlier state in time.

I don't see how an English speaker, seeing this struct is a reversion, would associate "to" with the most recent timestamp which is being reverted.

Now that it's RevertTo alone, I agree. But when it was RevertTo and RevertFrom, the to/from pairing dominated in my mind. Different people read things differently.


docs/RFCS/revert.md, line 405 at r6 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Added a note to that effect.

This should be added as a drawback.


docs/RFCS/revert.md, line 302 at r7 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Yes. There are ways to get fancier, but this seemed like a reasonable starting point that is easy to reason about.

This should be added as a drawback.


docs/RFCS/revert.md, line 439 at r7 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I added this as an unresolved question.

It's a big question to leave unresolved; I think we're at least going to have to disallow reverts "into" a schema change, which is a pretty big caveat.


Comments from Reviewable

@tbg
Member

tbg commented Jul 18, 2017

I have to agree with Ben that the opportunity cost is very high, in particular since the interaction with all existing and future features is tricky. I share his view that we should not implement this in the near future.


Reviewed 1 of 1 files at r8.
Review status: all files reviewed at latest revision, 26 unresolved discussions, some commit checks failed.


Comments from Reviewable

@petermattis
Collaborator

What is the near future? Revert isn't happening for 1.1. The open question is whether it should be restarted for 1.2 or some future release, or whether we should burden change feeds with supporting PIT recovery.


Review status: all files reviewed at latest revision, 28 unresolved discussions, some commit checks failed.


docs/RFCS/revert.md, line 86 at r8 (raw file):

**Side note**: The difference between `TRUNCATE TABLE` and `DROP
TABLE` is that `TRUNCATE TABLE` must perform all row deletions
within a single transaction and so will break for large tables. By

This is inaccurate. TRUNCATE TABLE was previously transactional, but there was no need for it to be. TRUNCATE TABLE in Postgres is explicitly not transactional.

I'd nix this whole section about DROP TABLE and TRUNCATE TABLE.


docs/RFCS/revert.md, line 441 at r8 (raw file):

SQL layer to swap in a new ID. This would work well for the case of
blind puts, which would continue to be efficient even after the
truncate.

I believe the DROP and TRUNCATE TABLE work is now done.


Comments from Reviewable

@bdarnell
Contributor

I haven't looked at everything that's on the table for 1.2, but I doubt there's going to be room for this. Change feeds are definitely higher priority than this in my book (in their own right), so I'd be in favor of relying on them for point-in-time recovery in the 1.2 time frame.

@spencerkimball spencerkimball force-pushed the spencerkimball/delete-range-rfc branch from f046610 to bd6328f Compare July 19, 2017 19:04
@spencerkimball
Member Author

This doesn't feel like the right forum for a discussion about when this RFC should be implemented. Please limit comments to the technical aspects of the RFC.


Review status: all files reviewed at latest revision, 28 unresolved discussions, some commit checks failed.


docs/RFCS/revert.md, line 405 at r6 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This should be added as a drawback.

Added a general drawback covering the interaction between revert and intents.


docs/RFCS/revert.md, line 302 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This should be added as a drawback.

Done.


docs/RFCS/revert.md, line 439 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It's a big question to leave unresolved; I think we're at least going to have to disallow reverts "into" a schema change, which is a pretty big caveat.

I moved this into Detailed Design, but it still needs work. Really we need to copy the schema as it existed at the RevertTo timestamp forward to the current timestamp. This is complicated by how the table leases interact with the Revert transaction.

As a first step, the practical solution is to prevent a Revert to a timestamp earlier than the most recent schema change.
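
A minimal sketch of that first-step guard (hypothetical names; the real check would live wherever Revert consults the table descriptor):

```go
import (
	"github.com/cockroachdb/cockroach/pkg/util/hlc"
	"github.com/pkg/errors"
)

// checkRevertTimestamp is illustrative only: refuse a revert whose
// target timestamp predates the table's most recent schema change.
func checkRevertTimestamp(revertTo, lastSchemaChange hlc.Timestamp) error {
	if revertTo.Less(lastSchemaChange) {
		return errors.Errorf(
			"cannot revert to %s: schema changed at %s", revertTo, lastSchemaChange)
	}
	return nil
}
```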


docs/RFCS/revert.md, line 86 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This is inaccurate. TRUNCATE TABLE was previously transactional, but there was no need for it to be. TRUNCATE TABLE in Postgres is explicitly not transactional.

I'd nix this whole section about DROP TABLE and TRUNCATE TABLE.

Done.


docs/RFCS/revert.md, line 441 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I believe the DROP and TRUNCATE TABLE work is now done.

Done.


Comments from Reviewable

@knz
Contributor

knz commented Aug 25, 2017

reminder: don't forget to rebase if you plan to continue work on this, and pick up the new RFC sections + naming convention.
