
storage: tolerate missing transaction records when pushing #33523

Merged
merged 3 commits
Jan 10, 2019

Conversation

nvanbenschoten
Member

@nvanbenschoten nvanbenschoten commented Jan 5, 2019

Informs #25437.
Informs #32971.

This is the second part of addressing #32971. The final part will be updating the txn client to stop sending BeginTxn requests.

The change closely resembles what is laid out in the corresponding RFC sections (PushTxn and QueryTxn / txnwait.Queue). Both PushTxn requests and QueryTxn requests are adjusted to tolerate missing transaction records and to synthesize them based on the information pulled from intents if necessary. By hiding these details behind the request API abstraction, we don't actually need to modify the txnwait.Queue at all!

The change does diverge from the RFC in two distinct ways. The first is that it introduces a new invariant about transaction record creation. Transaction records can now only ever be created by requests originating from their own coordinator (BeginTxn, HeartbeatTxn, and EndTxn). They can never be created in any state (not even ABORTED) by concurrent actors. This is a much stronger invariant than what previously existed (and even what was proposed in the RFC), and it simplifies a lot of logic by dramatically refining the transaction state machine. We now know that for a transaction record to exist, it must have been created by its own coordinator and it must have checked with CanCreateTxnRecord.

Making this new invariant work required the second major divergence from the RFC. The RFC suggested that we use the read timestamp cache for successful transaction timestamp pushes before a transaction record is written. This PR takes this a step further and uses the write timestamp cache for successful transaction aborts before a transaction record is written. In doing this, we emerge with a very sane timestamp cache policy - the read timestamp cache is used to push transaction timestamps and the write timestamp cache is used to abort them entirely (which it was already essentially being used for). The txnID marker on these timestamp cache entries becomes the transaction that initiated the push/abort. Transactions then consult both of these sources before creating their transaction record in CanCreateTxnRecord. In doing this, we avoid an entire class of situations where concurrent transactions created and abandoned transaction records for another transaction, requiring eventual GC.
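The timestamp cache policy described above can be sketched in Go. This is a toy model under assumed names, not the real CockroachDB code: `Timestamp`, `canCreateTxnRecord`, and the cache-entry parameters are illustrative stand-ins for `hlc.Timestamp` and the logic that lives in `pkg/storage/replica_tscache.go`.

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for CockroachDB's hlc.Timestamp.
type Timestamp struct{ WallTime int64 }

func (t Timestamp) Less(o Timestamp) bool { return t.WallTime < o.WallTime }

// Forward returns the later of the two timestamps.
func (t Timestamp) Forward(o Timestamp) Timestamp {
	if t.Less(o) {
		return o
	}
	return t
}

// canCreateTxnRecord sketches the policy from the PR description (names are
// illustrative): a write timestamp cache entry at or above the transaction's
// minimum timestamp means a concurrent actor already aborted the transaction
// before its record was written, while a read timestamp cache entry simply
// forwards the minimum timestamp at which the record may commit.
func canCreateTxnRecord(txnMinTS, provisionalTS, readTSCacheEntry, writeTSCacheEntry Timestamp) (ok bool, minCommitTS Timestamp) {
	if !writeTSCacheEntry.Less(txnMinTS) {
		return false, Timestamp{} // aborted by a concurrent pusher
	}
	return true, provisionalTS.Forward(readTSCacheEntry) // timestamp push absorbed
}

func main() {
	// A push bumped the read cache to 15; the record can still be created,
	// but only at timestamp 15 or above.
	ok, minTS := canCreateTxnRecord(Timestamp{10}, Timestamp{10}, Timestamp{15}, Timestamp{0})
	fmt.Println(ok, minTS.WallTime)

	// A write cache entry at the txn's minimum timestamp means a
	// PushTxn(ABORT) already aborted the record before it existed.
	ok, _ = canCreateTxnRecord(Timestamp{10}, Timestamp{10}, Timestamp{0}, Timestamp{10})
	fmt.Println(ok)
}
```

The point of the sketch is the split of responsibilities: the read cache only delays the record (pushes its commit timestamp), while the write cache denies it outright.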

@nvanbenschoten nvanbenschoten requested review from andreimatei, tbg and a team January 5, 2019 04:08
@nvanbenschoten nvanbenschoten requested a review from a team as a code owner January 5, 2019 04:08
@cockroach-teamcity
Member

This change is Reviewable

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/lazyPt2 branch 4 times, most recently from 1f1bc53 to e1201df on January 5, 2019 21:30
@nvanbenschoten
Member Author

nvanbenschoten commented Jan 7, 2019

Here's a state transition diagram that reflects the transaction state machine after this change. Note that BeginTxn, HeartbeatTxn, and EndTxn are now the only methods to transition from a missing transaction record to a transaction record. That's a major simplification made by this PR, which was permitted by using the timestamp cache marginally more often.

+-----------------------------------+
|vars                               |
|-----------------------------------|
|v1 = rTSCache[txn.id]   = timestamp|
|v2 = wTSCache[txn.id]   = timestamp|
|v3 = txnSpanGCThreshold = timestamp|
+-----------------------------------+

                  PushTxn(TIMESTAMP)                               HeartbeatTxn
                  then: v1 = push.ts                            then: update record
                     +------+                                        +------+
                     |      |        BeginTxn or HeartbeatTxn        |      |
                     |      v        if: v2 == nil                   |      v
  PushTxn(ABORT)  +---------------+    & v3 < txn.orig         +--------------------+ PushTxn(TIMESTAMP) 
then: v2 = txn.ts |               |  then: txn.ts.forward(v1)  |                    | then: update record
             +----|               |  else: fail                |                    |----+
             |    | no txn record | -------------------------> | txn record written |    |
             |    |               |                            |     [pending]      |    |
             +--->|               |                            |                    |<---+
                  +---------------+                            +--------------------+
                    ^          ^     \                                /           /
                    |           \     \  EndTxn                      / EndTxn    /
                    |            \     \  if: same conditions       / then: v2 = txn.ts
                    |             \     \     as above             /           /
                    |     Eager GC \     \ then: v2 = txn.ts       /           /
                    |    if: EndTxn \     \ else: fail           /           / PushTxn(ABORT)
                    |    transition  \     v                    v           /                         
                    |         taken   \    +--------------------+          /
                    |                  \   |                    |         /
                     \                  +--|                    |        /
                      \                    | txn record written |<------+
                       +-------------------|     [finalized]    |
                         GC queue          |                    |
                       then: v3 = now()    +--------------------+

The diagram also reveals a redundancy that we could consider simplifying in a later change. If we had PushTxn(ABORT) update the write timestamp cache then we wouldn't need TxnSpanGCThreshold at all.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/lazyPt2 branch 2 times, most recently from b280c0b to 6a9ef2e on January 8, 2019 01:14
@nvanbenschoten
Member Author

I rebased this on top of #33396 and updated it with a few leftover comments from that PR. It's now ready for a look!

Contributor

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica_tscache.go, line 35 at r1 (raw file):

) {
	readOnlyUseReadCache := true
	if r.store.Clock().MaxOffset() == timeutil.ClocklessMaxOffset {

what's the story with this "clockless mode"? Was someone (Ben?) recently trying to delete all the code around it?
If we don't wipe it all, what's the deal with this patch? Does it... break anything?


pkg/storage/txnwait/txnqueue.go, line 466 at r2 (raw file):

	pusheePriority := req.PusheeTxn.Priority

	var pusheeTxnTimer timeutil.Timer

this commit is just cleanup unrelated to the following one, right?
Why do you say that this could result in starvation if the clock is bad?

Contributor

@andreimatei andreimatei left a comment


I really like that diagram!
A few things for now; will look more.



pkg/roachpb/api.go, line 1031 at r3 (raw file):

func (*HeartbeatTxnRequest) flags() int        { return isWrite | isTxn }
func (*GCRequest) flags() int                  { return isWrite | isRange }
func (*PushTxnRequest) flags() int             { return isWrite | isAlone | updatesReadTSCache }

would you mind adding a comment to the new updatesTSCache explaining that the push does it if it doesn't find a txn record but doesn't do it if it does find it (if indeed what you described in the diagram is real)


pkg/roachpb/errors.proto, line 127 at r3 (raw file):

  // A BeginTransaction, HeartbeatTxn, or EndTransaction(commit=true) request
  // found an aborted transaction record. Another txn must have written this

wait - the "another txn" part is now false, isn't it?

And the "before out txn record was written" part is false... I think this whole comment should be rewritten, no?


pkg/roachpb/errors.proto, line 156 at r3 (raw file):

  ABORT_REASON_ABORT_SPAN = 5;

  // A requests attempting to create a transaction record encountered a

s/requests/request


pkg/roachpb/errors.proto, line 157 at r3 (raw file):

  // A requests attempting to create a transaction record encountered a
  // timestamp cache entry for the txn key, and the entry identifies this

s/timestamp cache entry/write timestamp cache entry (right?)


pkg/storage/gc_queue_test.go, line 718 at r3 (raw file):

			txn.LastHeartbeat = hlc.Timestamp{WallTime: test.hb}
		}
		// Set a high Timestamp to make sure it does not matter. Only

do you remember how come you had to change this?


pkg/storage/replica_tscache.go, line 293 at r3 (raw file):

	// Also look in the write timestamp cache to see if there is an entry for
	// this transaction, which would indicate this transaction has already been
	// finalized or was already aborted by a conccurent transaction. If there is

conccurent - you need to double something, but not anything. Romanians have the same struggles.


pkg/storage/replica_tscache.go, line 302 at r3 (raw file):

	// written intents at. We can't compare against the provisional commit
	// timestamp because it may have moved forward over the course of a single
	// epoch and we can't compare against the current epoch's original timestamp

nit: s/current epoch's original timestamp/(current epoch's) OrigTimestamp

Also this comment about OrigTimestamp I think belongs in the method comment, not here (here there's no OrigTimestamp to speak of even if we did want to compare against it). And I wanted to ask for hints about what the txnMinTSUpperBound is about anyway in the method comment. Or rephrase the comment completely - see next.

Also :) - I think the part about "the minimum timestamp that the transaction could have written intents at" deserves more explanation. 'Cause this is all elegant and interesting, but... complex. Say something about how the check needs to be paired up with the actors who could have possibly intended to leave a marker in the wtscache. What timestamps would those actors have used? They would have used an intent's timestamp (for push[abort]) or the epochZeroOrigTS for commits (presumably... but I haven't found that code).


pkg/storage/replica_tscache.go, line 320 at r3 (raw file):
nit:

			// aborted us before our transaction record was written. It obeyed
			// the invariant that it couldn't create a transaction record for

I don't know if the use of an "invariant" really works here. An invariant is more like a property that all states in a state space share, not a statement about who does what. Like, the gossip module doesn't say it has an invariant that no other code starts randomly sending gossip messages.
But do what you will.


pkg/storage/replica_tscache.go, line 339 at r3 (raw file):

	}

	return true, minCommitTS, 0

0?!? Use a constant.


pkg/storage/batcheval/cmd_end_transaction.go, line 353 at r3 (raw file):

	// will be too old. Replays of requests which attempt to create a new txn
	// record (BeginTransaction, HeartbeatTxn, or EndTransaction) never succeed
	// because EndTransaction inserts in the write timestamp cache, forcing the

would you mind pointing in this comment (or just telling me) where exactly the wTSCache is updated here?


pkg/storage/batcheval/eval_context.go, line 68 at r3 (raw file):

	ContainsKey(key roachpb.Key) bool

	// CanCreateTxnRecord determines whether a transaction record can be created

nit: say that this needs to be called before a txn record is written, and say that it protects against replays (Begin/Heartbeat after Commit/Rollback/Push[Abort]) and against committing transactions that have already been pushed / aborted by others.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jan 9, 2019
Based on cockroachdb#33523.
Closes cockroachdb#25437.

This is the final part of addressing cockroachdb#32971.

The change gates the use of BeginTransaction on a cluster version.
When a cluster's version is sufficiently high, it will stop sending
them.

To make this work, a small change was needed for the detection of 1PC
transactions. We now use sequence numbers to determine that a batch
includes all of the writes in a transaction. This is in line with the
proposal from the RFC's `1PC Transactions` section and will work
correctly with parallel commits.

Release note (performance improvement): Reduce the network and storage
overhead of multi-Range transactions.
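The 1PC detection idea from the commit message above can be sketched as follows. This is an illustrative model, not the real CockroachDB code (`batchIsCompleteTxn` is a hypothetical name): a batch can commit in one phase only if it carries a write for every sequence number the transaction has issued.

```go
package main

import "fmt"

// batchIsCompleteTxn reports whether a batch containing writes with the
// given sequence numbers holds every write of a transaction whose highest
// issued sequence number is maxSeq. If so, the batch is a candidate for a
// one-phase commit; if any earlier sequence number is missing, some write
// went to another range in an earlier batch.
func batchIsCompleteTxn(seqs []int32, maxSeq int32) bool {
	seen := make(map[int32]bool, len(seqs))
	for _, s := range seqs {
		seen[s] = true
	}
	for s := int32(1); s <= maxSeq; s++ {
		if !seen[s] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(batchIsCompleteTxn([]int32{1, 2, 3}, 3)) // all writes present
	fmt.Println(batchIsCompleteTxn([]int32{2, 3}, 3))    // an earlier write is missing
}
```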
Member

@tbg tbg left a comment


I might be able to answer this question myself, but perhaps it's worth discussing somewhere public:
Txn record creation is "blocked" via an entry into the write ts cache (the no record -> pushtxnabort -> pushtxn case) by writing now(). Is it possible that txn.orig is larger than now? After all, the txn may never have interacted with the node that houses the slot for the txn record. Right now, we anchor txns at the first write, but is this now critical for correctness? I think it'd be good to think through the case in which a txn record can be anywhere, including on ranges never touched by the txn otherwise. Are we relying on the push propagating the OrigTimestamp into the clock before generating now()?

Reviewed 1 of 1 files at r1, 1 of 1 files at r2.


pkg/roachpb/api.go, line 1031 at r3 (raw file):

func (*HeartbeatTxnRequest) flags() int        { return isWrite | isTxn }
func (*GCRequest) flags() int                  { return isWrite | isRange }
func (*PushTxnRequest) flags() int             { return isWrite | isAlone | updatesReadTSCache }

There's no flag for the wTSCache, right? It's unexpected to not see it here.


pkg/storage/txnwait/txnqueue.go, line 466 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

this commit is just cleanup unrelated to the following one, right?
Why do you say that this could result in starvation if the clock is bad?

I also don't understand how a manual clock (which I assume is what's meant) makes this loop hot. Perhaps now() is very close to expiration?

Contributor

@andreimatei andreimatei left a comment


The diagram also reveals a redundancy that we could consider simplifying in a later change. If we had PushTxn(ABORT) update the write timestamp cache then we wouldn't need TxnSpanGCThreshold at all.

Yeah, I happen to like this idea. Let's get rid of TxnSpanGCThreshold! Would you mind leaving a TODO on the definition of TxnSpanGCThreshold with your notes from here plus what we discussed: that we could simply have the GC process insert a span in the wtsc that covers all the (range-local) txn records. Although what would that give us exactly, besides not allowing people to write (new) txn records below this GC threshold (tbh, I'm unsure if disallowing that is desirable or not).
I might be inclined to pull on that string. I think it'd make the GC process not be something that we constantly have to remember about and take into account.

still looking



pkg/storage/replica_test.go, line 3195 at r3 (raw file):

// TestTxnRecordUnderTxnSpanGCThreshold verifies that aborting transactions
// which haven't written their initial txn record yet does not lead to
// anomalies. Precisely, verify that if the GC queue could potentially have

this comment is now stale, right?


pkg/storage/replica_test.go, line 3215 at r3 (raw file):

	pusher := newTransaction("pusher", key, 1, tc.Clock())

	// This pushee should never be allowed to write a txn record because it

stale comment


pkg/storage/replica_test.go, line 3297 at r3 (raw file):

	// a transaction record, it must first abort it, which is recorded in the
	// write timestamp cache. Should we remove this logic or do we still like it?
	// Removing it would allow new transactions beneath the GC threshold, but

As I was saying in a review comment, I'd move discussion of this good stuff to the definition of TxnSpanGCThreshold and perhaps link to that here.


pkg/storage/batcheval/cmd_end_transaction.go, line 353 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

would you mind pointing in this comment (or just telling me) where exactly the wTSCache is updated here?

you've told me, but I think a note here would be useful

Contributor

@andreimatei andreimatei left a comment




pkg/storage/replica_tscache.go, line 86 at r3 (raw file):

				// ensures that we can safely update the timestamp cache based
				// on this value.
				// TODO(nvanbenschoten): Is this necessary? Is this the right

@tbg, Nathan and I are discussing things here and would like an opinion.
The issue is that we're about to put something in the ts cache. That something needs to be <= the node's current clock.
The value that we're about to put is the timestamp to which we're pushing the pushee. This timestamp comes from the pusher (another node) is pushing to a timestamp <= its clock. The PushTxn request doesn't have a header.Txn filled in [1]. I think this means that it also doesn't have header.Timestamp filled in (as clients don't put timestamps on non-transactional requests; the timestamps for these requests are filled in by the store). So, I think this all means that somewhere we need to be doing r.store.Clock().Update(pushee.Timestamp), as the patch has below. I suggested we do that during evaluation, in evalPushTxn(). What do you think?

Moreover, the patch does reply.PusheeTxn.Timestamp = args.PushTo.Add(0, 1) in evalPushTxn because it might need to write the txn record with that timestamp. And so, particularly since we're generating a timestamp out of thin air, I think the node's clock should be updated immediately to account for that generated timestamp, right?

[1]

if cArgs.Header.Txn != nil {
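A minimal sketch of the clock-ratcheting behavior being discussed here. This is a toy hybrid-logical clock, not CockroachDB's `hlc` package: the point is that calling `Update` with the pushee's timestamp guarantees that later local clock readings (and hence timestamp cache entries derived from them) never fall below the observed remote timestamp.

```go
package main

import "fmt"

// Timestamp models an HLC timestamp with wall and logical components.
type Timestamp struct{ Wall, Logical int64 }

func (t Timestamp) Less(o Timestamp) bool {
	return t.Wall < o.Wall || (t.Wall == o.Wall && t.Logical < o.Logical)
}

// Clock is a toy hybrid-logical clock. Update forwards the clock past an
// observed remote timestamp so that later local readings never fall below it.
type Clock struct {
	physical func() int64 // physical time source
	latest   Timestamp    // highest timestamp observed or handed out
}

// Now returns a timestamp above both the physical clock and every
// previously observed timestamp.
func (c *Clock) Now() Timestamp {
	now := Timestamp{Wall: c.physical()}
	if !c.latest.Less(now) {
		// Physical time hasn't passed the latest observation; tick logically.
		now = Timestamp{Wall: c.latest.Wall, Logical: c.latest.Logical + 1}
	}
	c.latest = now
	return now
}

// Update ratchets the clock forward to account for a remote observation.
func (c *Clock) Update(remote Timestamp) {
	if c.latest.Less(remote) {
		c.latest = remote
	}
}

func main() {
	c := &Clock{physical: func() int64 { return 100 }}
	// Evaluation observes a pushee timestamp from another node that is
	// ahead of the local physical clock; after Update, a local reading
	// cannot come out below it.
	c.Update(Timestamp{Wall: 150})
	fmt.Println(c.Now().Less(Timestamp{Wall: 150})) // never regresses
}
```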


pkg/storage/batcheval/cmd_push_txn.go, line 138 at r3 (raw file):

		// To determine which case we're in, we check whether the transaction could
		// ever write a transaction record. We do this by using the metadata from
		// the intent and attempting to synthesize a transaction record (TODO?). If

TODO?

Contributor

@andreimatei andreimatei left a comment




pkg/storage/replica_tscache.go, line 86 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…


Maybe the argument that the node's clock should be below any logical timestamp we operate with is not quite founded - in applyTimestampCache we do nextTS := rTS.Next(), which I guess can return a value above the local clock, and we seem cool with that.

Contributor

@andreimatei andreimatei left a comment


LGTM !



pkg/storage/batcheval/transaction.go, line 135 at r3 (raw file):

	// Determine whether the transaction record could ever actually be written
	// in the future. We provide the intent's timestamp as the upper bound on

"intent"? What "intent"? Perhaps put more hints in the function comment about where this meta is coming from.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/lazyPt2 branch 2 times, most recently from 454d664 to 275bcc6 on January 9, 2019 22:42
Member Author

@nvanbenschoten nvanbenschoten left a comment


Thanks for the reviews here!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica_test.go, line 3195 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

is cleaned up

Done.


pkg/storage/replica_test.go, line 4740 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

But it doesn't verify that it's finalized, it populates the write ts and pretends the txn is aborted, right? This comment somehow makes it seem like the push would return a COMMITTED proto based on some intel from the tsCache.

Done.


pkg/storage/replica_test.go, line 4955 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Why did this get removed?

I think I was mixing up reply.Txn and reply.PusheeTxn. The test still makes sense. I reverted my changes here and fixed the test instead.


pkg/storage/replica_tscache.go, line 86 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Yes, this (calling .Next() and thus leaking a possible future timestamp) is a problem that we should address (though it's hopefully only a theoretical one). The problem is likely theoretical because reading a clock value from the hlc already also increments the logical (if it doesn't already do it to the wall time), so you need multiple logical ticks to pile up for something to get out of order. The code for the fix would be

nextTS := rTS.Next()
hlc.Update(nextTS)

but it's starting to really irk me that we're grabbing the hlc lock in so many places (this can be optimized if we had a "slightly stale" reading of the clock, because most of the time nextTS is below anyway and so nothing needs to be updated).

The PushTxn issue is also an issue. The fact that we upgrade the hlc "manually" is bad. The way this could work conceptually is that each request to another node has a hlc.Now() reading that is applied at the recipient before doing anything else with that message.

I think the discussion should move to an issue, would you mind filing one if you haven't already?

I just opened #33632. We can discuss further there.


pkg/storage/replica_tscache.go, line 90 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I wonder what could happen without this. We'll put something in the timestamp cache that is a valid read of some node's clock, but perhaps not this one. We won't cause cascading updates (the original mantra of "never put manual timestamps into the clock" came from the manual offset looping around multiple times, until it exceeded MaxOffset and caused trouble), and we may return a lower observed timestamp than we would otherwise. Lowering an observed timestamp is only a problem if in doing so we let a reader txn avoid an uncertainty restart, but that presupposes an intent above the accidentally "lower" observed timestamp. I think in the end this update doesn't matter for correctness (what really matters is that writes populate the hlc correctly with their provisional write ts), but it sure is hard to reason about it all.

My fear here was that without this the node could have a lower clock than the pushee's pushed timestamp. It could then transfer its lease and the new leaseholder could create a low water mark entry at a lower timestamp than pushee.Timestamp. The transaction record could then be created with too low of a timestamp.


pkg/storage/replica_tscache.go, line 241 at r6 (raw file):

Could you make v2 == nil more precise? The timestamp cache always has a low water mark, so you're going to get some timestamp. I assume you want v2 >= txn.orig?

Done.

= txn.ts would be better for long-running txns, but probably incorrect

Yes, this is exactly why this PR diverges from the RFC and performs the validation when creating transaction records instead of when committing them. This gives a transaction an upper-bound on its window of vulnerability instead of a window that scales proportionally with the duration of the transaction.

Also, do you have heuristics on how fast the timestamp cache's write ts rises? With all commits populating it, it'll be at a steady pace.

No faster than the read timestamp cache rising on every read, and that will have roughly the same effect. This effect also isn't new, so I'm not too worried about that causing a sudden regression.

Here's the math though:

interval skl key size = size(LocalTransactionSuffix) + size(anchor key) + size(uuid) = 20 + size(anchor key)
interval skl val size = encodedValSize = 32
interval skl node size = 32 (assuming an average tower height of 2)
(assuming 64 byte keys) interval skl entry = 84 + 32 + 32 = 148

interval skl page size = 32 MB
entries per page = 32 MB / 148 = 226,719 txns per page

so in order to reach the crossover point where the MinRetentionWindow of 10s is needed and we push past the standard 2 skl pages, we need to be performing 45,000 tps (non-1PC) on a given node's leaseholders. That's not crazy high, but I also don't expect to hit that any time soon.
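The back-of-the-envelope figures above can be checked mechanically. This is just the arithmetic from the comment restated (the 64-byte anchor key and the per-component sizes are its assumptions):

```go
package main

import "fmt"

func main() {
	// Per-entry size: (LocalTransactionSuffix + uuid = 20 bytes) plus an
	// assumed 64-byte anchor key, the 32-byte encoded value, and a 32-byte
	// skiplist node (average tower height of 2).
	keySize := 20 + 64
	entry := keySize + 32 + 32
	pageBytes := 32 << 20 // 32 MB interval skiplist page
	perPage := pageBytes / entry
	fmt.Println(entry, perPage)

	// Two pages cover the 10s MinRetentionWindow up to this non-1PC txn
	// rate on a node's leaseholders.
	fmt.Println(2 * perPage / 10)
}
```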


pkg/storage/replica_tscache.go, line 254 at r6 (raw file):

Is it intentional that you're not updating v2 in the PushTxn(ABORT) on the right

No, that was a typo. We do this already in this PR.

but I'm curious if you've considered eager GC unconditionally which seems possible if you do update v2 on all aborting pushes

I haven't thought that all the way through to its conclusion, but I like this idea. This also ties in to the TODOs I scattered around with removing the TxnSpanGCThreshold checks. I added the idea to the TODO.

It would also simplify batch gc of transactions because after my suggestion, any finalized txn record can simply be removed because they're always protected by wTS, and we don't have to pessimistically check/populate the ts cache as we remove the records. It's also a nice invariant that finalized records are always protected by v2.

Yes, I completely agree. I'm going to save some of this for a follow-up PR, but you're right that there's still room to simplify.


pkg/storage/replica_tscache.go, line 312 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Or if the cache got rolled over due to size.

Done.


pkg/storage/replica_tscache.go, line 322 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Is this reason adequately named and is it returned anywhere else? Consider renaming and/or adding a new value that disambiguates this case.

Yes, it's the exact same reason we return when an actual ABORTED transaction record is found.


pkg/storage/batcheval/cmd_push_txn.go, line 76 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Is this important? Would things go awry if that changed in the future?

Yes, but it's not a correctness issue. Added to this comment.


pkg/storage/batcheval/cmd_push_txn.go, line 79 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Mention that the expiration is "courtesy" -- if we'd simply abort txns right away that would usually result in worse performance, but it would still be correct.

Good point, done.


pkg/storage/batcheval/cmd_push_txn.go, line 91 at r6 (raw file):

How does it know that the coordinator has found out

Why does it care whether the coordinator has found out?


pkg/storage/batcheval/cmd_push_txn.go, line 264 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

for a

Done.


pkg/storage/batcheval/eval_context.go, line 69 at r6 (raw file):

isn't that a lower bound

I think it's an upper bound. It's bounding the possible values of the transactions minimum timestamp from the top.

Also clarify that it's write timestamp.

It doesn't really need to be the write timestamp. For instance, a transaction may never write at its OrigTimestamp.

It's weird that a bunch of info is here and then the state diagram is on the implementation, I went and looked up this comment because I wanted to understand the txnMinTSUpperBound. Consider moving both explanations in one place.

Done.


pkg/storage/batcheval/eval_context.go, line 83 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This makes it sound like there's nothing "correct" that they know. Hint at what they should pass instead.

Done.


pkg/storage/batcheval/transaction.go, line 117 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

It also couldn't be written as ABORTED, right? (Not that that's important)

I think it could be. We allow rollbacks through.

@nvanbenschoten
Copy link
Member Author

bors r+

craig bot pushed a commit that referenced this pull request Jan 10, 2019
33523: storage: tolerate missing transaction records when pushing r=nvanbenschoten a=nvanbenschoten


Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

craig bot commented Jan 10, 2019

Build succeeded

@craig craig bot merged commit 3ac97e6 into cockroachdb:master Jan 10, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jan 10, 2019
Based on cockroachdb#33523.
Closes cockroachdb#25437.

This is the final part of addressing cockroachdb#32971.

The change gates the use of BeginTransaction on a cluster version.
When a cluster's version is sufficiently high, it will stop sending
them.

To make this work, a small change was needed for the detection of 1PC
transactions. We now use sequence numbers to determine that a batch
includes all of the writes in a transaction. This is in line with the
proposal from the RFC's `1PC Transactions` section and will work
correctly with parallel commits.

Release note (performance improvement): Reduce the network and storage
overhead of multi-Range transactions.
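The sequence-number check described above can be sketched as follows. This is illustrative only; the real detection lives in the server's 1PC evaluation path and operates on `roachpb` request types:

```go
package main

import "fmt"

// containsAllTxnWrites reports whether a batch that ends in EndTransaction
// carries every write the transaction has ever issued: the writes' sequence
// numbers must start at 1 and be contiguous, so no write can have been sent
// in an earlier batch.
func containsAllTxnWrites(writeSeqs []int32) bool {
	for i, seq := range writeSeqs {
		if seq != int32(i+1) {
			return false
		}
	}
	return len(writeSeqs) > 0
}

func main() {
	fmt.Println(containsAllTxnWrites([]int32{1, 2, 3})) // true: batch holds the full history
	fmt.Println(containsAllTxnWrites([]int32{2, 3}))    // false: seq 1 went in an earlier batch
}
```

Unlike the old `BeginTransaction`-based detection, this works even when no request explicitly marks the start of the transaction, which is what allows the client to stop sending `BeginTxn` entirely.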
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jan 10, 2019
craig bot pushed a commit that referenced this pull request Jan 10, 2019
33566: kv: stop using BeginTransaction requests r=nvanbenschoten a=nvanbenschoten


Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/lazyPt2 branch January 11, 2019 00:30

@tbg tbg left a comment


Reviewed 19 of 19 files at r7, 1 of 1 files at r8, 19 of 19 files at r9.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica_tscache.go, line 90 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

My fear here was that without this the node could have a lower clock than the pushee's pushed timestamp. It could then transfer its lease and the new leaseholder could create a low water mark entry at a lower timestamp than pushee.Timestamp. The transaction record could then be created with too low of a timestamp.

👍


pkg/storage/replica_tscache.go, line 241 at r6 (raw file):

instead of when committing them

ah, I had forgotten about that. Makes sense and I'm not concerned about the timestamp cache in that case.


pkg/storage/replica_tscache.go, line 322 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Yes, it's the exact same reason we return when an actual ABORTED transaction record is found.

You don't think it's worth having a separate reason for this? (Or more generally, metrics)?


pkg/storage/batcheval/cmd_push_txn.go, line 91 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

How does it know that the coordinator has found out

Why does it care whether the coordinator has found out?

If the coordinator is still running, the txn might try to read its values, but the intents are gone and we might've cleaned up the abort span as well.


pkg/storage/batcheval/cmd_push_txn.go, line 272 at r9 (raw file):

	// If the transaction record was already present, persist the updates to it.
	// If not, then we don't want to create it. This could allow for finalized
	// transactions to be revived. Instead, we obey the invariant that only the

Only if we created a PENDING one. We could create ABORTED records, it just doesn't seem to make much sense.


pkg/storage/batcheval/eval_context.go, line 69 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

isn't that a lower bound

I think it's an upper bound. It's bounding the possible values of the transaction's minimum timestamp from the top.

Also clarify that it's write timestamp.

It doesn't really need to be the write timestamp. For instance, a transaction may never write at its OrigTimestamp.

It's weird that a bunch of info is here and then the state diagram is on the implementation, I went and looked up this comment because I wanted to understand the txnMinTSUpperBound. Consider moving both explanations in one place.

Done.

A lower bound on a set (plus partial order) is an element that is less than or equal to any element of the set. It still seems to me that you want a lower bound. As an example, hlc.Timestamp{} is a lower bound but hlc.MaxTimestamp is an upper bound.
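For reference, the order-theoretic definitions being debated here, stated in LaTeX:

```latex
b \text{ is a lower bound of } S \iff \forall s \in S,\; b \le s
\qquad
u \text{ is an upper bound of } S \iff \forall s \in S,\; s \le u
```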

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 11, 2019
Cleanup from cockroachdb#33523.

This commit weakens the semantics of `Replica.CanCreateTxnRecord` slightly to
allow it to ignore the `TxnSpanGCThreshold`. `CanCreateTxnRecord` no longer
consults the `TxnSpanGCThreshold`, so it doesn't prevent the creation of new
transaction records beneath it. That's ok though, because we still prevent the
recreation of aborted and GCed transaction records, which was the important part
of the check. This protection is now provided by updating the write timestamp
cache on successful `TxnPush(ABORT)` requests (added in cockroachdb#33523). The GC queue
will always `PushTxn(ABORT)` transactions if it finds them `PENDING`, so the
`TxnSpanGCThreshold` check isn't giving us anything important.

For a visual summary of this change, look at the update to the state machine
diagram in `replica_tscache.go`.

### Migration Safety

In the next release cycle, this change will allow us to remove the
`TxnSpanGCThreshold` entirely because the protection it was providing is
now entirely handled by the timestamp cache.

In the meantime, we need to keep setting it to ensure that the protection still
works properly on old (v2.1) leaseholders. However, we're safe to no longer
consult it at all on new (v2.2) leaseholders. This is because even if a new node
is given the lease from an old node that isn't correctly bumping the timestamp
cache on `PushTxn` requests, we know that the old node will have a clock at
least as large as a txn in question's OrigTimestamp. That means that the new
leaseholder will necessarily bump its write timestamp cache to at least as large
as this txn's OrigTimestamp, so `Replica.CanCreateTxnRecord` will always fail.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 11, 2019
As of cockroachdb#33523, this test doesn't really make sense. The transaction model
is much more flexible to scenarios like this where a `PushTxn` is issued
before the transaction's record is written. This flexibility required
more targeted testing, which was added in cockroachdb#33523. As a result, we don't
need such a high-level integration test any more.

Instead, the following three subtests of `TestCreateTxnRecord`
sufficiently test that such a situation is handled correctly and
returns a retriable error like we were expecting:
- `TestCreateTxnRecord/begin_transaction_after_end_transaction_(abort)`
- `TestCreateTxnRecord/heartbeat_transaction_after_end_transaction_(abort)`
- `TestCreateTxnRecord/end_transaction_(commit)_after_end_transaction_(abort)`

Release note: None
craig bot pushed a commit that referenced this pull request Feb 12, 2019
34786: kv: delete TestDelayedBeginRetryable, which was replaced by unit tests r=nvanbenschoten a=nvanbenschoten


Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 13, 2019
Cleanup from cockroachdb#33523.

craig bot pushed a commit that referenced this pull request Feb 13, 2019
34778: storage: don't consult TxnSpanGCThreshold when creating txn records r=nvanbenschoten a=nvanbenschoten


Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 22, 2019
This was discussed in
cockroachdb#33523 (review)
but I missed digging into it at the time. I still don't think I have my head
fully wrapped around it, but I do enough to know that it's not something I want
to start pulling on right now.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 25, 2019
craig bot pushed a commit that referenced this pull request Feb 25, 2019
35077: workload: add indexes workload r=nvanbenschoten a=nvanbenschoten

Closes #34912.

This adds a new workload called `indexes`. Unlike `kv`, the workload is
specifically designed to stress secondary indexes. Its schema looks
like:

```
CREATE TABLE indexes (
    key     UUID  NOT NULL PRIMARY KEY,
    col0    INT   NOT NULL,
    col1    INT   NOT NULL,
    col2    INT   NOT NULL,
    col3    INT   NOT NULL,
    col4    INT   NOT NULL,
    col5    INT   NOT NULL,
    col6    INT   NOT NULL,
    col7    INT   NOT NULL,
    col8    INT   NOT NULL,
    col9    INT   NOT NULL,
    payload BYTES NOT NULL
)
```

and it includes knobs to specify how many of the "col" columns should be
indexed and whether they should be indexed using a UNIQUE specifier. The
workload also includes a knob to specify the size of the `payload`
column.

Release note: None

35116: sql: add support for extra session vars for pg compatibility r=knz a=knz

Fixes #35109.

This adds compatibility support for the following variables, with only
the default values that make sense in CockroachDB:

- `row_security`
- `synchronize_seqscans`,
- `lock_timeout`,
- `idle_in_transaction_session_timeout`

Release note: None

35134: storageccl: leave params in workload URIs r=dt a=dt

workload URIs do not contain anything sensitive and including them in the job make it easier see what it was actually doing.

Release note: none.

35151: sql: tolerate non-existent databases for plan cache invalidation r=knz a=knz

Fixes  #35145.

Release note (bug fix): CockroachDB again properly reports when a
database used during PREPARE does not exist any more when EXECUTE is
used.

35162: storage: remove TODOs in cmd_push_txn r=nvanbenschoten a=nvanbenschoten

This was discussed in #33523 (review) but I missed digging into it at the time. I still don't think I have my head fully wrapped around it, but I know enough to know that it's not something I want to start pulling on right now.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
craig bot pushed a commit that referenced this pull request Jul 11, 2019
38782: storage: persist minimum transaction timestamps in intents r=nvanbenschoten a=nvanbenschoten

Fixes #36089.
Informs #37199.

This commit addresses the second concern from #37199 (comment) by implementing its suggestion #1.

It augments the TxnMeta struct to begin storing the transaction's minimum timestamp. This allows pushers to have perfect accuracy into whether an intent is part of a transaction that can eventually be committed or whether it has been aborted by any other pusher and uncommittable.

This allows us to get rid of the false positive cases where a pusher incorrectly detects that a transaction is still active and begins waiting on it. In this worst case, this could lead to transactions waiting for the entire transaction expiration for a contending txn.

@tbg I'm assigning you because you reviewed most of the lazy transaction record stuff (especially #33523), but let me know if you'd like me to find a different reviewer.
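The benefit of persisting the minimum timestamp in `TxnMeta` can be sketched in one line. This is a simplification; in practice the check flows through `CanCreateTxnRecord`'s write timestamp cache consultation described earlier in this PR:

```go
package main

import "fmt"

// txnUncommittable reports whether the transaction that wrote an intent can
// never commit: if the write timestamp cache covering its record's key is at
// or above the transaction's persisted minimum timestamp, the record can
// never be created, so a pusher can treat the intent as abandoned
// immediately instead of waiting out the transaction expiration.
func txnUncommittable(intentMinTS, writeCacheTS int64) bool {
	return writeCacheTS >= intentMinTS
}

func main() {
	fmt.Println(txnUncommittable(100, 150)) // true: record creation is blocked
	fmt.Println(txnUncommittable(100, 50))  // false: txn may still commit
}
```

Without the minimum timestamp in the intent, a pusher could not distinguish these two cases and would conservatively assume the transaction was still active, producing the false positives described above.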

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>