storage: tolerate missing transaction records when pushing #33523
Conversation
Force-pushed from 1f1bc53 to e1201df
Here's a state transition diagram that reflects the transaction state machine after this change. Note that […]

The diagram also reveals a redundancy that we could consider simplifying in a later change. If we had PushTxn(ABORT) update the write timestamp cache then we wouldn't need TxnSpanGCThreshold at all.
Force-pushed from b280c0b to 6a9ef2e
I rebased this on top of #33396 and updated it with a few leftover comments from that PR. It's now ready for a look!
pkg/storage/replica_tscache.go, line 35 at r1 (raw file):
```go
) {
	readOnlyUseReadCache := true
	if r.store.Clock().MaxOffset() == timeutil.ClocklessMaxOffset {
```
what's the story with this "clockless mode"? Was someone (Ben?) recently trying to delete all the code around it?
If we don't wipe it all, what's the deal with this patch? Does it... break anything?
pkg/storage/txnwait/txnqueue.go, line 466 at r2 (raw file):
```go
pusheePriority := req.PusheeTxn.Priority
var pusheeTxnTimer timeutil.Timer
```
this commit is just cleanup unrelated to the following one, right?
Why do you say that this could result in starvation if the clock is bad?
I really like that diagram!
A few things for now; will look more.
pkg/roachpb/api.go, line 1031 at r3 (raw file):
```go
func (*HeartbeatTxnRequest) flags() int { return isWrite | isTxn }
func (*GCRequest) flags() int           { return isWrite | isRange }
func (*PushTxnRequest) flags() int      { return isWrite | isAlone | updatesReadTSCache }
```
would you mind adding a comment to the new updatesTSCache flag explaining that the push does it if it doesn't find a txn record, but doesn't do it if it does find one (if indeed what you described in the diagram is real)?
pkg/roachpb/errors.proto, line 127 at r3 (raw file):
```go
// A BeginTransaction, HeartbeatTxn, or EndTransaction(commit=true) request
// found an aborted transaction record. Another txn must have written this
```
wait - the "another txn" part is now false, isn't it?
And the "before our txn record was written" part is false... I think this whole comment should be rewritten, no?
pkg/roachpb/errors.proto, line 156 at r3 (raw file):
```proto
ABORT_REASON_ABORT_SPAN = 5;

// A requests attempting to create a transaction record encountered a
```
s/requests/request
pkg/roachpb/errors.proto, line 157 at r3 (raw file):
```proto
// A requests attempting to create a transaction record encountered a
// timestamp cache entry for the txn key, and the entry identifies this
```
s/timestamp cache entry/write timestamp cache entry (right?)
pkg/storage/gc_queue_test.go, line 718 at r3 (raw file):
```go
	txn.LastHeartbeat = hlc.Timestamp{WallTime: test.hb}
}
// Set a high Timestamp to make sure it does not matter. Only
```
do you remember how come you had to change this?
pkg/storage/replica_tscache.go, line 293 at r3 (raw file):
```go
// Also look in the write timestamp cache to see if there is an entry for
// this transaction, which would indicate this transaction has already been
// finalized or was already aborted by a conccurent transaction. If there is
```
conccurent - you need to double something, but not anything. Romanians have the same strrugles.
pkg/storage/replica_tscache.go, line 302 at r3 (raw file):
```go
// written intents at. We can't compare against the provisional commit
// timestamp because it may have moved forward over the course of a single
// epoch and we can't compare against the current epoch's original timestamp
```
nit: s/current epoch's original timestamp/(current epoch's) OrigTimestamp

Also, this comment about OrigTimestamp I think belongs in the method comment, not here (here there's no OrigTimestamp to speak of even if we did want to compare against it). And I wanted to ask for hints about what the txnMinTSUpperBound is about anyway in the method comment. Or rephrase the comment completely - see next.

Also :) - I think the part about "the minimum timestamp that the transaction could have written intents at" deserves more explanation. Cause this is all elegant and interesting, but... complex. Say something about how the check needs to be paired up with the actors who could have possibly intended to leave a marker in the wtscache. What timestamps would those people have used? They would have used an intent's timestamp (for push[abort]) or the epochZeroOrigTS for commits (presumably... but I haven't found that code).
pkg/storage/replica_tscache.go, line 320 at r3 (raw file):
nit:
```go
// aborted us before our transaction record was written. It obeyed
// the invariant that it couldn't create a transaction record for
```
I don't know if the use of an "invariant" really works here. An invariant is more like a property that all states in a state space have, not about who does what. Like, the gossip module doesn't say it has an invariant about how no other code starts randomly sending gossip messages.
But do what you will.
pkg/storage/replica_tscache.go, line 339 at r3 (raw file):
```go
}
return true, minCommitTS, 0
```
0?!? Use a constant.
pkg/storage/batcheval/cmd_end_transaction.go, line 353 at r3 (raw file):
```go
// will be too old. Replays of requests which attempt to create a new txn
// record (BeginTransaction, HeartbeatTxn, or EndTransaction) never succeed
// because EndTransaction inserts in the write timestamp cache, forcing the
```
would you mind pointing in this comment (or just telling me) where exactly the wTSCache is updated here?
pkg/storage/batcheval/eval_context.go, line 68 at r3 (raw file):
```go
ContainsKey(key roachpb.Key) bool

// CanCreateTxnRecord determines whether a transaction record can be created
```
nit: say that this needs to be called before a txn record is written, and say that it protects against replays (Begin/Heartbeat after Commit/Rollback/Push[Abort]) and against committing transactions that have already been pushed / aborted by others.
I might be able to answer this question myself, but perhaps it's worth discussing somewhere public:

Txn record creation is "blocked" via an entry into the write ts cache (the no record -> pushtxnabort -> pushtxn case) by writing now(). Is it possible that txn.orig is larger than now? After all, the txn may never have interacted with the node that houses the slot for the txn record. Right now, we anchor txns at the first write, but is this now critical for correctness? I think it'd be good to think through the case in which a txn record can be anywhere, including on ranges never touched by the txn otherwise. Are we relying on the push propagating the OrigTimestamp into the clock before generating now()?
pkg/roachpb/api.go, line 1031 at r3 (raw file):
```go
func (*HeartbeatTxnRequest) flags() int { return isWrite | isTxn }
func (*GCRequest) flags() int           { return isWrite | isRange }
func (*PushTxnRequest) flags() int      { return isWrite | isAlone | updatesReadTSCache }
```
There's no flag for the wTSCache, right? It's unexpected to not see it here.
pkg/storage/txnwait/txnqueue.go, line 466 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
this commit is just cleanup unrelated to the following one, right?
Why do you say that this could result in starvation if the clock is bad?
I also don't understand how a manual clock (which I assume is what's meant) makes this loop hot. Perhaps now() is very close to expiration?
The diagram also reveals a redundancy that we could consider simplifying in a later change. If we had PushTxn(ABORT) update the write timestamp cache then we wouldn't need TxnSpanGCThreshold at all.
Yeah, I happen to like this idea. Let's get rid of TxnSpanGCThreshold! Would you mind leaving a TODO on the definition of TxnSpanGCThreshold with your notes from here plus what we discussed: that we could simply have the GC process insert a span in the wtsc that covers all the (range-local) txn records. Although what would that give us exactly, besides not allowing people to write (new) txn records below this GC threshold (tbh, I'm unsure if disallowing that is desirable or not).
I might be inclined to pull on that string. I think it'd make the GC process not be something that we constantly have to remember about and take into account.
still looking
pkg/storage/replica_test.go, line 3195 at r3 (raw file):
```go
// TestTxnRecordUnderTxnSpanGCThreshold verifies that aborting transactions
// which haven't written their initial txn record yet does not lead to
// anomalies. Precisely, verify that if the GC queue could potentially have
```
this comment is now stale, right?
pkg/storage/replica_test.go, line 3215 at r3 (raw file):
```go
pusher := newTransaction("pusher", key, 1, tc.Clock())

// This pushee should never be allowed to write a txn record because it
```
stale comment
pkg/storage/replica_test.go, line 3297 at r3 (raw file):
```go
// a transaction record, it must first abort it, which is recorded in the
// write timestamp cache. Should we remove this logic or do we still like it?
// Removing it would allow new transactions beneath the GC threshold, but
```
As I was saying in a review comment, I'd move discussion of this good stuff to the definition of TxnSpanGCThreshold and perhaps link to that here.
pkg/storage/batcheval/cmd_end_transaction.go, line 353 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
would you mind pointing in this comment (or just telling me) where exactly the wTSCache is updated here?
you've told me, but I think a note here would be useful
pkg/storage/replica_tscache.go, line 86 at r3 (raw file):
```go
// ensures that we can safely update the timestamp cache based
// on this value.
// TODO(nvanbenschoten): Is this necessary? Is this the right
```
@tbg, Nathan and I are discussing things here and would like some opinion.

The issue is that we're about to put something in the ts cache. That something needs to be <= the node's current clock.

The value that we're about to put is the timestamp to which we're pushing the pushee. This timestamp comes from the pusher (another node), which is pushing to a timestamp <= its clock. The PushTxn request doesn't have a header.Txn filled in [1]. I think this means that it also doesn't have header.Timestamp filled in (as clients don't put timestamps on non-transactional requests; the timestamps for these requests are filled in by the store). So, I think this all means that somewhere we need to be doing r.store.Clock().Update(pushee.Timestamp), as the patch has below. I suggested we do that during evaluation, in evalPushTxn(). What do you think?

Moreover, the patch does reply.PusheeTxn.Timestamp = args.PushTo.Add(0, 1) in evalPushTxn because it might need to write the txn record with that timestamp. And so, particularly since we're generating a timestamp out of thin air, I think the node's clock should be updated immediately to account for that generated timestamp, right?

[1] if cArgs.Header.Txn != nil {
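To make the suggested fix concrete, here is a minimal runnable sketch with a toy clock type. Only the idea of forwarding the clock during evaluation and the r.store.Clock().Update(pushee.Timestamp) call it mirrors come from the discussion above; all names and types here are illustrative, not the real hlc.Clock API:

```go
package main

import (
	"fmt"
	"sync"
)

// clock is a toy HLC stand-in; the real code would go through r.store.Clock().
type clock struct {
	mu  sync.Mutex
	now int64
}

// Update forwards the clock to at least the observed timestamp, mirroring
// the r.store.Clock().Update(pushee.Timestamp) call suggested above.
func (c *clock) Update(observed int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if observed > c.now {
		c.now = observed
	}
}

func main() {
	c := &clock{now: 100}
	pusheeTS := int64(150) // timestamp the pushee was pushed to, from another node
	c.Update(pusheeTS)     // done during evaluation, i.e. in evalPushTxn
	fmt.Println(c.now)     // 150: the local clock is no longer behind the pushed timestamp
}
```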
pkg/storage/batcheval/cmd_push_txn.go, line 138 at r3 (raw file):
```go
// To determine which case we're in, we check whether the transaction could
// ever write a transaction record. We do this by using the metadata from
// the intent and attempting to synthesize a transaction record (TODO?). If
```
TODO?
pkg/storage/replica_tscache.go, line 86 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…

@tbg, Nathan and I are discussing things here and would like some opinion.

The issue is that we're about to put something in the ts cache. That something needs to be <= the node's current clock.

The value that we're about to put is the timestamp to which we're pushing the pushee. This timestamp comes from the pusher (another node), which is pushing to a timestamp <= its clock. The PushTxn request doesn't have a header.Txn filled in [1]. I think this means that it also doesn't have header.Timestamp filled in (as clients don't put timestamps on non-transactional requests; the timestamps for these requests are filled in by the store). So, I think this all means that somewhere we need to be doing r.store.Clock().Update(pushee.Timestamp), as the patch has below. I suggested we do that during evaluation, in evalPushTxn(). What do you think?

Moreover, the patch does reply.PusheeTxn.Timestamp = args.PushTo.Add(0, 1) in evalPushTxn because it might need to write the txn record with that timestamp. And so, particularly since we're generating a timestamp out of thin air, I think the node's clock should be updated immediately to account for that generated timestamp, right?

[1] if cArgs.Header.Txn != nil {
Maybe the argument that the node's clock should be below any logical timestamp we operate with is not quite founded - in applyTimestampCache we do nextTS := rTS.Next(), which I guess can return a value above the local clock, and we seem cool with that.
LGTM!
pkg/storage/batcheval/transaction.go, line 135 at r3 (raw file):
```go
// Determine whether the transaction record could ever actually be written
// in the future. We provide the intent's timestamp as the upper bound on
```
"intent"? What "intent"? Perhaps put more hints in the function comment about where this meta is coming from.
Force-pushed from 454d664 to 275bcc6
Thanks for the reviews here!
pkg/storage/replica_test.go, line 3195 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
is cleaned up
Done.
pkg/storage/replica_test.go, line 4740 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
But it doesn't verify that it's finalized, it populates the write ts and pretends the txn is aborted, right? This comment somehow makes it seem like the push would return a COMMITTED proto based on some intel from the tsCache.
Done.
pkg/storage/replica_test.go, line 4955 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Why did this get removed?
I think I was mixing up reply.Txn and reply.PusheeTxn. The test still makes sense. I reverted my changes here and fixed the test instead.
pkg/storage/replica_tscache.go, line 86 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…

Yes, this (calling .Next() and thus leaking a possible future timestamp) is a problem that we should address (though it's hopefully only a theoretical one). The problem is likely theoretical because reading a clock value from the hlc already also increments the logical (if it doesn't already do it to the wall time), so you need multiple logical ticks to pile up for something to get out of order. The code for the fix would be nextTS := rTS.Next(); hlc.Update(nextTS) but it's starting to really irk me that we're grabbing the hlc lock in so many places (this can be optimized if we had a "slightly stale" reading of the clock, because most of the time nextTS is below anyway and so nothing needs to be updated).

The PushTxn issue is also an issue. The fact that we upgrade the hlc "manually" is bad. The way this could work conceptually is that each request to another node has a hlc.Now() reading that is applied at the recipient before doing anything else with that message. I think the discussion should move to an issue, would you mind filing one if you haven't already?
I just opened #33632. We can discuss further there.
pkg/storage/replica_tscache.go, line 90 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I wonder what could happen without this. We'll put something in the timestamp cache that is a valid read of some node's clock, but perhaps not this one. We won't cause cascading updates (the original mantra of "never put manual timestamps into the clock" came from the manual offset looping around multiple times, until it exceeded MaxOffset and caused trouble), and we may return a lower observed timestamp than we would otherwise. Lowering an observed timestamp is only a problem if in doing so we let a reader txn avoid an uncertainty restart, but that presupposes an intent above the accidentally "lower" observed timestamp. I think in the end this update doesn't matter for correctness (what really matters is that writes populate the hlc correctly with their provisional write ts), but it sure is hard to reason about it all.
My fear here was that without this the node could have a lower clock than the pushee's pushed timestamp. It could then transfer its lease and the new leaseholder could create a low water mark entry at a lower timestamp than pushee.Timestamp. The transaction record could then be created with too low of a timestamp.
pkg/storage/replica_tscache.go, line 241 at r6 (raw file):
Could you make vs == nil more precise? The timestamp cache always has a low water mark, so you're going to get some timestamp. I assume you want v2 >= txn.orig?
Done.
>= txn.ts would be better for long-running txns, but probably incorrect
Yes, this is exactly why this PR diverges from the RFC and performs the validation when creating transaction records instead of when committing them. This gives a transaction an upper-bound on its window of vulnerability instead of a window that scales proportionally with the duration of the transaction.
Also, do you have heuristics on how fast the timestamp cache's write ts rises? With all commits populating it, it'll be at a steady pace.
No faster than the read timestamp cache rising on every read, and that will have roughly the same effect. This effect also isn't new, so I'm not too worried about that causing a sudden regression.
Here's the math though:

interval skl key size = size(LocalTransactionSuffix) + size(anchor key) + size(uuid) = 20 + size(anchor key)
interval skl val size = encodedValSize = 32
interval skl node size = 32 (assuming an average tower height of 2)
interval skl entry size (assuming 64 byte keys) = 84 + 32 + 32 = 148
interval skl page size = 32 MB
entries per page = 32 MB / 148 = 226,719 txns per page

so in order to reach the crossover point where the MinRetentionWindow of 10s is needed and we push past the standard 2 skl pages, we need to be performing ~45,000 tps (non-1PC) on a given node's leaseholders. That's not crazy high, but I also don't expect to hit that any time soon.
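The same arithmetic as a runnable sketch; the sizes are the ones quoted above, and the 64-byte anchor key is the stated assumption:

```go
package main

import "fmt"

func main() {
	const (
		suffixPlusUUID = 20 // size(LocalTransactionSuffix) + size(uuid)
		anchorKey      = 64 // assumed average anchor key size
		valSize        = 32 // encodedValSize
		nodeSize       = 32 // assuming an average tower height of 2
		entrySize      = suffixPlusUUID + anchorKey + valSize + nodeSize // 148 bytes
		pageSize       = 32 << 20                                        // 32 MB interval skl page
	)
	entriesPerPage := pageSize / entrySize
	fmt.Println(entriesPerPage) // 226719 txns per page
	// Crossover: filling the two standard skl pages within the 10s MinRetentionWindow.
	fmt.Println(2 * entriesPerPage / 10) // 45343 (non-1PC) tps
}
```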
pkg/storage/replica_tscache.go, line 254 at r6 (raw file):
Is it intentional that you're not updating v2 in the PushTxn(ABORT) on the right
No, that was a typo. We do this already in this PR.
but I'm curious if you've considered eager GC unconditionally which seems possible if you do update v2 on all aborting pushes
I haven't thought that all the way through to its conclusion, but I like this idea. This also ties in to the TODOs I scattered around with removing the TxnSpanGCThreshold checks. I added the idea to the TODO.
It would also simplify batch gc of transactions because after my suggestion, any finalized txn record can simply be removed because they're always protected by wTS, and we don't have to pessimistically check/populate the ts cache as we remove the records. It's also a nice invariant that finalized records are always protected by v2.
Yes, I completely agree. I'm going to save some of this for a follow-up PR, but you're right that there's still room to simplify.
pkg/storage/replica_tscache.go, line 312 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Or if the cache got rolled over due to size.
Done.
pkg/storage/replica_tscache.go, line 322 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Is this reason adequately named and is it returned anywhere else? Consider renaming and/or adding a new value that disambiguates this case.
Yes, it's the exact same reason we return when an actual ABORTED transaction record is found.
pkg/storage/batcheval/cmd_push_txn.go, line 76 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Is this important? Would things go awry if that changed in the future?
Yes, but it's not a correctness issue. Added to this comment.
pkg/storage/batcheval/cmd_push_txn.go, line 79 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Mention that the expiration is "courtesy" -- if we'd simply abort txns right away that would usually result in worse performance, but it would still be correct.
Good point, done.
pkg/storage/batcheval/cmd_push_txn.go, line 91 at r6 (raw file):
How does it know that the coordinator has found out
Why does it care whether the coordinator has found out?
pkg/storage/batcheval/cmd_push_txn.go, line 264 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
for a
Done.
pkg/storage/batcheval/eval_context.go, line 69 at r6 (raw file):
isn't that a lower bound
I think it's an upper bound. It's bounding the possible values of the transaction's minimum timestamp from the top.
Also clarify that it's write timestamp.
It doesn't really need to be the write timestamp. For instance, a transaction may never write at its OrigTimestamp.
It's weird that a bunch of info is here and then the state diagram is on the implementation, I went and looked up this comment because I wanted to understand the txnMinTSUpperBound. Consider moving both explanations in one place.
Done.
pkg/storage/batcheval/eval_context.go, line 83 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This makes it sound like there's nothing "correct" that they know. Hint at what they should pass instead.
Done.
pkg/storage/batcheval/transaction.go, line 117 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
It also couldn't be written as ABORTED, right? (Not that that's important)
I think it could be. We allow rollbacks through.
bors r+

33523: storage: tolerate missing transaction records when pushing r=nvanbenschoten a=nvanbenschoten

Informs #25437. Informs #32971.

This is the second part of addressing #32971. The final part will be updating the txn client to stop sending BeginTxn requests.

The change closely resembles what is laid out in the corresponding RFC sections (`PushTxn` and `QueryTxn / txnwait.Queue`). Both `PushTxn` requests and `QueryTxn` requests are adjusted to tolerate missing transaction records and to synthesize them based on the information pulled from intents if necessary. By hiding these details behind the request API abstraction, we don't actually need to modify the `txnwait.Queue` at all!

The change does diverge from the RFC in two distinct ways. The first is that it introduces a new invariant about transaction record creation. Transaction records can now only ever be created by requests originating from their own coordinator (`BeginTxn`, `HeartbeatTxn`, and `EndTxn`). They can never be created in any state (not even ABORTED) by concurrent actors. This is a much stronger invariant than what previously existed (and even what was proposed in the RFC), and it simplifies a lot of logic by dramatically refining the transaction state machine. We now know that for a transaction record to exist, it must have been created by its own coordinator and it must have checked with `CanCreateTxnRecord`.

Making this new invariant work required the second major divergence from the RFC. The RFC suggested that we use the read timestamp cache for successful transaction timestamp pushes before a transaction record is written. This PR takes this a step further and uses the write timestamp cache for successful transaction aborts before a transaction record is written. In doing this, we emerge with a very sane timestamp cache policy - the read timestamp cache is used to push transaction timestamps and the write timestamp cache is used to abort them entirely (which it was already essentially being used for). The txnID marker on these timestamp cache entries becomes the transaction that initiated the push/abort. Transactions then consult both of these sources before creating their transaction record in `CanCreateTxnRecord`. In doing this, we avoid an entire class of situations where concurrent transactions created and abandoned transaction records for another transaction, requiring eventual GC.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

Build succeeded
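To make the merged policy concrete, here is a minimal sketch of a `CanCreateTxnRecord`-style check that consults both caches. The types and signatures are illustrative stand-ins, not CockroachDB's actual API; only the policy itself (the write cache vetoes record creation, the read cache ratchets the minimum commit timestamp) comes from the description above:

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for hlc.Timestamp.
type Timestamp struct{ WallTime int64 }

func (t Timestamp) Less(o Timestamp) bool { return t.WallTime < o.WallTime }

// tsCache is a toy timestamp cache: txn key -> highest timestamp recorded.
type tsCache map[string]Timestamp

// canCreateTxnRecord mirrors the policy described above: the write timestamp
// cache vetoes record creation entirely (a PushTxn(ABORT) or EndTxn already
// ran), while the read timestamp cache only ratchets up the timestamp the
// transaction must commit at.
func canCreateTxnRecord(txnKey string, txnMinTS Timestamp, rCache, wCache tsCache) (ok bool, minCommitTS Timestamp) {
	if wTS, found := wCache[txnKey]; found && !wTS.Less(txnMinTS) {
		// Aborted or finalized before the record was written: refuse creation.
		return false, Timestamp{}
	}
	if rTS, found := rCache[txnKey]; found && !rTS.Less(txnMinTS) {
		// Successfully pushed before the record was written: allow creation,
		// but force the transaction to commit above the pushed timestamp.
		return true, Timestamp{WallTime: rTS.WallTime + 1}
	}
	return true, txnMinTS
}

func main() {
	rCache := tsCache{"txn1": {WallTime: 50}}
	wCache := tsCache{"txn2": {WallTime: 50}}
	fmt.Println(canCreateTxnRecord("txn1", Timestamp{WallTime: 10}, rCache, wCache)) // true {51}
	fmt.Println(canCreateTxnRecord("txn2", Timestamp{WallTime: 10}, rCache, wCache)) // false {0}
}
```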
33566: kv: stop using BeginTransaction requests r=nvanbenschoten a=nvanbenschoten

Based on #33523. Closes #25437.

This is the final part of addressing #32971. The change gates the use of BeginTransaction on a cluster version. When a cluster's version is sufficiently high, it will stop sending them.

To make this work, a small change was needed for the detection of 1PC transactions. We now use sequence numbers to determine that a batch includes all of the writes in a transaction. This is in line with the proposal from the RFC's `1PC Transactions` section and will work correctly with parallel commits.

Release note (performance improvement): Reduce the network and storage overhead of multi-Range transactions.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
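A hedged sketch of the sequence-number idea from that description; the contiguity check below is an illustration of the approach, not the exact CockroachDB code:

```go
package main

import "fmt"

// batchIsCompleteTxn sketches 1PC detection via sequence numbers: if a batch
// that ends in EndTransaction carries writes whose sequence numbers run
// contiguously from 1, it must contain every write the transaction performed.
func batchIsCompleteTxn(writeSeqs []int32, hasEndTxn bool) bool {
	if !hasEndTxn {
		return false
	}
	for i, seq := range writeSeqs {
		if seq != int32(i+1) { // writes get 1-based, monotonically increasing seqs
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(batchIsCompleteTxn([]int32{1, 2, 3}, true)) // true: eligible for 1PC
	fmt.Println(batchIsCompleteTxn([]int32{2, 3}, true))    // false: write 1 happened in an earlier batch
}
```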
pkg/storage/replica_tscache.go, line 90 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
My fear here was that without this the node could have a lower clock than the pushee's pushed timestamp. It could then transfer its lease and the new leaseholder could create a low water mark entry at a lower timestamp than pushee.Timestamp. The transaction record could then be created with too low of a timestamp.
👍
pkg/storage/replica_tscache.go, line 241 at r6 (raw file):
instead of when committing them
ah, I had forgotten about that. Makes sense and I'm not concerned about the timestamp cache in that case.
pkg/storage/replica_tscache.go, line 322 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yes, it's the exact same reason we return when an actual ABORTED transaction record is found.
You don't think it's worth having a separate reason for this? (Or more generally, metrics)?
pkg/storage/batcheval/cmd_push_txn.go, line 91 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
How does it know that the coordinator has found out
Why does it care whether the coordinator has found out?
If the coordinator is still running, the txn might try to read its values, but the intents are gone and we might've cleaned up the abort span as well.
pkg/storage/batcheval/cmd_push_txn.go, line 272 at r9 (raw file):
```go
// If the transaction record was already present, persist the updates to it.
// If not, then we don't want to create it. This could allow for finalized
// transactions to be revived. Instead, we obey the invariant that only the
```
Only if we created a PENDING one. We could create ABORTED records, it just doesn't seem to make much sense.
pkg/storage/batcheval/eval_context.go, line 69 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
isn't that a lower bound
I think it's an upper bound. It's bounding the possible values of the transaction's minimum timestamp from the top.
Also clarify that it's write timestamp.
It doesn't really need to be the write timestamp. For instance, a transaction may never write at its OrigTimestamp.
It's weird that a bunch of info is here and then the state diagram is on the implementation, I went and looked up this comment because I wanted to understand the txnMinTSUpperBound. Consider moving both explanations in one place.
Done.
A lower bound on a set (plus partial order) is an element that is less than or equal to any element of the set. It still seems to me that you want a lower bound. As an example, hlc.Timestamp{} is a lower bound but hlc.MaxTimestamp is an upper bound.
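For reference, the standard order-theoretic definitions being appealed to:

```latex
l \text{ is a lower bound of } S \iff \forall s \in S:\; l \le s,
\qquad
u \text{ is an upper bound of } S \iff \forall s \in S:\; s \le u.
```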
34786: kv: delete TestDelayedBeginRetryable, which was replaced by unit tests r=nvanbenschoten a=nvanbenschoten

As of #33523, this test doesn't really make sense. The transaction model is much more flexible to scenarios like this where a `PushTxn` is issued before the transaction's record is written. This flexibility required more targeted testing, which was added in #33523. As a result, we don't need such a high-level integration test any more.

Instead, the following three subtests of `TestCreateTxnRecord` sufficiently test that such a situation is handled correctly and returns a retriable error like we were expecting:

- `TestCreateTxnRecord/begin_transaction_after_end_transaction_(abort)`
- `TestCreateTxnRecord/heartbeat_transaction_after_end_transaction_(abort)`
- `TestCreateTxnRecord/end_transaction_(commit)_after_end_transaction_(abort)`

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
34778: storage: don't consult TxnSpanGCThreshold when creating txn records r=nvanbenschoten a=nvanbenschoten

Cleanup from #33523.

This commit weakens the semantics of `Replica.CanCreateTxnRecord` slightly to allow it to ignore the `TxnSpanGCThreshold`. `CanCreateTxnRecord` no longer consults the `TxnSpanGCThreshold`, so it doesn't prevent the creation of new transaction records beneath it. That's ok though, because we still prevent the recreation of aborted and GCed transaction records, which was the important part of the check. This protection is now provided by updating the write timestamp cache on successful `TxnPush(ABORT)` requests (added in #33523). The GC queue will always `PushTxn(ABORT)` transactions if it finds them `PENDING`, so the `TxnSpanGCThreshold` check isn't giving us anything important.

For a visual summary of this change, look at the update to the state machine diagram in `replica_tscache.go`.

### Migration Safety

In the next release cycle, this change will allow us to remove the `TxnSpanGCThreshold` entirely because the protection it was providing is now entirely handled by the timestamp cache. In the meantime, we need to keep setting it to ensure that the protection still works properly on old (v2.1) leaseholders. However, we're safe to no longer consult it at all on new (v2.2) leaseholders. This is because even if a new node is given the lease from an old node that isn't correctly bumping the timestamp cache on `PushTxn` requests, we know that the old node will have a clock at least as large as a txn in question's OrigTimestamp. That means that the new leaseholder will necessarily bump its write timestamp cache to at least as large as this txn's OrigTimestamp, so `Replica.CanCreateTxnRecord` will always fail.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
35077: workload: add indexes workload r=nvanbenschoten a=nvanbenschoten

Closes #34912.

This adds a new workload called `indexes`. Unlike `kv`, the workload is specifically designed to stress secondary indexes. Its schema looks like:

```
CREATE TABLE indexes (
    key     UUID  NOT NULL PRIMARY KEY,
    col0    INT   NOT NULL,
    col1    INT   NOT NULL,
    col2    INT   NOT NULL,
    col3    INT   NOT NULL,
    col4    INT   NOT NULL,
    col5    INT   NOT NULL,
    col6    INT   NOT NULL,
    col7    INT   NOT NULL,
    col8    INT   NOT NULL,
    col9    INT   NOT NULL,
    payload BYTES NOT NULL
)
```

and it includes knobs to specify how many of the "col" columns should be indexed and whether they should be indexed using a UNIQUE specifier. The workload also includes a knob to specify the size of the `payload` column.

Release note: None

35116: sql: add support for extra session vars for pg compatibility r=knz a=knz

Fixes #35109.

This adds compatibility support for the following variables, with only the default values that make sense in CockroachDB:

- `row_security`
- `synchronize_seqscans`
- `lock_timeout`
- `idle_in_transaction_session_timeout`

Release note: None

35134: storageccl: leave params in workload URIs r=dt a=dt

workload URIs do not contain anything sensitive and including them in the job makes it easier to see what it was actually doing.

Release note: none.

35151: sql: tolerate non-existent databases for plan cache invalidation r=knz a=knz

Fixes #35145.

Release note (bug fix): CockroachDB again properly reports when a database used during PREPARE does not exist any more when EXECUTE is used.

35162: storage: remove TODOs in cmd_push_txn r=nvanbenschoten a=nvanbenschoten

This was discussed in #33523 (review) but I missed digging into it at the time. I still don't think I have my head fully wrapped around it, but I do enough to know that it's not something I want to start pulling on right now.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
38782: storage: persist minimum transaction timestamps in intents r=nvanbenschoten a=nvanbenschoten

Fixes #36089. Informs #37199.

This commit addresses the second concern from #37199 (comment) by implementing its suggestion #1. It augments the TxnMeta struct to begin storing the transaction's minimum timestamp. This allows pushers to have perfect accuracy into whether an intent is part of a transaction that can eventually be committed or whether it has been aborted by any other pusher and uncommittable. This allows us to get rid of the false positive cases where a pusher incorrectly detects that a transaction is still active and begins waiting on it. In this worst case, this could lead to transactions waiting for the entire transaction expiration for a contending txn.

@tbg I'm assigning you because you reviewed most of the lazy transaction record stuff (especially #33523), but let me know if you'd like me to find a different reviewer.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>