Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: make BeginTxn optional proposal #25437

Closed
andreimatei opened this issue May 11, 2018 · 13 comments
Closed

kv: make BeginTxn optional proposal #25437

andreimatei opened this issue May 11, 2018 · 13 comments
Labels
A-kv-client Relating to the KV client and the KV interface. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@andreimatei
Copy link
Contributor

andreimatei commented May 11, 2018

If a transaction sends a batch that touches multiple ranges, and if the sub-batch for the anchor range succeeds, but another one fails, we are now in a situation where the txn record has been written, but the client.Txn is not aware of that fact so it will attempt to send another BeginTxn with the next batch.
So, if the client choses to ignore the error (say, it's an expected ConditionFailedError on an InitPut), and send another batch, that batch will have an BeginTxn which fails with RETRY_POSSIBLE_REPLAY because the txn record already exists. If you're inside a db.Txn function, for example, you'll retry endlessly.

On the other hand, if we don't send the BeginTxn again then, in case it does exist, it would fail (because of its replay protection via the timestamp cache).

Discussing this with @tschottdorf, this is the proposal. We're moving towards a world with best-effort BeginTxn and idempotent EndTxn.

  • We'll have client.Txn never send more than one BeginTxn. Once a BeginTxn has been sent, regardless of whether it succeeded or not (or if it's unclear whether it did), there's no second BeginTxn. This way there's no failure when continuing to write in a transaction whose BeginTxn batch has previously failed. (*)
  • We'll make EndTxn succeed if it doesn't find a transaction record. In this case, the EndTxn will simply write the transaction record and declare success. The txn record might not exist because it was never written, or it might not exist because this is a replay and it had already been cleaned up. Either case is fine.
  • We'll make it so that the TxnCoordSender never sends more than one EndTxn. This ensures that a commit doesn't race with a rollback such that the rollback succeeds and then the commit also succeeds.
  • EndTxn will continue to populate the timestamp cache, as it currently does, and BeginTxn will continue to consult it. This way, we continue to be protected against a replayed BeginTxn recreating a phantom txn record after the corresponding EndTxn.

(*) This also simplifies the logic in client.Txn a bunch, which currently uses a combination of TxnProto.Writing and writingTxnRecord to figure out whether a BeginTxn/EndTxn needs to be sent. Beside being complex, this logic is also broken because it's not in sync with the logic used by the TxnCoordSender when starting the hearbeat loop. It's currently possible for the TCS to think that there might be intents laid down and so there's a heartbeat loop, but for the client.Txn to think that the transaction is read-only and so an EndTxn does not need to be sent (in which case nobody stops the heartbeat loop). Attempting to fix these things led to finding this current bug.

cc @tschottdorf @nvanbenschoten @bdarnell

@andreimatei andreimatei added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-client Relating to the KV client and the KV interface. labels May 11, 2018
@andreimatei andreimatei self-assigned this May 11, 2018
@andreimatei
Copy link
Contributor Author

gist for a test demonstrating the issue:
https://gist.github.com/andreimatei/39a88ba19b11d4877dcb9ddbf688eb19

@andreimatei
Copy link
Contributor Author

Update the proposal above.

@bdarnell
Copy link
Contributor

So, if the client choses to ignore the error (say, it's an expected ConditionFailedError on an InitPut),

Note that we do not yet support continuing after an error like this, and whenever this comes up we usually decide we'd rather add an option to turn ConditionFailedError into a non-error return value instead of supporting continuing after an error.

We'll make EndTxn succeed if it doesn't find a transaction record

This is a big change from the current semantics. A missing txn record is currently equivalent to an aborted one. I'll need a lot more convincing that "either case is fine". Is the transaction state machine just obsolete because it predates our current uses of the timestamp cache and abort span?

EndTxn will continue to populate the timestamp cache, as it currently does, and BeginTxn will continue to consult it. This way, we continue to be protected against a replayed BeginTxn recreating a phantom txn record after the corresponding EndTxn.

What about repeated EndTxns without BeginTxn? EndTxn doesn't consult the timestamp cache.

If we can (sometimes) succeed without a BeginTxn, why would we ever send one? Is BeginTxn now simply the first HeartbeatTxn?

andreimatei referenced this issue May 14, 2018
Release note (sql change): introduce experimental statements `CREATE ROLE`
and `DROP ROLE` to create/delete roles.

Part of [role-based access
control](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20171220_sql_role_based_access_control.md).

This is almost exactly the same as CREATE/DROP USER, with the change of a
single field. Thus, we do the following:
* introduce wrapped plans in CCL code
* in CCL: check license then call into OSS code
* lookup "isRole" on conflict for clearer error message
* add node-creation helpers to `PlanHookState`

We really need to improve CCL/OSS interaction. eg:
* better OSS-build errors: "unknown statement type" sucks
* make more OSS code reusable from CCL: maybe export planner and nodes
@andreimatei
Copy link
Contributor Author

Tobi, Ben and I discussed this offline. Saying that "one can't continue with a txn after a non-retriable error" seems indeed good for now, so that takes away the pressure from doing anything here.

However the proposal generally seemed like a good idea.

What about repeated EndTxns without BeginTxn? EndTxn doesn't consult the timestamp cache.

I think the answer is that, if there's any writes in the EndTxn batch, then there'll be a WriteTooOldError. Otherwise, the EndTxn will succeed but it will be inconsequential. This relies on the fact that a client never sends a Rollback while it's waiting for the results of a Commit - otherwise declaring success on a EndTxn(commit=true) that doesn't find a record could be bad.

If we can (sometimes) succeed without a BeginTxn, why would we ever send one? Is BeginTxn now simply the first HeartbeatTxn?

Indeed, the BeginTxn would not be strictly needed, but it seems like a good idea to send it early in order to stave of would-be pushers.

@andreimatei andreimatei removed their assignment May 18, 2018
@andreimatei andreimatei changed the title kv: mixed-success in batches with BeginTransaction can lead to endless retries kv: make BeginTxn optional proposal May 18, 2018
@andreimatei andreimatei removed the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label May 18, 2018
@bdarnell
Copy link
Contributor

I think the answer is that, if there's any writes in the EndTxn batch, then there'll be a WriteTooOldError. Otherwise, the EndTxn will succeed but it will be inconsequential.

Why would it be inconsequential? Note that EndTxn is normally sent in a batch by itself for any non-1PC transaction.

This relies on the fact that a client never sends a Rollback while it's waiting for the results of a Commit - otherwise declaring success on a EndTxn(commit=true) that doesn't find a record could be bad.

Are you sure about that? Today there's tryAsyncAbort, which can happen at any time, and I think even then we could attempt a commit and then abort while the commit is in an ambiguous state.

@andreimatei
Copy link
Contributor Author

Are you sure about that? Today there's tryAsyncAbort, which can happen at any time, and I think even then we could attempt a commit and then abort while the commit is in an ambiguous state.

I'm currently getting rid of tryAsyncAbort.
Attempting a rollback when the commit state is ambiguous I think would be OK as long as:
a) we're no longer waiting for the result of the commit / the client has been informed that it's ambiguous
b) no conclusion is drawn from the success of the rollback - i.e. we don't tell the client that we rolled back when in fact we might have committed.
Attempting rollbacks for cleanup should be permissible.

Does this answer your concern?

@bdarnell
Copy link
Contributor

Yes, I think so.

@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jul 22, 2018
@tbg tbg added this to the 2.2 milestone Jul 22, 2018
@petermattis petermattis removed this from the 2.2 milestone Oct 5, 2018
@nvanbenschoten
Copy link
Member

Why would it be inconsequential? Note that EndTxn is normally sent in a batch by itself for any non-1PC transaction.

Because it won't be committing anything that should already be committed, right? It could be pointing at intents that don't exist because they have already been resolved, but we already handle that case.

@bdarnell
Copy link
Contributor

Because it won't be committing anything that should already be committed, right? It could be pointing at intents that don't exist because they have already been resolved, but we already handle that case.

My concern is that this opens the door for a transaction to transition from an aborted state to a committed state, and for some of its intents to be resolved while it is considered to be aborted (by a concurrent pusher) and others to be resolved while it is committed. We currently prevent this by the combination of the transaction state machine (specifically that a transaction is only allowed to transition to a committed state from a pending state) and the timestamp cache. If we relax the transaction state transition rules, we'd be relying solely on the timestamp cache (as I understand it). Is that sufficient? It might be, but I don't think I've seen a strong enough argument for that yet.

@nvanbenschoten
Copy link
Member

My reading of the transaction abortion logic is that a push that doesn't find a txn record writes one in the aborted state. It then leaves this record around until the record is GCed or an EndTransaction(commit=false) comes along. This means that as long as a client never sends EndTransaction(commit=false) and then EndTransaction(commit=true), it won't be able to move a transaction from aborted to committed. I don't think any of this needs to change.

This is related to this section in the parallel commits RFC. It's also very similar to what we do with the txn AbortSpan and the poison option.

@nvanbenschoten
Copy link
Member

@andreimatei and I spent a while talking about this today. The conclusion we came to is that we shouldn't need the timestamp cache to store txn IDs at all in order to prevent aborted txns from becoming committed or to prevent committed txn replays from causing issues. Here are all of the states possible (I think):

  • txn aborted, txn record cleaned up immediately => must have been aborted by client which can't have also sent a commit req (see kv: make BeginTxn optional proposal #25437 (comment))
  • txn aborted, txn record cleaned up by GC => TxnSpanGCThreshold blocks recreation of txn record
  • txn aborted, txn record not cleaned up => must have been aborted by other txn. Txn record will prevent client from committing
  • txn committed, txn record cleaned up immediately => any writes will hit WriteTooOld errors and the txn won't be able to commit again*
  • txn committed, txn record not cleaned up => same story and also protected by TransactionCommittedStatusError check

*Unfortunately this isn't actually true because we allow transactions that hit WriteTooOld errors to commit without client buy-in here and here. I think this alone necessitates the timestamp cache check. @bdarnell I'm curious if you have any feelings about this optimization.

In the meantime, checking the timestamp cache for EndTxn requests instead of BeginTxn requests should provide the same protection, so it's not a blocker for this change.

@bdarnell
Copy link
Contributor

The conclusion we came to is that we shouldn't need the timestamp cache to store txn IDs at all in order to prevent aborted txns from becoming committed or to prevent committed txn replays from causing issues.

What's this talk about txn IDs in the timestamp cache? The timestamp cache already stores txn IDs (so that txns can overwrite values that they read at the same timestamp).

Unfortunately this isn't actually true because we allow transactions that hit WriteTooOld errors to commit without client buy-in here and here. I think this alone necessitates the timestamp cache check. @bdarnell I'm curious if you have any feelings about this optimization.

The former case (advance timestamp in EndTransaction if no refresh spans) makes a difference in ycsb, i think (it affects blind upserts that experience contention). I think the replica-level retry on WriteTooOldError is ancient and I'm skeptical that it provides much value (without this retry we stop laying down intents for a batch after the first WriteTooOldError; with it we stop after the second).

There are two ways that txn records get cleaned up by the GC operation: the GC queue (which sets TxnSpanGCThreshold) and intentResolver (which does not). The latter happens when an EndTransaction runs and can't delete the record immediately because it has external intents. I think this is generally equivalent to the "immediate" case since there is an on-disk record until it's done but it muddies the list of cases a bit.

@nvanbenschoten
Copy link
Member

What's this talk about txn IDs in the timestamp cache?

Sorry not txn IDs, txn keys in the write timestamp cache on EndTransactionRequests.

The former case (advance timestamp in EndTransaction if no refresh spans) makes a difference in ycsb, i think (it affects blind upserts that experience contention).

Yeah, it affects blind upserts that experience contention. Unfortunately ycsb performs UPDATEs which read first, so it doesn't take advantage of the optimisation.

I think the replica-level retry on WriteTooOldError is ancient and I'm skeptical that it provides much value (without this retry we stop laying down intents for a batch after the first WriteTooOldError; with it we stop after the second).

Continuing to lay down intents after the first WriteTooOldError is fine. The part that makes this dangerous is c6f9b92, which allows us to commit after doing so without client buy-in.

I think this is generally equivalent to the "immediate" case since there is an on-disk record until it's done but it muddies the list of cases a bit.

Yes, we should still be protected by WriteTooOld errors.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 7, 2019
Informs cockroachdb#25437.
Informs cockroachdb#32971.

This is the first part of addressing cockroachdb#32971. Part two will update
concurrent txns to not immediately abort missing txn records and
part three will update the txn client to stop sending BeginTxn
records.

The change closely resembles what was laid out in the corresponding
RFC sections. `HeartbeatTxn` and `EndTxn` are both updated to permit
the creation of transaction records when they finds that one is missing.

To prevent replays from causing issues, they *both* check the write
timestamp cache when creating a transaction record. This is one area
where the change diverges from the RFC. Instead of having `EndTxn`
always check the write timestamp cache, it only does so if it is
creating a txn record from scratch. To make this safe, `HeartbeatTxn`
also checks the write timestamp cache if it is creating a txn record
from scratch. This prevents a long running transaction from increasingly
running the risk of being aborted by a lease transfer as its `EndTxn`
continues to be delayed. Instead, a lease transfer will only abort a
transaction if it comes before the transaction record creation, which
should be within a second of the transaction starting.

The change pulls out a new `CanCreateTxnRecord` method, which has the
potential of being useful for detecting eagerly GCed transaction records
and solving the major problem from the RFC without an extra QueryIntent.
This is what Andrei was pushing for before. More details are included
in TODOs within the PR.

_### Migration Safety

Most of the changes to the transaction state machine are straightforward
to validate. Transaction records can still never be created without
checking for replays and only an EndTxn from the client's sync path
can GC an ABORTED txn record. This means that it is still impossible for
a transaction to move between the two finalized statuses (at some point
it would be worth it to draw this state machine).

The one tricky part is ensuring that the changes in this PR are safe
when run in mixed-version clusters. The two areas of concern are:
1. lease transfers between a new and an old cluster on a range that
should/does contain a transaction record.
2. an old txn client that is not expecting HeartbeatTxn and EndTxn
requests to create txn records if they are found to be missing.
3. an old txn client may not expect a HeartbeatTxn to hit the
write timestamp cache if run concurrently with an EndTxn.

The first concern is protected by the write timestamp cache. Regardless
of which replica creates that transaction record from a BeginTxn req or
a HeartbeatTxn req (on the new replica), it will first need to check
the timestamp cache. This prevents against any kind of replay that could
create a transaction record that the old replica is not prepared to
handle.

We can break the second concern into two parts. First, an old txn client
will never send an EndTxn without having sent a successful BeginTxn, so
the new behavior there is not an issue. Second, an old txn client may
send a HeartbeatTxn concurrently with a BeginTxn. If the heartbeat wins,
it will create a txn record on a new replica. If the BeginTxn evaluates
on a new replica, it will be a no-op. If it evaluates on an old replica,
it will result in a retryable error.

The third concern is specific to the implementation of the heartbeat loop
itself. If a HeartbeatTxn loses a race with an EndTxn, it may get an
aborted error after checking the timestamp cache. This is desirable from
the replica-side, but we'd need to ensure that the client will handle it
correctly. This PR adds some protection (see `sendingEndTxn`) to the txn
client to prevent this case from causing weirdness on the client, but I
don't think this could actually cause issues even without this protection.
The reason is that the txn client might mark itself as aborted due to the
heartbeat, but this will be overwritten when the EndTxn returns and the sql
client will still hear back from the successful EndTxn. This must have
actually always been an issue because it was always possible for a committed
txn record to be GCed and then written as aborted later, at which point a
concurrent heartbeat could have observed it.

I'd like Andrei to sign off on this last hazard, as he's thought about
this kind of thing more than anybody.

Release note: None
craig bot pushed a commit that referenced this issue Jan 7, 2019
33396: storage: allow HeartbeatTxn and EndTxn requests to create txn records r=nvanbenschoten a=nvanbenschoten

Informs #25437.
Informs #32971.

This is the first part of addressing #32971. Part two will update concurrent txns to not immediately abort missing txn records and part three will update the txn client to stop sending BeginTxn records.

The change closely resembles what was laid out in the corresponding RFC sections. `HeartbeatTxn` and `EndTxn` are both updated to permit the creation of transaction records when they finds that one is missing.

To prevent replays from causing issues, they *both* check the write timestamp cache when creating a transaction record. This is one area where the change diverges from the RFC. Instead of having `EndTxn` always check the write timestamp cache, it only does so if it is creating a txn record from scratch. To make this safe, `HeartbeatTxn` also checks the write timestamp cache if it is creating a txn record from scratch. This prevents a long-running transaction from increasingly running the risk of being aborted by a lease transfer as its `EndTxn` continues to be delayed. Instead, a lease transfer will only abort a transaction if it comes before the transaction record creation, which should be within a second of the transaction starting.

The change pulls out a new `CanCreateTxnRecord` method, which has the potential of being useful for detecting eagerly GCed transaction records and solving the major problem from the RFC without an extra QueryIntent. This is what Andrei was pushing for before. More details are included in TODOs within the PR.

### Migration Safety

Most of the changes to the transaction state machine are straightforward to validate. Transaction records can still never be created without checking for replays and only an EndTxn from the client's sync path can GC an ABORTED txn record. This means that it is still impossible for a transaction to move between the two finalized statuses (at some point it would be worth it to draw this state machine).

The one tricky part is ensuring that the changes in this PR are safe when run in mixed-version clusters. The two areas of concern are:
1. lease transfers between a new and an old cluster on a range that
should/does contain a transaction record.
2. an old txn client that is not expecting HeartbeatTxn and EndTxn requests to create txn records if they are found to be missing.
3. an old txn client may not expect a HeartbeatTxn to hit the write timestamp cache if run concurrently with an EndTxn.

The first concern is protected by the write timestamp cache. Regardless of which replica creates that transaction record from a BeginTxn req or a HeartbeatTxn req (on the new replica), it will first need to check the timestamp cache. This prevents against any kind of replay that could create a transaction record that the old replica is not prepared to handle.

We can break the second concern into two parts. First, an old txn client will never send an EndTxn without having sent a successful BeginTxn, so the new behavior there is not an issue. Second, an old txn client may send a HeartbeatTxn concurrently with a BeginTxn. If the heartbeat wins, it will create a txn record on a new replica. If the BeginTxn evaluates on a new replica, it will be a no-op. If it evaluates on an old replica, it will result in a retryable error.

The third concern is specific to the implementation of the heartbeat loop itself. If a HeartbeatTxn loses a race with an EndTxn, it may get an aborted error after checking the timestamp cache. This is  desirable from the replica-side, but we'd need to ensure that the client will handle it correctly. This PR adds some protection (see `sendingEndTxn`) to the txn client to prevent this case from causing weirdness on the client, but I don't think this could actually cause issues even without this protection.
The reason is that the txn client might mark itself as aborted due to the heartbeat, but this will be overwritten when the EndTxn returns and the sql client will still hear back from the successful EndTxn. This must have actually always been an issue because it was always possible for a committed txn record to be GCed and then written as aborted later, at which point a concurrent heartbeat could have observed it.

I'd like Andrei to sign off on this last hazard, as he's thought about this kind of thing more than anybody.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 8, 2019
Informs cockroachdb#25437.
Informs cockroachdb#32971.

This is the second part of addressing cockroachdb#32971. The final part will be updating
the txn client to stop sending BeginTxn requests.

The change closely resembles what is laid out in the corresponding RFC sections
(`PushTxn` and `QueryTxn / txnwait.Queue`). Both `PushTxn` requests and `QueryTxn`
requests are adjusted to tolerate missing transaction records and to synthesize
them based on the information pulled from intents if necessary. By hiding these
details behind the request API abstraction, we don't actually need to modify
the `txnwait.Queue` at all!

The change does diverge from the RFC in two distinct ways. The first is that it
introduces a new invariant about transaction record creation. Transaction can
now only ever be created by requests originating from their own coordinator
(`BeginTxn`, `HeartbeatTxn`, and `EndTxn`). They can never be created in any
state (not even ABORTED) by concurrent actors. This is actually a much stronger
invariant than what previously existed (and even what was proposed in the RFC),
and it simplifies a lot of logic by dramatically refining the transaction state
machine. We now know that for a transaction record to exist, it must have been
created by its own coordinator and it must have checked with `CanCreateTxnRecord`.

Making this new invariant work required the second major divergence from the
RFC. The RFC suggested that we use the read timestamp cache for successful
transaction timestamp pushes before a transaction record was written. This PR
takes this a step further and uses the write timestamp cache for successful
transaction aborts before a transaction record was written. In doing this, we
emerge with a very sane timestamp cache policy - the read timestamp cache is
used to push transaction timestamps and the write timestamp cache is used to
abort them entirely (which it was already essentially being used for). The txnID
marker on these timestamp cache entries then becomes the transaction that
initiated the push/abort. Transactions then consult both of these sources before
creating their transaction record in `CanCreateTxnRecord`. In doing this, we
avoid an entire class of situations where concurrent transactions created and
abandoned transaction records for another transaction, requiring eventual GC.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 9, 2019
Informs cockroachdb#25437.
Informs cockroachdb#32971.

This is the second part of addressing cockroachdb#32971. The final part will be updating
the txn client to stop sending BeginTxn requests.

The change closely resembles what is laid out in the corresponding RFC sections
(`PushTxn` and `QueryTxn / txnwait.Queue`). Both `PushTxn` requests and `QueryTxn`
requests are adjusted to tolerate missing transaction records and to synthesize
them based on the information pulled from intents if necessary. By hiding these
details behind the request API abstraction, we don't actually need to modify
the `txnwait.Queue` at all!

The change does diverge from the RFC in two distinct ways. The first is that it
introduces a new invariant about transaction record creation. Transaction can
now only ever be created by requests originating from their own coordinator
(`BeginTxn`, `HeartbeatTxn`, and `EndTxn`). They can never be created in any
state (not even ABORTED) by concurrent actors. This is actually a much stronger
invariant than what previously existed (and even what was proposed in the RFC),
and it simplifies a lot of logic by dramatically refining the transaction state
machine. We now know that for a transaction record to exist, it must have been
created by its own coordinator and it must have checked with `CanCreateTxnRecord`.

Making this new invariant work required the second major divergence from the
RFC. The RFC suggested that we use the read timestamp cache for successful
transaction timestamp pushes before a transaction record was written. This PR
takes this a step further and uses the write timestamp cache for successful
transaction aborts before a transaction record was written. In doing this, we
emerge with a very sane timestamp cache policy - the read timestamp cache is
used to push transaction timestamps and the write timestamp cache is used to
abort them entirely (which it was already essentially being used for). The txnID
marker on these timestamp cache entries then becomes the transaction that
initiated the push/abort. Transactions then consult both of these sources before
creating their transaction record in `CanCreateTxnRecord`. In doing this, we
avoid an entire class of situations where concurrent transactions created and
abandoned transaction records for another transaction, requiring eventual GC.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 9, 2019
Based on cockroachdb#33523.
Closes cockroachdb#25437.

This is the final part of addressing cockroachdb#32971.

The change gates the use of BeginTransaction on a cluster version.
When a cluster's version is sufficiently high, it will stop sending
them.

To make this work, a small change was needed for the detection of 1PC
transactions. We now use sequence numbers to determine that a batch
includes all of the writes in a transaction. This is in line with the
proposal from the RFC's `1PC Transactions` section and will work
correctly with parallel commits.

Release note (performance improvement): Reduce the network and storage
overhead of multi-Range transactions.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 10, 2019
Informs cockroachdb#25437.
Informs cockroachdb#32971.

This is the second part of addressing cockroachdb#32971. The final part will be updating
the txn client to stop sending BeginTxn requests.

The change closely resembles what is laid out in the corresponding RFC sections
(`PushTxn` and `QueryTxn / txnwait.Queue`). Both `PushTxn` requests and `QueryTxn`
requests are adjusted to tolerate missing transaction records and to synthesize
them based on the information pulled from intents if necessary. By hiding these
details behind the request API abstraction, we don't actually need to modify
the `txnwait.Queue` at all!

The change does diverge from the RFC in two distinct ways. The first is that it
introduces a new invariant about transaction record creation. Transaction can
now only ever be created by requests originating from their own coordinator
(`BeginTxn`, `HeartbeatTxn`, and `EndTxn`). They can never be created in any
state (not even ABORTED) by concurrent actors. This is actually a much stronger
invariant than what previously existed (and even what was proposed in the RFC),
and it simplifies a lot of logic by dramatically refining the transaction state
machine. We now know that for a transaction record to exist, it must have been
created by its own coordinator and it must have checked with `CanCreateTxnRecord`.

Making this new invariant work required the second major divergence from the
RFC. The RFC suggested that we use the read timestamp cache for successful
transaction timestamp pushes before a transaction record was written. This PR
takes this a step further and uses the write timestamp cache for successful
transaction aborts before a transaction record was written. In doing this, we
emerge with a very sane timestamp cache policy - the read timestamp cache is
used to push transaction timestamps and the write timestamp cache is used to
abort them entirely (which it was already essentially being used for). The txnID
marker on these timestamp cache entries then becomes the transaction that
initiated the push/abort. Transactions then consult both of these sources before
creating their transaction record in `CanCreateTxnRecord`. In doing this, we
avoid an entire class of situations where concurrent transactions created and
abandoned transaction records for another transaction, requiring eventual GC.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 10, 2019
Informs cockroachdb#25437.
Informs cockroachdb#32971.

This is the second part of addressing cockroachdb#32971. The final part will be updating
the txn client to stop sending BeginTxn requests.

The change closely resembles what is laid out in the corresponding RFC sections
(`PushTxn` and `QueryTxn / txnwait.Queue`). Both `PushTxn` requests and `QueryTxn`
requests are adjusted to tolerate missing transaction records and to synthesize
them based on the information pulled from intents if necessary. By hiding these
details behind the request API abstraction, we don't actually need to modify
the `txnwait.Queue` at all!

The change does diverge from the RFC in two distinct ways. The first is that it
introduces a new invariant about transaction record creation. Transaction can
now only ever be created by requests originating from their own coordinator
(`BeginTxn`, `HeartbeatTxn`, and `EndTxn`). They can never be created in any
state (not even ABORTED) by concurrent actors. This is actually a much stronger
invariant than what previously existed (and even what was proposed in the RFC),
and it simplifies a lot of logic by dramatically refining the transaction state
machine. We now know that for a transaction record to exist, it must have been
created by its own coordinator and it must have checked with `CanCreateTxnRecord`.

Making this new invariant work required the second major divergence from the
RFC. The RFC suggested that we use the read timestamp cache for successful
transaction timestamp pushes before a transaction record was written. This PR
takes this a step further and uses the write timestamp cache for successful
transaction aborts before a transaction record was written. In doing this, we
emerge with a very sane timestamp cache policy - the read timestamp cache is
used to push transaction timestamps and the write timestamp cache is used to
abort them entirely (which it was already essentially being used for). The txnID
marker on these timestamp cache entries then becomes the transaction that
initiated the push/abort. Transactions then consult both of these sources before
creating their transaction record in `CanCreateTxnRecord`. In doing this, we
avoid an entire class of situations where concurrent transactions created and
abandoned transaction records for another transaction, requiring eventual GC.

Release note: None
craig bot pushed a commit that referenced this issue Jan 10, 2019
33523: storage: tolerate missing transaction records when pushing r=nvanbenschoten a=nvanbenschoten

Informs #25437.
Informs #32971.

This is the second part of addressing #32971. The final part will be updating the txn client to stop sending BeginTxn requests.

The change closely resembles what is laid out in the corresponding RFC sections (`PushTxn` and `QueryTxn / txnwait.Queue`). Both `PushTxn` requests and `QueryTxn` requests are adjusted to tolerate missing transaction records and to synthesize them based on the information pulled from intents if necessary. By hiding these details behind the request API abstraction, we don't actually need to modify the `txnwait.Queue` at all!

The change does diverge from the RFC in two distinct ways. The first is that it introduces a new invariant about transaction record creation. Transaction records can now only ever be created by requests originating from their own coordinator (`BeginTxn`, `HeartbeatTxn`, and `EndTxn`). They can never be created in any state (not even ABORTED) by concurrent actors. This is a much stronger invariant than what previously existed (and even what was proposed in the RFC), and it simplifies a lot of logic by dramatically refining the transaction state machine. We now know that for a transaction record to exist, it must have been created by its own coordinator and it must have checked with `CanCreateTxnRecord`.

Making this new invariant work required the second major divergence from the RFC. The RFC suggested that we use the read timestamp cache for successful transaction timestamp pushes before a transaction record is written. This PR takes this a step further and uses the write timestamp cache for successful transaction aborts before a transaction record is written. In doing this, we emerge with a very sane timestamp cache policy - the read timestamp cache is used to push transaction timestamps and the write timestamp cache is used to abort them entirely (which it was already essentially being used for). The txnID marker on these timestamp cache entries becomes the transaction that initiated the push/abort. Transactions then consult both of these sources before creating their transaction record in `CanCreateTxnRecord`. In doing this, we avoid an entire class of situations where concurrent transactions created and abandoned transaction records for another transaction, requiring eventual GC.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 10, 2019
Based on cockroachdb#33523.
Closes cockroachdb#25437.

This is the final part of addressing cockroachdb#32971.

The change gates the use of BeginTransaction on a cluster version.
When a cluster's version is sufficiently high, it will stop sending
them.

To make this work, a small change was needed for the detection of 1PC
transactions. We now use sequence numbers to determine that a batch
includes all of the writes in a transaction. This is in line with the
proposal from the RFC's `1PC Transactions` section and will work
correctly with parallel commits.

Release note (performance improvement): Reduce the network and storage
overhead of multi-Range transactions.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 10, 2019
Based on cockroachdb#33523.
Closes cockroachdb#25437.

This is the final part of addressing cockroachdb#32971.

The change gates the use of BeginTransaction on a cluster version.
When a cluster's version is sufficiently high, it will stop sending
them.

To make this work, a small change was needed for the detection of 1PC
transactions. We now use sequence numbers to determine that a batch
includes all of the writes in a transaction. This is in line with the
proposal from the RFC's `1PC Transactions` section and will work
correctly with parallel commits.

Release note (performance improvement): Reduce the network and storage
overhead of multi-Range transactions.
craig bot pushed a commit that referenced this issue Jan 10, 2019
33566: kv: stop using BeginTransaction requests r=nvanbenschoten a=nvanbenschoten

Based on #33523.
Closes #25437.

This is the final part of addressing #32971.

The change gates the use of BeginTransaction on a cluster version. When a cluster's version is sufficiently high, it will stop sending them.

To make this work, a small change was needed for the detection of 1PC transactions. We now use sequence numbers to determine that a batch includes all of the writes in a transaction. This is in line with the proposal from the RFC's `1PC Transactions` section and will work correctly with parallel commits.

Release note (performance improvement): Reduce the network and storage
overhead of multi-Range transactions.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in #33566 Jan 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-client Relating to the KV client and the KV interface. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

No branches or pull requests

5 participants