
RFC - Durability and consensus in Vtorc #8975

Closed
17 of 22 tasks
GuptaManan100 opened this issue Oct 11, 2021 · 14 comments
Labels
Component: VTorc Vitess Orchestrator integration Type: RFC Request For Comment

Comments

@GuptaManan100
Member

GuptaManan100 commented Oct 11, 2021

This issue describes the feature for pluggable durability requirements and the consensus approach needed to fulfil them. It is also used to track the work completed and the work remaining before a GA release.

Feature Description

Vtorc should be used in Vitess instead of orchestrator, and it should support pluggable durability requirements that the user can specify according to their use case. For example, let A, B, C, D, E and F be Vitess replica-type servers for a single shard. The user should be able to specify, for each server, which servers it needs ACKs from in order to accept writes. They can also specify servers which are not capable of becoming primaries.
Example: let [A-(B,C), D-(E), F] be one such set of rules. Here A needs an ACK from either B or C to accept writes, D needs an ACK from E, and F can be a primary on its own.
Possible quorums inferred from these rules: ([A,B], [A,C], [D,E], [F])
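
To make the rule format concrete, here is a minimal Go sketch (illustrative only; durabilityRule and quorums are hypothetical names, not the actual VTOrc API) showing how the example rules could be represented and how the quorums above fall out of them:

```go
// Illustrative only: durabilityRule and quorums are not the actual VTOrc API.
package durability

// durabilityRule says which servers may ACK writes for a given candidate
// primary. AckersNeeded == 0 means the candidate needs no ACKs at all.
type durabilityRule struct {
	Candidate    string
	AckersNeeded int      // how many ACKs the candidate requires (0 or 1 in the example)
	Ackers       []string // servers allowed to send those ACKs
}

// quorums enumerates the (candidate, acker) sets that satisfy each rule,
// assuming AckersNeeded is at most 1, as in the example above.
func quorums(rules []durabilityRule) [][]string {
	var out [][]string
	for _, r := range rules {
		if r.AckersNeeded == 0 {
			out = append(out, []string{r.Candidate})
			continue
		}
		for _, a := range r.Ackers {
			out = append(out, []string{r.Candidate, a})
		}
	}
	return out
}

// The example rules [A-(B,C), D-(E), F]:
var exampleRules = []durabilityRule{
	{Candidate: "A", AckersNeeded: 1, Ackers: []string{"B", "C"}},
	{Candidate: "D", AckersNeeded: 1, Ackers: []string{"E"}},
	{Candidate: "F", AckersNeeded: 0},
}

// quorums(exampleRules) yields [[A B] [A C] [D E] [F]].
```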

Technical Details -

There are three steps that each consensus algorithm must follow, namely -

  1. Revocation - revoking access from the previous leader to accept writes.
  2. Propagation - propagate any transactions already accepted by the previous leader.
  3. Establishment - reach enough servers to establish leadership by fulfilling quorum rules.

We now outline each of these phases in greater detail.

Revocation

The first phase is to revoke access from the previous primary so that it can no longer accept writes. This can be done either by directly contacting the primary or by reaching enough of its replicas that it cannot accept writes. Since we will use MySQL semi-sync, if we remove enough ACKing replicas, the primary will block and not accept any writes.

The set of servers that must be contacted to revoke access is defined by the durability policy rules. For example, in the case described in the feature description, the possible server sets that need to be reached for revocation are ([A,E,F], [A,D,F], [B,C,E,F], [B,C,D,F]).
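
Continuing the hypothetical sketch from the feature description (a rough illustration under those assumptions, not VTOrc code), the revocation check reduces to: a set of reached servers revokes a candidate primary if it contains the candidate itself, or if it leaves the candidate with fewer potential ACKers than it needs.

```go
// Continues the hypothetical durability package above.

// revokes reports whether the set of reached servers blocks the given rule's
// candidate from accepting writes: either the candidate itself was reached
// (and can be demoted directly), or so many of its potential ACKers were
// reached that it can no longer gather the ACKs it needs.
func revokes(reached map[string]bool, r durabilityRule) bool {
	if reached[r.Candidate] {
		return true
	}
	remaining := 0
	for _, a := range r.Ackers {
		if !reached[a] {
			remaining++
		}
	}
	return remaining < r.AckersNeeded
}

// revokesAll checks whether the reached set blocks every possible primary.
// For the example rules, {A, E, F} and {B, C, D, F} both satisfy it.
func revokesAll(reached map[string]bool, rules []durabilityRule) bool {
	for _, r := range rules {
		if !revokes(reached, r) {
			return false
		}
	}
	return true
}
```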

Propagation

The next phase is propagation, where we need to find and propagate any transactions that the previous primary accepted, or might have accepted, whenever we cannot conclusively determine that they were rejected. There will be at most one GTID range which may have been accepted and needs to be propagated. This GTID range will be the one found on the servers during the revoke phase with the largest timestamp.

Here we discuss multiple failure scenarios to establish the rules that need to be followed. For all these cases we assume that the primary needs 2 acknowledgements to commit a transaction.

  1. Failure Scenario - The primary and one replica fail. There is a reachable replica which has some transactions that none of the other reachable replicas have. We cannot conclusively say whether these transactions were accepted or rejected, since there is one unreachable replica and we do not know whether it sent an ACK or not. Therefore the only safe way forward to guarantee correctness is to propagate these transactions in the newer primary's term, irrespective of whether they were accepted by the previous primary or not. The only guarantee we need to uphold is that any accepted transaction must be persistent and must not be lost; transactions that timed out may end up either accepted or rejected.

  2. Failure Scenario - Multiple primaries fail in sequence. Primary1 had uncommitted transactions. These were not discovered by primary2, which then also had uncommitted transactions. Primary3 discovers the transaction from primary1 but not the one from primary2, so it propagates it. At this point the transaction from primary1 has reached enough servers to be accepted. Primary3 fails and primary4 comes up. It discovers both primary1's and primary2's transactions, but because of the timestamp ordering it propagates the latter, which would be wrong. Raft solves this situation by propagating transactions from the previous primary in the newer term. We must do the same, i.e. we must propagate transactions as new GTIDs and not the old ones.

Final consolidated rules -

  1. If a value is discovered, assume that it is accepted.
  2. In case multiple uncommitted values are discovered, propagate the latest.
  3. While propagating these transactions, use new GTIDs.
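
As a rough illustration of these three rules (hypothetical types; the timestamp stands in for whatever ordering information the revoke phase collects):

```go
// Continues the hypothetical sketch above; time is the only import needed.
import "time"

// discoveredTxns is a stand-in for a GTID range found on some replica during
// the revoke phase, with the timestamp of its latest transaction.
type discoveredTxns struct {
	GTIDSet   string
	Timestamp time.Time
}

// pickForPropagation applies rules 1 and 2: every discovered range is assumed
// accepted, and only the one with the largest timestamp is kept. Rule 3 is
// applied at replay time: the chosen transactions are re-executed on the new
// primary so that they receive new GTIDs in its term, rather than being
// copied over with their old GTIDs.
func pickForPropagation(found []discoveredTxns) (latest discoveredTxns, ok bool) {
	if len(found) == 0 {
		return discoveredTxns{}, false
	}
	latest = found[0]
	for _, d := range found[1:] {
		if d.Timestamp.After(latest.Timestamp) {
			latest = d
		}
	}
	return latest, true
}
```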

Establishment

This is the final step. We only need to reach enough tablets such that a quorum can be formed.
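
Reusing the hypothetical quorums() helper from the earlier sketch, the establishment check is simply whether the reachable tablets contain at least one complete quorum:

```go
// Continues the hypothetical durability package above.

// canEstablish reports whether leadership can be established: the reachable
// tablets must contain at least one complete quorum from the rules.
func canEstablish(reached map[string]bool, rules []durabilityRule) bool {
	for _, q := range quorums(rules) {
		satisfied := true
		for _, server := range q {
			if !reached[server] {
				satisfied = false
				break
			}
		}
		if satisfied {
			return true
		}
	}
	return false
}
```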

Other considerations -

Lock Shard

Currently, the shard lock is used to prevent conflicting failovers from contending with each other. However, this creates a dependency on the topo server being up. Eventually we want to use the topo server only for discovery and not rely on an external system for locking.

Changing configuration rules during runtime

Changing configuration rules during runtime will become a consensus problem of its own since some nodes would have received the new updates and some would not have.

Bringing up new vttablets to join a cluster

Bringing up a new vttablet to join a pre-existing cluster will require changing the durability rules to account for this tablet. Also, a tablet cannot join the cluster during a failover; otherwise it could end up replicating from the failed primary and sending ACKs, causing some transactions to be committed that would then be lost. This would require us to lock the shard every time a new vttablet is added to a cluster, leading to heavy contention on the shard lock.

Rejoining of a previously failed primary

It is not always possible for a failed primary to rejoin a cluster, due to errant GTIDs. There are two solutions here.

  1. Prevent the primary from entering the cluster again.
  2. Add errant GTID detection code to tablet startup (see the sketch below). Here we might also be able to rewind some transactions using the GitHub rewind tool.
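
For illustration, here is a simplified sketch of the kind of check tablet startup could run, using MySQL's GTID_SUBTRACT(): any GTIDs executed on the rejoining server but absent elsewhere are errant. A real check would need to compare against every server that could have ACKed, not just the current primary; the connection details and helper below are assumptions, not the Vitess implementation.

```go
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql"
)

// errantGTIDs returns the GTIDs present on the rejoining replica but not on
// the primary, using MySQL's GTID_SUBTRACT(). An empty result means the
// replica can rejoin safely; a non-empty one means rewind or rebuild.
func errantGTIDs(db *sql.DB, replicaGTIDs, primaryGTIDs string) (string, error) {
	var diff string
	err := db.QueryRow("SELECT GTID_SUBTRACT(?, ?)", replicaGTIDs, primaryGTIDs).Scan(&diff)
	return diff, err
}

func main() {
	// Hypothetical DSN; any MySQL connection works for evaluating GTID_SUBTRACT.
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	replica := "3E11FA47-71CA-11E1-9E33-C80AA9429562:1-10"
	primary := "3E11FA47-71CA-11E1-9E33-C80AA9429562:1-8"
	diff, err := errantGTIDs(db, replica, primary)
	if err != nil {
		panic(err)
	}
	fmt.Println("errant GTIDs:", diff) // e.g. 3e11fa47-...:9-10
}
```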

TODO list

  • Add vtorc UI into vtadmin and then remove martini code.
  • Reconcile ERS in vtorc with that of wrangler.
  • Fix the ERS bug which allowed even rdonly tablets to become the primary if they were the most advanced.
  • Reconcile PRS in vtorc with that of wrangler.
  • Reconcile InitPrimary in vtorc with that of wrangler.
  • Post-ERS cleanup - rewind usage or some other way to bring the failed node back. Still needs discussion.
  • Get rid of the markerFile in replication.
  • Port over the durability from vtorc to vtctl as well.
  • Add metrics to estimate the chances of success for a given node to become primary: the number of servers needed to reach quorum and how many are available, and the amount of time needed for the node to catch up to the previous primary's transactions.
  • Add unit and end to end tests for this functionality checking all the discussed failover scenarios.
  • Find whether MySQL will fail if we remove some part of the binlog file
  • Introduce Durability in vtctl, vtctld and use them to pass semiSync information
  • Durability policies should be used for fixing semi sync in vttablets and the corresponding flag in vttablet should be deprecated.
  • ERS should not stop in case of more than one node failure and should use the durability policies to decide whether it can succeed or not.
  • Errant GTID detection code should use the durability policies.
  • Errant GTID detection on tablet startup.
  • Refresh tablet information after locking the shard and use it to determine whether a failover is even required.
  • Reparenting error handling improvements. #2281
  • VTOrc has to lock shard before all operations and refresh local information (from topo server) before deciding to run the operation
  • VTOrc should only fix 3 types of tablets (PRIMARY, REPLICA, RDONLY)
  • Snapshot keyspace shouldn't be considered for vtorc
  • Use durability policy from topo server and remove Durability configuration

References

  1. Paxos - https://lamport.azurewebsites.net/pubs/lamport-paxos.pdf and https://lamport.azurewebsites.net/pubs/paxos-simple.pdf. These papers describe the original Paxos algorithm for distributed consensus.
  2. Flexible Paxos - https://arxiv.org/pdf/1608.06696.pdf. This paper loosens the majority-quorum requirement of Paxos and introduces flexible intersection rules.
  3. Raft - https://raft.github.io/raft.pdf. This paper describes the Raft algorithm for distributed consensus.
  4. Paxos vs Raft - https://arxiv.org/pdf/2004.05074.pdf. This paper contrasts the two algorithms and highlights their differences in order to bridge the gap between them.
  5. General Consensus Framework - https://arxiv.org/pdf/1902.06776.pdf. This paper introduces a general framework; any algorithm that falls under it is proven correct as a distributed consensus algorithm. Raft and Paxos are both shown to fall under this framework.
  6. FLP Theorem (Optional) - http://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf. This paper describes the FLP theorem, which states that no consensus system that does not use time can guarantee forward progress in the face of even a single faulty process.
  7. Semi-Sync - https://www.percona.com/community-blog/2018/08/23/question-about-semi-synchronous-replication-answer-with-all-the-details/ and https://dev.mysql.com/doc/refman/5.7/en/replication-semisync.html describe how semi-sync and lossless semi-sync work in MySQL, along with their issues.
  8. GH rewind - http://code.openark.org/blog/mysql/un-split-brain-mysql-via-gh-mysql-rewind is a GitHub-developed tool for rewinding and reverting non-DDL transactions to repair split-brained servers and put them back into replication.
  9. Consensus at scale - https://planetscale.com/blog/blog-series-consensus-algorithms-at-scale-part-2 goes into detail on the consensus rules.

cc @deepthi @sougou @harshit-gangal @shlomi-noach @rafael @ajm188

@GuptaManan100 GuptaManan100 added Type: RFC Request For Comment Component: VTorc Vitess Orchestrator integration labels Oct 11, 2021
@GuptaManan100 GuptaManan100 self-assigned this Oct 11, 2021
@GuptaManan100
Member Author

#8492 reconciles ERS, fixes its outstanding bug and ports the durability rules to a common place.

@shlomi-noach
Contributor

Nice writeup! I'm so glad you're looking into it, @GuptaManan100 . Some initial thoughts and comments.

For example, let A, B, C, D, E and F be Vitess servers for a single shard.

While instructive for getting the general idea, servers should not be addressed by name, but rather by role. It's important not to mislead the user into thinking about specific servers (aka pets), but to instead let the user allocate servers (cattle) into slots. "These are my serving replicas", "these are my backup servers", etc.

That way, the rule [A-(B,C), D-(E), F] could be reimagined as "failover is allowed cross-DC for non-BACKUP tablets"


Sorry if I'm nitpicking on the [A-(B,C), D-(E), F] example; it's a bit confusing. Are all these tablets part of the same shard? Is D an intermediate primary? Do we really need a semi-sync ACK for an intermediate primary? I'm not exactly following the logic here.


There will be at most one GTID set which had been accepted and needs to be propagated. This GTID set will be the one that is found from the servers during the revoke phase with the largest timestamp.

I'm a bit confused here, and it's entirely plausible that my mind is off. What's that about a single GTID and how does it need to be propagated? If our implementation is with semi-sync, I'd assume all GTIDs are accounted for in some semi-sync acking replica.


Propagation:

For all these cases we assume that the primary needs 2 acknowledgements to commit a transaction.

I don't fully understand the examples, but in any case none of them relate to this comment of the primary having 2 acking nodes. Could you please re-illustrate the examples with actual numbers?


We cannot conclusively say whether the transaction was accepted or rejected since some replicas are unreachable and we do not know whether they sent the ACKs or not. Therefore the only safe way forward to guarantee correctness is to propagate these transactions in the newer primary's term irrespective of whether they were accepted by the previous primary or not.

Did you mean that all the ack-ing tablets are also down? Let's get back to the numbers. If the primary requires 2 acking replicas, and 2 of the acking replicas are down, then we're in the unknown. Otherwise we do know that there's at least one acking replica that's got the latest. So the scenario you're depicting needs to be refined; there are actually two or three different scenarios here that have different consequences.


Failure Scenario - Multiple primaries fail. Primary1 had uncommitted transactions. These were not discovered by primary2 which also then had uncommitted transactions. Primary3 discovers the transaction from primary1 but not from primary2

I'm again confused. What does it mean to have multiple primaries in a shard?


In case multiple uncommitted values are discovered,

What are uncommitted values? I'm not sure I understand; where do you find them and how can you tell they're uncommitted? If they're on a replica, how could they not be committed?


Lock Shard

In my opinion it's better if one vtorc node assumes leadership. Leadership changes should take place via topo, and that means there's still reliance on topo, but only rarely so.

A single node being the leader means the node knows that it just recovered another failover 5 seconds ago; this can mitigate cascading failures and flapping, as was intended by orchestrator's original design.


Bringing up a new vttablet to join a pre-existing cluster will require changing the durability rules to account for this tablet.

If rules are set by roles (cattle) rather than by servers (pets), then bringing in a new tablet does not require changing the rules.

@GuptaManan100
Member Author

@shlomi-noach
Sorry for the confusing nomenclature, but all the servers A-F are ordinary replica servers residing in the same shard. I just wanted to show an example of a durability policy where each server can have different ACKing servers (maybe because some are in different cells). Whatever the case, a durability policy like [A-(B,C), D-(E), F] can be specified, which translates to: A requires an ACK from B or C, D requires an ACK from E, and F does not need ACKs from anyone. We aren't talking about intermediate primaries or anything, just that some servers might have different rules than the others.

I'm a bit confused here, and it's entirely plausible that my mind is off. What's that about a single GTID and how does it need to be propagated? If our implementation is with semi-sync, I'd assume all GTIDs are accounted for in some semi-sync acking replica.

What I mean here is that there is at most one GTID transaction set which needs to be propagated to all the servers, because it was found on one of the replicas, so the new primary must acknowledge it. If there is more than one such GTID set, then only the latest one needs to be propagated. This is because the newer transactions would only have executed after the propagation phase of the previous primary, and since the older transactions weren't propagated then, they would not have been discovered, which means that they weren't accepted and need not be propagated now.

Propagation
I have updated the wording of the first scenario to make it more understandable. The reason I have specified that 2 ACKs are required for a primary is specifically to reach a scenario where a transaction is present on one replica but not yet accepted by the primary (it is still blocked waiting for the second ACK). This is what I mean by uncommitted transactions.

Multiple primary failures mean multiple sequential primary failures, i.e. the first primary failed, someone else got elected but failed soon after too, and so on.

If rules are set by roles (cattle) rather than by servers (pets), then bringing a new tablet does not require changing the rules.

Yes, you are right in this respect, but the remaining point about the tablet not being able to join a cluster while a failover is going on still stands. Also, the way the rules are specified would be up to the user, as long as the durability interface we create is used. They could have rules for specific servers (pets) with a catch-all at the end for roles (cattle) too. At least that is what I did in our test suite.

I hope this clarifies your doubts. Please let me know if some wording needs to change to make the write-up more readable. Thank you 💕

@shlomi-noach
Contributor

We aren't talking about intermediate primaries or anything, just that some servers might have different rules than the others.

Gotcha!

What I mean here is that there is at most one GTID transaction set which needs to be propagated to all the servers

I admit I still don't follow. This must entirely be my absentmindedness; let's take this offline.

This is what I mean by uncommitted transactions.

Got it. What I'm thinking is, since we're talking about a failure scenario, then the primary is anyway unreachable and is in an unknown state. Whether it has or doesn't have uncommitted transactions is immaterial at time of failover, the way I see it.

Multiple primary failures mean, multiple sequential primary failures i.e. the first primary failed, someone else got elected but they failed soon after too, and so on.

Ah, got it! Thanks for clarifying. I'm unsure about the scenario depicted, then. We should work towards a transitive logic; the rules for promoting primary2 over primary1, and then primary3 over primary2, should be transitive; our own logic should not allow a scenario where somehow primary5 suddenly finds an extra transaction unknown to primary[234].

They could have rules for specific servers (pets) with a catch all at the end for roles (cattle) too. At least that is what i did in our test suite.

Agreed, users do have pets; I think we should look at pets as the exception, and have a cattle-first approach.

Thank you and keep up the awesome work!

@GuptaManan100
Member Author

Got it. What I'm thinking is, since we're talking about a failure scenario, then the primary is anyway unreachable and is in an unknown state. Whether it has or doesn't have uncommitted transactions is immaterial at time of failover, the way I see it.

Not necessarily. It could happen with the example servers I have described: let A be the current primary. Both B and C fail, and therefore A is unable to accept writes. Even in this case we would have to run an ERS, but the primary would be reachable and healthy.

primary5 suddenly finds an extra transaction unknown to primary[234]

This could still happen if one of the replicas got network-partitioned and then came back up. In these cases all the extra transactions that this server has should be considered errant, and either they should be rolled back or the server has to be thrown away.

@shlomi-noach
Contributor

Even in this case we would have to run an ERS, but the primary would be reachable and healthy.

In this scenario we can re-purpose another replica in place of the semi-sync acking replica, to ack writes from the primary, and without having to change the primary. What do you think?

In these cases all the extra transactions that this server has should be considered errant...

Right. But at that point none of the new primaries will have been replicating from it. I agree that the replica should then be either discarded or somehow rolled back.

@GuptaManan100
Member Author

GuptaManan100 commented Oct 11, 2021

In this scenario we can re-purpose another replica in place of the semi-sync acking replica, to ack writes from the primary, and without having to change the primary. What do you think?

Well, that is something the user has already specified in their rules, and we have run out of servers that can ACK requests from the current primary. But there are other sets of servers that can function properly, so we would need to fail over to them.

@shlomi-noach
Contributor

and we run out of the servers which can ACK requests from the current primary. But there are other sets of servers which can function properly so would need to failover to them

Right. So the idea of a server that can be promoted as a PRIMARY, but is not good enough to serve as a REPLICA is something I don't find very reasonable, just my 2c.

@GuptaManan100
Member Author

Well, this situation arises when the user wants cross-cell or cross-availability-zone semi-sync ACKs, so a server is capable of becoming a primary but cannot send ACKs to some other servers which can become primaries too. The point of all of this is just that our code should not assume anything when it comes to durability policies. It is up to the user to decide what they want depending on their requirements.

@shlomi-noach
Contributor

Makes perfect sense.

@GuptaManan100
Member Author

GuptaManan100 commented Dec 30, 2021

Hello,
After the work that has been done so far to reconcile PRS and ERS with vtorc, we have succeeded in making them use the new durability policies to choose the new primary. The next thing to tackle is to also use the durability policies to determine whether each tablet should enable semi-sync or not, and to use this information to fix the semi-sync.
Right now we do not use the durability policies for this purpose, and each tablet is responsible for knowing whether it should enable semi-sync or not. This is done by the enable_semi_sync flag on the vttablets.

If we want to use the durability policies, then the function fixSemiSync will no longer need to rely on the flag and will instead take an additional argument. This is all well and good, since we call this function largely from RPCs issued by vtctl or vtorc, both of which have the durability policy information.
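
For illustration only, a hypothetical sketch of the shape this change could take; the names below mirror the discussion rather than the actual Vitess code. The caller (vtctl or vtorc, which already knows the durability policy) passes its decision down instead of the tablet consulting its own flag:

```go
// Hypothetical sketch: names are illustrative, not the actual Vitess API.
package tabletmanager

// SemiSyncAction carries the caller's decision, derived from the durability
// policy that vtctl/vtorc already hold.
type SemiSyncAction int

const (
	SemiSyncActionUnset SemiSyncAction = iota // no decision passed; keep legacy flag-driven behaviour
	SemiSyncActionSet                         // durability policy says this tablet must ACK
	SemiSyncActionNone                        // durability policy says semi-sync stays off
)

// fixSemiSync applies the caller's decision rather than reading a
// tablet-local enable_semi_sync flag. The enable callback stands in for
// whatever actually toggles the semi-sync plugins on the tablet.
func fixSemiSync(action SemiSyncAction, enable func(on bool) error) error {
	switch action {
	case SemiSyncActionSet:
		return enable(true)
	case SemiSyncActionNone:
		return enable(false)
	}
	return nil // Unset: legacy path during the migration
}
```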

The trouble arises from the initialisation phase of a vttablet, which tries to RestoreData, which also calls fixSemiSync. This isn't called from any RPC but on tablet startup, so there is no way to know the durability policies unless they are also passed to the tablets. But this doesn't seem like a great approach, since the durability policy configuration would now be stored in three Vitess components - vttablet, vtorc and vtctl. It would also need to be consistent across all three, which would be hard to enforce. Moreover, this poses a problem in updating the durability policies...

An alternative solution is to store the durability policies in the topo server, and all three components can cache the required information locally. In the initial implementation the durability policies won't change, so the cached information wouldn't have to be refreshed after reading it once. But we could then implement something similar to shard_sync, which alerts all subscribers to changes in the durability policies, and we could extend our framework to allow changing durability policies at runtime as well. The downside of this approach is that we add an additional layer of dependency on the topo server, and it would be harder to remove it from the hot path of failovers, since the durability rules would live there.

Edit - One more difficulty with the second approach: the durability policy is an interface and not basic data like ints or structs. Can we send an interface implementation on the wire and cache it?

@sougou @deepthi @harshit-gangal WDYT?

@sougou
Contributor

sougou commented Jan 4, 2022

I'm thinking we should keep the implementation initially simple. This means that each component will require the flag and will act according to the flag, and will assume that the flag is set uniformly in all other components.

In the future, once vtorc becomes mandatory, vttablet should stop fixing semi-sync. At that time, we can deprecate the flag. As for vtctld vs vtorc flags, they're sibling components. The same problem exists between two independent versions of vtorc. So, each instance should assume that flags are set the same in other instances.

The unified topo approach is a higher level of sophistication, and I don't think we're there yet. There are many more important vars that would be better off if stored in topo. For example, the init_db_name for vttablets. So, I see this as a bigger system cleanup.

@GuptaManan100
Member Author

As per our discussion -

  1. We should go ahead and add the additional parameter to all the RPCs which call fixSemiSync. This parameter will be set from the vtctl side, which already has the durability policy information. We will mark the enable-semi-sync flag for deprecation once this is implemented. It will be a breaking change, and users will have the responsibility of configuring vtctl correctly during the upgrade process. This means that both the extra parameter from the RPC and enable-semi-sync must agree during the upgrade process. Users who have semi-sync set on all vttablets should use the semi_sync durability policy, and the others should use none.

  2. In the propagation phase, we will continue to use the same GTIDs as before instead of introducing new ones for now. This could potentially create problem scenarios, but we will fix them later.

  3. We should explore SHOW PROCESSLIST as an alternative to using reparent_journal as a method of knowing whether replication is set up correctly.

@frouioui frouioui added this to v16.0.0 Dec 6, 2022
@frouioui frouioui moved this to Backlog in v16.0.0 Dec 6, 2022
@frouioui frouioui moved this from Backlog to In Progress in v16.0.0 Dec 6, 2022
@frouioui frouioui added this to v17.0.0 Feb 8, 2023
@frouioui frouioui moved this to In Progress in v17.0.0 Feb 8, 2023
@frouioui frouioui removed this from v16.0.0 Feb 8, 2023
@frouioui frouioui added this to v18.0.0 Jun 30, 2023
@frouioui frouioui moved this to In Progress in v18.0.0 Jun 30, 2023
@frouioui frouioui removed this from v17.0.0 Jun 30, 2023
@GuptaManan100
Member Author

A new issue with suggested improvements has been created at #14284. This issue is thus being closed.
