
RFC - Durability and consensus in Vtorc #8975

Closed
17 of 22 tasks
GuptaManan100 opened this issue Oct 11, 2021 · 14 comments
Labels
Component: VTorc Vitess Orchestrator integration Type: RFC Request For Comment

Comments

@GuptaManan100
Member

GuptaManan100 commented Oct 11, 2021

This issue describes the feature for pluggable durability requirements and the consensus approach needed to fulfil them. It is also used to track the work completed and the work remaining before a GA release.

Feature Description

Vtorc should be used in Vitess instead of orchestrator, and it should support pluggable durability requirements that the user can specify according to their use case. For example, let A, B, C, D, E and F be Vitess replica-type servers for a single shard. The user should be able to specify, for each server, which servers it needs ACKs from in order to accept writes. They can also specify servers which are not capable of becoming primaries.
Example: let [A-(B,C), D-(E), F] be one such set of rules. Here A needs an ACK from either B or C to accept writes, D needs an ACK from E, and F can be a primary on its own.
Possible quorums inferred from these rules: ([A,B], [A,C], [D,E], [F])
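
To make the rule format concrete, here is a minimal Go sketch (illustrative only; durabilityRule and quorums are hypothetical names, not the actual VTOrc API) showing how the example rules could be represented and how the quorums above fall out of them:

```go
// Illustrative only: durabilityRule and quorums are not the actual VTOrc API.
package durability

// durabilityRule says which servers may ACK writes for a given candidate
// primary. AckersNeeded == 0 means the candidate needs no ACKs at all.
type durabilityRule struct {
	Candidate    string
	AckersNeeded int      // how many ACKs the candidate requires (0 or 1 in the example)
	Ackers       []string // servers allowed to send those ACKs
}

// quorums enumerates the (candidate, acker) sets that satisfy each rule,
// assuming AckersNeeded is at most 1, as in the example above.
func quorums(rules []durabilityRule) [][]string {
	var out [][]string
	for _, r := range rules {
		if r.AckersNeeded == 0 {
			out = append(out, []string{r.Candidate})
			continue
		}
		for _, a := range r.Ackers {
			out = append(out, []string{r.Candidate, a})
		}
	}
	return out
}

// The example rules [A-(B,C), D-(E), F]:
var exampleRules = []durabilityRule{
	{Candidate: "A", AckersNeeded: 1, Ackers: []string{"B", "C"}},
	{Candidate: "D", AckersNeeded: 1, Ackers: []string{"E"}},
	{Candidate: "F", AckersNeeded: 0},
}

// quorums(exampleRules) yields [[A B] [A C] [D E] [F]].
```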

Technical Details -

There are three steps that each consensus algorithm must follow, namely -

  1. Revocation - revoking access from the previous leader to accept writes.
  2. Propagation - propagate any transactions already accepted by the previous leader.
  3. Establishment - reach enough servers to establish leadership by fulfilling quorum rules.

We now outline each of these phases in greater detail.

Revocation

The first phase is to revoke access from the previous primary so that it can no longer accept writes. This can be done either by directly contacting the primary or by reaching enough of its replicas that it cannot accept writes. Since we will use MySQL semi-sync, if we remove enough ACKing replicas, the primary will block and not accept any writes.

The set of servers that must be contacted to revoke access is defined by the durability policy rules. For example, in the case described in the feature description, the possible server sets that need to be reached for revocation are ([A,E,F], [A,D,F], [B,C,E,F], [B,C,D,F]).
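
Continuing the hypothetical sketch from the feature description (a rough illustration under those assumptions, not VTOrc code), the revocation check reduces to: a set of reached servers revokes a candidate primary if it contains the candidate itself, or if it leaves the candidate with fewer potential ACKers than it needs.

```go
// Continues the hypothetical durability package above.

// revokes reports whether the set of reached servers blocks the given rule's
// candidate from accepting writes: either the candidate itself was reached
// (and can be demoted directly), or so many of its potential ACKers were
// reached that it can no longer gather the ACKs it needs.
func revokes(reached map[string]bool, r durabilityRule) bool {
	if reached[r.Candidate] {
		return true
	}
	remaining := 0
	for _, a := range r.Ackers {
		if !reached[a] {
			remaining++
		}
	}
	return remaining < r.AckersNeeded
}

// revokesAll checks whether the reached set blocks every possible primary.
// For the example rules, {A, E, F} and {B, C, D, F} both satisfy it.
func revokesAll(reached map[string]bool, rules []durabilityRule) bool {
	for _, r := range rules {
		if !revokes(reached, r) {
			return false
		}
	}
	return true
}
```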

Propagation

The next phase is propagation, where we need to find and propagate any transactions that the previous primary accepted, or might have accepted, whenever we cannot conclusively determine that they were rejected. There will be at most one GTID range which may have been accepted and needs to be propagated. This GTID range will be the one found on the servers during the revoke phase with the largest timestamp.

Here we discuss multiple failure scenarios to establish the rules that need to be followed. For all these cases we assume that the primary needs 2 acknowledgements to commit a transaction.

  1. Failure Scenario - The primary and one replica fail. There is a reachable replica which has some transactions that none of the other reachable replicas have. We cannot conclusively say whether these transactions were accepted or rejected, since there is one unreachable replica and we do not know whether it sent an ACK or not. Therefore the only safe way forward to guarantee correctness is to propagate these transactions in the newer primary's term, irrespective of whether they were accepted by the previous primary or not. The only guarantee we need to uphold is that any accepted transaction must be persistent and must not be lost; transactions that timed out may end up either accepted or rejected.

  2. Failure Scenario - Multiple primaries fail in sequence. Primary1 had uncommitted transactions. These were not discovered by primary2, which then also had uncommitted transactions. Primary3 discovers the transaction from primary1 but not the one from primary2, so it propagates it. At this point the transaction from primary1 has reached enough servers to be accepted. Primary3 fails and primary4 comes up. It discovers both primary1's and primary2's transactions, but because of the timestamp ordering it propagates the latter, which would be wrong. Raft solves this situation by propagating transactions from the previous primary in the newer term. We must do the same, i.e. we must propagate transactions as new GTIDs and not the old ones.

Final consolidated rules -

  1. If a value is discovered, assume that it is accepted.
  2. In case multiple uncommitted values are discovered, propagate the latest.
  3. While propagating these transactions, use new GTIDs.
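
As a rough illustration of these three rules (hypothetical types; the timestamp stands in for whatever ordering information the revoke phase collects):

```go
// Continues the hypothetical sketch above; time is the only import needed.
import "time"

// discoveredTxns is a stand-in for a GTID range found on some replica during
// the revoke phase, with the timestamp of its latest transaction.
type discoveredTxns struct {
	GTIDSet   string
	Timestamp time.Time
}

// pickForPropagation applies rules 1 and 2: every discovered range is assumed
// accepted, and only the one with the largest timestamp is kept. Rule 3 is
// applied at replay time: the chosen transactions are re-executed on the new
// primary so that they receive new GTIDs in its term, rather than being
// copied over with their old GTIDs.
func pickForPropagation(found []discoveredTxns) (latest discoveredTxns, ok bool) {
	if len(found) == 0 {
		return discoveredTxns{}, false
	}
	latest = found[0]
	for _, d := range found[1:] {
		if d.Timestamp.After(latest.Timestamp) {
			latest = d
		}
	}
	return latest, true
}
```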

Establishment

This is the final step. We only need to reach enough tablets such that a quorum can be formed.
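
Reusing the hypothetical quorums() helper from the earlier sketch, the establishment check is simply whether the reachable tablets contain at least one complete quorum:

```go
// Continues the hypothetical durability package above.

// canEstablish reports whether leadership can be established: the reachable
// tablets must contain at least one complete quorum from the rules.
func canEstablish(reached map[string]bool, rules []durabilityRule) bool {
	for _, q := range quorums(rules) {
		satisfied := true
		for _, server := range q {
			if !reached[server] {
				satisfied = false
				break
			}
		}
		if satisfied {
			return true
		}
	}
	return false
}
```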

Other considerations -

Lock Shard

Currently, the shard lock is used to prevent conflicting failovers from contending with each other. However, this creates a dependency on the topo server being up. Eventually we want to use the topo server only for discovery and not rely on an external system for locking.

Changing configuration rules during runtime

Changing configuration rules during runtime will become a consensus problem of its own since some nodes would have received the new updates and some would not have.

Bringing up new vttablets to join a cluster

Bringing up a new vttablet to join a pre-existing cluster will require changing the durability rules to account for this tablet. Also, a tablet cannot join the cluster during a failover; otherwise it could end up replicating from the failed primary and sending ACKs, causing some transactions to be committed that would then be lost. This would require us to lock the shard every time a new vttablet is added to a cluster, leading to heavy contention on the shard lock.

Rejoining of a previously failed primary

It is not always possible for a failed primary to rejoin a cluster, due to errant GTIDs. There are two solutions here.

  1. Prevent the primary from entering the cluster again.
  2. Add errant GTID detection code to tablet startup (see the sketch below). Here we might also be able to rewind some transactions using the GitHub rewind tool.
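
For illustration, here is a simplified sketch of the kind of check tablet startup could run, using MySQL's GTID_SUBTRACT(): any GTIDs executed on the rejoining server but absent elsewhere are errant. A real check would need to compare against every server that could have ACKed, not just the current primary; the connection details and helper below are assumptions, not the Vitess implementation.

```go
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql"
)

// errantGTIDs returns the GTIDs present on the rejoining replica but not on
// the primary, using MySQL's GTID_SUBTRACT(). An empty result means the
// replica can rejoin safely; a non-empty one means rewind or rebuild.
func errantGTIDs(db *sql.DB, replicaGTIDs, primaryGTIDs string) (string, error) {
	var diff string
	err := db.QueryRow("SELECT GTID_SUBTRACT(?, ?)", replicaGTIDs, primaryGTIDs).Scan(&diff)
	return diff, err
}

func main() {
	// Hypothetical DSN; any MySQL connection works for evaluating GTID_SUBTRACT.
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	replica := "3E11FA47-71CA-11E1-9E33-C80AA9429562:1-10"
	primary := "3E11FA47-71CA-11E1-9E33-C80AA9429562:1-8"
	diff, err := errantGTIDs(db, replica, primary)
	if err != nil {
		panic(err)
	}
	fmt.Println("errant GTIDs:", diff) // e.g. 3e11fa47-...:9-10
}
```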

TODO list

  • Add vtorc UI into vtadmin and then remove martini code.
  • Reconcile ERS in vtorc with that of wrangler.
  • Fix the ERS bug which allowed even rdonly tablets to become the primary if they were the most advanced.
  • Reconcile PRS in vtorc with that of wrangler.
  • Reconcile InitPrimary in vtorc with that of wrangler.
  • Post-ERS cleanup - rewind usage or some other way to bring the failed node back. Still needs discussion.
  • Get rid of the markerFile in replication.
  • Port over the durability from vtorc to vtctl as well.
  • Add metrics to estimate the chances of success for a given node to become primary: the number of servers needed to reach quorum and how many are available, and the amount of time needed for the node to catch up to the previous primary's transactions.
  • Add unit and end to end tests for this functionality checking all the discussed failover scenarios.
  • Find whether MySQL will fail if we remove some part of the binlog file
  • Introduce Durability in vtctl, vtctld and use them to pass semiSync information
  • Durability policies should be used for fixing semi sync in vttablets and the corresponding flag in vttablet should be deprecated.
  • ERS should not stop in case of more than one node failure and should use the durability policies to decide whether it can succeed or not.
  • Errant GTID detection code should use the durability policies.
  • Errant GTID detection on tablet startup.
  • Refresh tablet information after locking the shard and use it to determine whether a failover is even required.
  • Reparenting error handling improvements. #2281
  • VTOrc has to lock shard before all operations and refresh local information (from topo server) before deciding to run the operation
  • VTOrc should only fix 3 types of tablets (PRIMARY, REPLICA, RDONLY)
  • Snapshot keyspace shouldn't be considered for vtorc
  • Use durability policy from topo server and remove Durability configuration

References

  1. Paxos - https://lamport.azurewebsites.net/pubs/lamport-paxos.pdf and https://lamport.azurewebsites.net/pubs/paxos-simple.pdf. These papers describe the original Paxos algorithm for distributed consensus.
  2. Flexible Paxos - https://arxiv.org/pdf/1608.06696.pdf. This paper loosens the majority-quorum requirement of Paxos and introduces flexible intersection rules.
  3. Raft - https://raft.github.io/raft.pdf. This paper describes the Raft algorithm for distributed consensus.
  4. Paxos vs Raft - https://arxiv.org/pdf/2004.05074.pdf. This paper contrasts the two algorithms and highlights their differences in order to bridge the gap between them.
  5. General Consensus Framework - https://arxiv.org/pdf/1902.06776.pdf. This paper introduces a general framework; any algorithm that falls under it is proven correct as a distributed consensus algorithm. Raft and Paxos are both shown to fall under this framework.
  6. FLP Theorem (Optional) - http://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf. This paper describes the FLP theorem, which states that no consensus system that does not use time can guarantee forward progress in the face of even a single faulty process.
  7. Semi-Sync - https://www.percona.com/community-blog/2018/08/23/question-about-semi-synchronous-replication-answer-with-all-the-details/ and https://dev.mysql.com/doc/refman/5.7/en/replication-semisync.html describe how semi-sync and lossless semi-sync work in MySQL, along with their issues.
  8. GH rewind - http://code.openark.org/blog/mysql/un-split-brain-mysql-via-gh-mysql-rewind is a GitHub-developed tool for rewinding and reverting non-DDL transactions to repair split-brained servers and put them back into replication.
  9. Consensus at scale - https://planetscale.com/blog/blog-series-consensus-algorithms-at-scale-part-2 goes into detail on the consensus rules.

cc @deepthi @sougou @harshit-gangal @shlomi-noach @rafael @ajm188

@GuptaManan100 GuptaManan100 added Type: RFC Request For Comment Component: VTorc Vitess Orchestrator integration labels Oct 11, 2021
@GuptaManan100 GuptaManan100 self-assigned this Oct 11, 2021
@GuptaManan100
Member Author

#8492 reconciles ERS, fixes its outstanding bug and ports the durability rules to a common place.

@shlomi-noach
Contributor

Nice writeup! I'm so glad you're looking into it, @GuptaManan100 . Some initial thoughts and comments.

For example, let A, B, C, D, E and F be Vitess servers for a single shard.

While instructive for getting the general idea, servers should not be addressed by name, but rather by role. It's important not to mislead the user into thinking about specific servers (aka pets), but to instead let the user allocate servers (cattle) into slots. "These are my serving replicas", "these are my backup servers", etc.

That way, the rule [A-(B,C), D-(E), F] could be reimagined as "failover is allowed cross-DC for non-BACKUP tablets"


Sorry if I'm nitpicking on the [A-(B,C), D-(E), F] example; it's a bit confusing. Are all these tablets part of the same shard? Is D an intermediate primary? Do we really need a semi-sync ACK for an intermediate primary? I'm not exactly following the logic here.


There will be at most one GTID set which had been accepted and needs to be propagated. This GTID set will be the one that is found from the servers during the revoke phase with the largest timestamp.

I'm a bit confused here, and it's entirely plausible that my mind is off. What's that about a single GTID and how does it need to be propagated? If our implementation is with semi-sync, I'd assume all GTIDs are accounted for in some semi-sync acking replica.


Propagation:

For all these cases we assume that the primary needs 2 acknowledgements to commit a transaction.

I don't fully understand the examples, but in any case none of them relate to this comment of the primary having 2 acking nodes. Could you please re-illustrate the examples with actual numbers?


We cannot conclusively say whether the transaction was accepted or rejected since some replicas are unreachable and we do not know whether they sent the ACKs or not. Therefore the only safe way forward to guarantee correctness is to propagate these transactions in the newer primary's term irrespective of whether they were accepted by the previous primary or not.

Did you mean that all the ack-ing tablets are also down? Let's get back to the numbers. If the primary requires 2 acking replicas, and 2 of the acking replicas are down, then we're in the unknown. Otherwise we do know that there's at least one acking replica that's got the latest. So the scenario you're depicting needs to be refined; there are actually two or three different scenarios here that have different consequences.


Failure Scenario - Multiple primaries fail. Primary1 had uncommitted transactions. These were not discovered by primary2 which also then had uncommitted transactions. Primary3 discovers the transaction from primary1 but not from primary2

I'm again confused. What does it mean to have multiple primaries in a shard?


In case multiple uncommitted values are discovered,

What are uncommitted values? I'm not sure I understand; where do you find them and how can you tell they're uncommitted? If they're on a replica, how could they not be committed?


Lock Shard

In my opinion it's better if one vtorc node assumes leadership. Leadership changes should take place via topo, and that means there's still reliance on topo, but only rarely so.

A single node being the leader means the node knows that it just recovered another failover 5 seconds ago; this can mitigate cascading failures and flapping, as was intended by orchestrator's original design.


Bringing up a new vttablet to join a pre-existing cluster will require changing the durability rules to account for this tablet.

If rules are set by roles (cattle) rather than by servers (pets), then bringing in a new tablet does not require changing the rules.

@GuptaManan100
Member Author

@shlomi-noach
Sorry for the confusing nomenclature, but all the servers A-F are ordinary replica servers residing in the same shard. I just wanted to show an example of a durability policy where each server can have different ACKing servers (maybe because some are in different cells). Whatever the case, a durability policy like [A-(B,C), D-(E), F] can be specified, which translates to: A requires an ACK from B or C, D requires an ACK from E, and F does not need ACKs from anyone. We aren't talking about intermediate primaries or anything, just that some servers might have different rules than the others.

I'm a bit confused here, and it's entirely plausible that my mind is off. What's that about a single GTID and how does it need to be propagated? If our implementation is with semi-sync, I'd assume all GTIDs are accounted for in some semi-sync acking replica.

What I mean here is that there is at most one GTID transaction set which needs to be propagated to all the servers, because it was found on one of the replicas, so the new primary must acknowledge it. If there is more than one such GTID set, then only the latest one needs to be propagated. This is because the newer transactions would only have executed after the propagation phase of the previous primary, and since the older transactions weren't propagated then, they would not have been discovered, which means that they weren't accepted and need not be propagated now.

Propagation
I have updated the wording of the first scenario to make it more understandable. The reason I have specified that 2 ACKs are required for a primary is specifically to reach a scenario where a transaction is present on one replica but not yet accepted by the primary (it is still blocked waiting for the second ACK). This is what I mean by uncommitted transactions.

Multiple primary failures mean multiple sequential primary failures, i.e. the first primary failed, someone else got elected but failed soon after too, and so on.

If rules are set by roles (cattle) rather than by servers (pets), then bringing a new tablet does not require changing the rules.

Yes, you are right in this respect, but the remaining point about the tablet not being able to join a cluster while a failover is going on still stands. Also, the way the rules are specified would be up to the user, as long as the durability interface we create is used. They could have rules for specific servers (pets) with a catch-all at the end for roles (cattle) too. At least that is what I did in our test suite.

I hope this clarifies your doubts. Please let me know if some wording needs to change to make the write-up more readable. Thank you 💕

@shlomi-noach
Contributor

We aren't talking about intermediate primaries or anything, just that some servers might have different rules than the others.

Gotcha!

What I mean here is that there is at most one GTID transaction set which needs to be propagated to all the servers

I admit I still don't follow. This must entirely be my absentmindedness; let's take this offline.

This is what I mean by uncommitted transactions.

Got it. What I'm thinking is, since we're talking about a failure scenario, then the primary is anyway unreachable and is in an unknown state. Whether it has or doesn't have uncommitted transactions is immaterial at time of failover, the way I see it.

Multiple primary failures mean, multiple sequential primary failures i.e. the first primary failed, someone else got elected but they failed soon after too, and so on.

Ah, got it! Thanks for clarifying. I'm unsure about the scenario depicted, then. We should work towards a transitive logic; the rules for promoting primary2 over primary1, and then primary3 over primary2, should be transitive; our own logic should not allow a scenario where somehow primary5 suddenly finds an extra transaction unknown to primary[234].

They could have rules for specific servers (pets) with a catch all at the end for roles (cattle) too. At least that is what i did in our test suite.

Agreed, users do have pets; I think we should look at pets as the exception, and have a cattle-first approach.

Thank you and keep up the awesome work!

@GuptaManan100
Member Author

Got it. What I'm thinking is, since we're talking about a failure scenario, then the primary is anyway unreachable and is in an unknown state. Whether it has or doesn't have uncommitted transactions is immaterial at time of failover, the way I see it.

Not necessarily. It could happen with the example servers I have described: let A be the current primary. Both B and C fail, and therefore A is unable to accept writes. Even in this case we would have to run an ERS, but the primary would be reachable and healthy.

primary5 suddenly finds an extra transaction unknown to primary[234]

This could still happen if one of the replicas got network-partitioned and then came back up. In these cases all the extra transactions that this server has should be considered errant, and either they should be rolled back or the server has to be thrown away.

@shlomi-noach
Contributor

Even in this case we would have to run an ERS, but the primary would be reachable and healthy.

In this scenario we can re-purpose another replica in place of the semi-sync acking replica, to ack writes from the primary, and without having to change the primary. What do you think?

In these cases all the extra transactions that this server has should be considered errant...

Right. But at that point none of the new primaries will have been replicating from it. I agree that the replica should then be either discarded or somehow rolled back.

@GuptaManan100
Member Author

GuptaManan100 commented Oct 11, 2021

In this scenario we can re-purpose another replica in place of the semi-sync acking replica, to ack writes from the primary, and without having to change the primary. What do you think?

Well, that is something the user has already specified in their rules, and we have run out of servers that can ACK requests from the current primary. But there are other sets of servers that can function properly, so we would need to fail over to them.

@shlomi-noach
Contributor

and we run out of the servers which can ACK requests from the current primary. But there are other sets of servers which can function properly so would need to failover to them

Right. So the idea of a server that can be promoted as a PRIMARY, but is not good enough to serve as a REPLICA is something I don't find very reasonable, just my 2c.

@GuptaManan100
Member Author

Well, this situation arises when the user wants cross-cell or cross-availability-zone semi-sync ACKs, so a server is capable of becoming a primary but cannot send ACKs to some other servers which can become primaries too. The point of all of this is just that our code should not assume anything when it comes to durability policies. It is up to the user to decide what they want depending on their requirements.

@shlomi-noach
Contributor

Makes perfect sense.

@GuptaManan100
Member Author

GuptaManan100 commented Dec 30, 2021

Hello,
After the work that has been done so far to reconcile PRS and ERS with vtorc, we have succeeded in making them use the new durability policies to choose the new primary. The next thing to tackle is to also use the durability policies to determine whether each tablet should enable semi-sync or not, and to use this information to fix the semi-sync.
Right now we do not use the durability policies for this purpose, and each tablet is responsible for knowing whether it should enable semi-sync or not. This is done by the enable_semi_sync flag on the vttablets.

If we want to use the durability policies, then the function fixSemiSync will no longer need to rely on the flag and will instead take an additional argument. This is all well and good, since we call this function largely from RPCs issued by vtctl or vtorc, both of which have the durability policy information.
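
For illustration only, a hypothetical sketch of the shape this change could take; the names below mirror the discussion rather than the actual Vitess code. The caller (vtctl or vtorc, which already knows the durability policy) passes its decision down instead of the tablet consulting its own flag:

```go
// Hypothetical sketch: names are illustrative, not the actual Vitess API.
package tabletmanager

// SemiSyncAction carries the caller's decision, derived from the durability
// policy that vtctl/vtorc already hold.
type SemiSyncAction int

const (
	SemiSyncActionUnset SemiSyncAction = iota // no decision passed; keep legacy flag-driven behaviour
	SemiSyncActionSet                         // durability policy says this tablet must ACK
	SemiSyncActionNone                        // durability policy says semi-sync stays off
)

// fixSemiSync applies the caller's decision rather than reading a
// tablet-local enable_semi_sync flag. The enable callback stands in for
// whatever actually toggles the semi-sync plugins on the tablet.
func fixSemiSync(action SemiSyncAction, enable func(on bool) error) error {
	switch action {
	case SemiSyncActionSet:
		return enable(true)
	case SemiSyncActionNone:
		return enable(false)
	}
	return nil // Unset: legacy path during the migration
}
```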

The trouble arises from the initialisation phase of a vttablet, which tries to RestoreData, which also calls fixSemiSync. This isn't called from any RPC but on tablet startup, so there is no way to know the durability policies unless they are also passed to the tablets. But this doesn't seem like a great approach, since the durability policy configuration would now be stored in three Vitess components - vttablet, vtorc and vtctl. It would also need to be consistent across all three, which would be hard to enforce. Moreover, this poses a problem in updating the durability policies...

An alternative solution is to store the durability policies in the topo server, and all three components can cache the required information locally. In the initial implementation the durability policies won't change, so the cached information wouldn't have to be refreshed after reading it once. But we could then implement something similar to shard_sync, which alerts all subscribers to changes in the durability policies, and we could extend our framework to allow changing durability policies at runtime as well. The downside of this approach is that we add an additional layer of dependency on the topo server, and it would be harder to remove it from the hot path of failovers, since the durability rules would live there.

Edit - One more difficulty with the second approach: the durability policy is an interface and not basic data like ints or structs. Can we send an interface implementation on the wire and cache it?

@sougou @deepthi @harshit-gangal WDYT?

@sougou
Contributor

sougou commented Jan 4, 2022

I'm thinking we should keep the implementation initially simple. This means that each component will require the flag and will act according to the flag, and will assume that the flag is set uniformly in all other components.

In the future, once vtorc becomes mandatory, vttablet should stop fixing semi-sync. At that time, we can deprecate the flag. As for vtctld vs vtorc flags, they're sibling components. The same problem exists between two independent versions of vtorc. So, each instance should assume that flags are set the same in other instances.

The unified topo approach is a higher level of sophistication, and I don't think we're there yet. There are many more important vars that would be better off if stored in topo. For example, the init_db_name for vttablets. So, I see this as a bigger system cleanup.

@GuptaManan100
Member Author

As per our discussion -

  1. We should go ahead and add the additional parameter to all the RPCs which call fixSemiSync. This parameter will be set from the vtctl side, which already has the durability policy information. We will mark the enable-semi-sync flag for deprecation once this is implemented. It will be a breaking change, and users will have the responsibility of configuring vtctl correctly during the upgrade process. This means that both the extra parameter from the RPC and enable-semi-sync must agree during the upgrade process. Users who have semi-sync set on all vttablets should use the semi_sync durability policy, and the others should use none.

  2. In the propagation phase, we will continue to use the same GTIDs as before instead of introducing new ones for now. This could potentially create problem scenarios, but we will fix them later.

  3. We should explore SHOW PROCESSLIST as an alternative to using reparent_journal as a method of knowing whether replication is set up correctly.

@frouioui frouioui added this to v16.0.0 Dec 6, 2022
@frouioui frouioui moved this to Backlog in v16.0.0 Dec 6, 2022
@frouioui frouioui moved this from Backlog to In Progress in v16.0.0 Dec 6, 2022
@frouioui frouioui added this to v17.0.0 Feb 8, 2023
@frouioui frouioui moved this to In Progress in v17.0.0 Feb 8, 2023
@frouioui frouioui removed this from v16.0.0 Feb 8, 2023
@frouioui frouioui added this to v18.0.0 Jun 30, 2023
@frouioui frouioui moved this to In Progress in v18.0.0 Jun 30, 2023
@frouioui frouioui removed this from v17.0.0 Jun 30, 2023
@GuptaManan100
Member Author

A new issue with suggested improvements has been created at #14284. This issue is thus being closed.
