RFC - Durability and consensus in Vtorc #8975
Comments
#8492 reconciles ERS, fixes its outstanding bug and ports the durability rules to a common place.
Nice writeup! I'm so glad you're looking into it, @GuptaManan100 . Some initial thoughts and comments.
While instructive for getting the general idea, servers should not be addressed by name, but rather by role. It's important not to mislead the user into thinking about specific servers (aka pets), but to instead let the user allocate servers (cattle) into slots: "these are my serving replicas", "these are my backup servers", etc. That way, the rule Sorry if I'm nitpicking on the
I'm a bit confused here, and it's entirely plausible that my mind is off. What's that about a single GTID and how does it need to be propagated? If our implementation is with semi-sync, I'd assume all GTIDs are accounted for in some semi-sync acking replica. Propagation:
I don't fully understand the examples, but in any case none of them relate to this comment about the primary having 2 acking nodes. Could you please re-illustrate the examples with actual numbers?
Did you mean that all the ack-ing tablets are also down? Let's get back to the numbers. If the primary requires 2 acking replicas, and 2 of the acking replicas are down, then we're in the unknown. Otherwise, we do know that there's at least one acking replica that has the latest. So the scenario you're depicting needs to be refined; there are actually two or three different scenarios here that have different consequences.
I'm again confused. What does it mean to have multiple primaries in a shard?
What are uncommitted values? I'm not sure I understand; where do you find them and how can you tell they're uncommitted? If they're on a replica, how could they not be committed?
In my opinion it's better if one node is the leader. A single node being the leader means the node knows that it just recovered another failover 5 seconds ago; this can mitigate cascading failures and flapping, as was intended by
If rules are set by roles (cattle) rather than by servers (pets), then bringing in a new tablet does not require changing the rules.
@shlomi-noach
What I mean here is that there is at most 1 GTID transaction set which needs to be propagated to all the servers, because it was found on one of the replicas, so the new primary must acknowledge it. If there is more than one such GTID set, then only the latest one needs to be propagated. This is because the newer transactions would have only executed after the propagation phase of the previous primary, and since the older transactions weren't propagated, they would not have been discovered, which means that they weren't accepted and need not be propagated now. Multiple primary failures means multiple sequential primary failures, i.e. the first primary failed, someone else got elected but failed soon after too, and so on.
Yes, you are right in this respect, but the remaining point about the tablet not being able to join a cluster while a failover is going on still stands. Also, the way to specify the rules would be up to the user, as long as the durability interface we create is used. They could have rules for specific servers (pets) with a catch-all at the end for roles (cattle) too. At least that is what I did in our test suite. I hope this clarifies your doubts. Please let me know if some wording needs to change to make the write-up more readable. Thank you 💕
Gotcha!
I admit I still don't follow. This must entirely be my absentmindedness; let's take this offline.
Got it. What I'm thinking is, since we're talking about a failure scenario, then the primary is anyway unreachable and is in an unknown state. Whether it has or doesn't have uncommitted transactions is immaterial at time of failover, the way I see it.
Ah, got it! Thanks for clarifying. I'm unsure about the scenario depicted, then. We should work towards a transitive logic; the rules for promoting primary2 over primary1, and then primary3 over primary2, should be transitive; our own logic should not allow a scenario where somehow primary5 suddenly finds an extra transaction unknown to primary[234].
Agreed, users do have pets; I think we should look at pets as the exception, and have a cattle-first approach. Thank you and keep up the awesome work!
Not necessarily. It could happen with the example servers that I have described: let A be the current primary. Both B and C fail, and therefore A is unable to accept writes. Even in this case we would have to run an ERS, but the primary would be reachable and healthy.
This could still happen if one of the replicas got network partitioned and then came back up. In these cases all the extra transactions that this server has should be considered errant, and either they should be rolled back or the server has to be thrown away.
In this scenario we can re-purpose another replica in place of the semi-sync acking replica, to ack writes from the primary, and without having to change the primary. What do you think?
Right. But at that point none of the new primaries will have been replicating from it. I agree that the replica should then be either discarded or somehow rolled back.
Well, that is something the user has already specified in their rules, and we have run out of the servers which can ACK requests from the current primary. But there are other sets of servers which can function properly, so we would need to fail over to them.
Right. So the idea of a server that can be promoted as a PRIMARY, but is not good enough to serve as a REPLICA, is something I don't find very reasonable, just my 2c.
Well, this situation arises when the user wants cross-cell or cross-availability-zone semi-sync ACKs, so a server is capable of becoming a primary but cannot send ACKs to some other servers which can become primaries too. The point of all of this is just that our code should not be assuming anything when it comes to durability policies. It is up to the user to decide what they want depending on their requirements.
Makes perfect sense.
Hello, if we want to use the durability policies then the function … The trouble arises from the initialisation phase of a vttablet, which tries to …

An alternate solution is to store the durability policies in the Topo Server, and all three components can cache the required information locally. In the initial implementation, the durability policies won't change, so the cached information wouldn't have to be changed after reading once. But then we can implement something similar to …

Edit - One more difficulty with the second approach: the durability policy is an interface and not basic data like ints or structs. Can we send an interface implementation on the wire and cache it?

@sougou @deepthi @harshit-gangal WDYT?
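One way to sidestep sending an interface implementation over the wire is to persist only a policy name and rebuild the interface locally from a registry. The sketch below is illustrative only and does not use the actual Vitess API; `Durabler`, `Register` and `Get` are hypothetical names.

```go
package policyregistry

import "fmt"

// Durabler stands in for the durability-policy interface discussed above;
// the real interface lives in Vitess and is not reproduced here.
type Durabler interface {
	Name() string
}

// registry maps policy names to constructors. Each component registers the
// policies it compiles in at startup.
var registry = map[string]func() Durabler{}

// Register associates a policy name with its constructor.
func Register(name string, factory func() Durabler) {
	registry[name] = factory
}

// Get rebuilds a policy from a name read from the topo server (or a flag),
// so only the string ever needs to be stored or sent on the wire.
func Get(name string) (Durabler, error) {
	factory, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unknown durability policy %q", name)
	}
	return factory(), nil
}
```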
I'm thinking we should keep the implementation initially simple. This means that each component will require the flag and will act according to the flag, and will assume that the flag is set uniformly in all other components. In the future, once vtorc becomes mandatory, vttablet should stop fixing semi-sync. At that time, we can deprecate the flag. As for vtctld vs vtorc flags, they're sibling components. The same problem exists between two independent versions of vtorc. So, each instance should assume that flags are set the same in other instances. The unified topo approach is a higher level of sophistication, and I don't think we're there yet. There are many more important vars that would be better off if stored in topo. For example, the init_db_name for vttablets. So, I see this as a bigger system cleanup.
As per our discussion -
A new issue with suggested improvements has been created at #14284. This issue is thus being closed. |
This issue describes the feature for pluggable durability requirements and the consensus approach needed to fulfil them. It is also used to keep track of the work done and the tasks remaining before a GA release.
Feature Description
Vtorc should be used in Vitess instead of orchestrator, and it should support pluggable durability requirements that the user can specify according to their use case. For example, let A, B, C, D, E and F be Vitess `replica` type servers for a single shard. The user should then be able to specify, for each of them, the servers it needs ACKs from in order to accept writes. They can also specify servers which are not capable of becoming primaries.

Example: let [A-(B,C), D-(E), F] be one such set of rules. Here A needs ACKs from either B or C to accept writes. Similarly, D needs ACKs from E, and F can be a primary on its own.

Possible quorums inferred from these rules - ([A,B], [A,C], [D,E], [F])
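To make the example concrete, here is a minimal Go sketch of how such rules could be encoded and how the quorums above could be derived from them. The `Rule` type, its fields and the function names are hypothetical, not the actual Vitess durability interface.

```go
package durability

// Rule describes, for one server, which other servers may acknowledge its
// writes. An empty AckFrom list means the server accepts writes on its own.
type Rule struct {
	Candidate  string   // server that may be promoted to primary
	AckFrom    []string // servers it may take semi-sync ACKs from
	AcksNeeded int      // ACKs required before a write is accepted
}

// ExampleRules encodes [A-(B,C), D-(E), F] from the description above.
var ExampleRules = []Rule{
	{Candidate: "A", AckFrom: []string{"B", "C"}, AcksNeeded: 1},
	{Candidate: "D", AckFrom: []string{"E"}, AcksNeeded: 1},
	{Candidate: "F", AcksNeeded: 0},
}

// Quorums lists the server sets that can make progress on their own: a
// candidate primary plus enough of its ACKers. The sketch only handles
// AcksNeeded of 0 or 1, which is all the example needs.
func Quorums(rules []Rule) [][]string {
	var quorums [][]string
	for _, r := range rules {
		if r.AcksNeeded == 0 {
			quorums = append(quorums, []string{r.Candidate})
			continue
		}
		for _, acker := range r.AckFrom {
			quorums = append(quorums, []string{r.Candidate, acker})
		}
	}
	return quorums
}
```

Running `Quorums(ExampleRules)` yields [A,B], [A,C], [D,E] and [F], matching the quorums listed above.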
Technical Details -
There are three steps that each consensus algorithm must follow, namely - Revocation, Propagation and Establishment.
We now outline each of these phases in greater detail.
Revocation
The first phase is to revoke access from the previous primary so that it can no longer accept writes. This can be done either by directly contacting the primary or by reaching enough of its replicas that it cannot accept writes. Since we will use MySQL semi-sync, if we remove enough replicas, the primary will not be able to accept any writes and will block.
The servers that need to be contacted to revoke access are defined by the durability policy rules. For example, in the case described in the feature description, the possible server sets that need to be reached for revocation are ([A,E,F], [A,D,F], [B,C,E,F], [B,C,D,F]).
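Continuing the earlier sketch, these sets can be derived mechanically: for each candidate primary we must either reach the candidate itself or all of the servers it could take ACKs from. The function below reuses the hypothetical `Rule` type and, like it, assumes each candidate needs at most one ACK.

```go
// RevocationSets enumerates the server sets that are sufficient to revoke
// writes under the given rules: one choice per candidate primary, either the
// candidate itself or every server that could ACK its writes.
func RevocationSets(rules []Rule) [][]string {
	sets := [][]string{{}}
	for _, r := range rules {
		// Two ways to silence this candidate: contact it directly, or
		// contact all of its possible ACKers.
		options := [][]string{{r.Candidate}}
		if len(r.AckFrom) > 0 {
			options = append(options, r.AckFrom)
		}
		var next [][]string
		for _, partial := range sets {
			for _, opt := range options {
				combined := append(append([]string{}, partial...), opt...)
				next = append(next, combined)
			}
		}
		sets = next
	}
	return sets
}
```

For `ExampleRules` this produces exactly the four sets listed above.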
Propagation
The next phase is propagation, where we will need to find and propagate any transactions that the previous primary accepted, or might have accepted, if we are unable to determine conclusively that they were rejected. There will be at most one GTID range which had been accepted and needs to be propagated: the one found on the servers during the revoke phase with the largest timestamp.
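A minimal sketch of that selection step is below; the types and names are hypothetical. Note that, as the second failure scenario below shows, the timestamp-based choice alone is not sufficient across multiple sequential failovers, so the chosen transactions must be propagated in the new primary's term.

```go
package propagation

import "time"

// discoveredSet is a GTID range found on some reached server during the
// revoke phase, with the timestamp of its latest transaction.
type discoveredSet struct {
	gtidSet   string
	timestamp time.Time
}

// setToPropagate picks the single GTID range to carry into the new primary's
// term: the discovered set with the largest timestamp. Older discovered sets
// were never propagated and hence never accepted, so they can be skipped.
func setToPropagate(found []discoveredSet) (discoveredSet, bool) {
	if len(found) == 0 {
		return discoveredSet{}, false
	}
	latest := found[0]
	for _, s := range found[1:] {
		if s.timestamp.After(latest.timestamp) {
			latest = s
		}
	}
	return latest, true
}
```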
Here we discuss multiple failure scenarios to establish the rules that need to be followed. For all these cases we assume that the primary needs 2 acknowledgements to commit a transaction.
Failure Scenario - Primary and one replica failed. There is a reachable replica which has some transactions which all the other reachable replicas do not have. We cannot conclusively say whether these transactions were accepted or rejected since there is one unreachable replica and we do not know whether it sent an ACK or not. Therefore the only safe way forward to guarantee correctness is to propagate these transactions in the newer primary's term irrespective of whether they were accepted by the previous primary or not. The only guarantee we need to uphold is that any accepted transaction must be persistent and should not be lost. The transactions which timed out can either be accepted or rejected.
Failure Scenario - Multiple primaries fail. Primary1 had uncommitted transactions. These were not discovered by primary2, which then also had uncommitted transactions. Primary3 discovers the transaction from primary1 but not from primary2, so it propagates that transaction. At this point the transaction from primary1 has reached enough servers to be accepted. Primary3 fails and primary4 comes up. It discovers both primary1's and primary2's transactions but, because of timestamp resolution, propagates the latter, which would be wrong. This situation is solved in Raft by propagating transactions from the previous primary in the newer term. We must do the same, i.e. we must propagate transactions as new GTIDs and not the old ones.
Final consolidated rules -
Establishment
This is the final step. We only need to reach enough tablets such that a quorum can be formed.
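Continuing the sketch, the check for this step could look like the following: establishment succeeds once the set of reachable tablets covers at least one of the inferred quorums (`Quorums` from the earlier sketch).

```go
// HasQuorum reports whether the reachable tablets cover at least one of the
// quorums inferred from the durability rules.
func HasQuorum(reachable map[string]bool, quorums [][]string) bool {
	for _, quorum := range quorums {
		covered := true
		for _, server := range quorum {
			if !reachable[server] {
				covered = false
				break
			}
		}
		if covered {
			return true
		}
	}
	return false
}
```

For example, with only A and C reachable, `HasQuorum` succeeds via the [A,C] quorum.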
Other considerations -
Lock Shard
Currently, lock shard is used to prevent conflicting failovers from contending with each other. However, this creates a dependency on the topo server being up. Eventually we only want to use the topo server for discovery and not rely on an external system for locking.
Changing configuration rules during runtime
Changing configuration rules during runtime will become a consensus problem of its own since some nodes would have received the new updates and some would not have.
Bringing up new vttablets to join a cluster
Bringing up a new vttablet to join a pre-existing cluster will require changing the durability rules to account for this tablet. Also, a tablet cannot join the cluster during a failover; otherwise it could end up replicating from the failed primary and sending ACKs, so some transactions would be committed and then lost. This will require us to lock the shard every time a new vttablet is added to a cluster, leading to large contention for the shard lock.
Rejoining of a previously failed primary
It is not always possible for a failed primary to rejoin a cluster due to errant GTIDs. There are two solutions here.
TODO list
- PostERS cleanup - Rewind usage or some other way to bring the failed node back. Still needs discussion.
- Find whether MySQL will fail if we remove some part of the binlog file.
- Errant GTID detection code should use the durability policies.
- Errant GTID detection on tablet startup.
- Reparenting error handling improvements. #2281
cc @deepthi @sougou @harshit-gangal @shlomi-noach @rafael @ajm188