raft: improve the availability related to member change #7625
Comments
/cc @xiang90
I don't understand what the atomicity buys over remove/add. For the first case, if the node failed, then removing it from membership doesn't seem to make the fault tolerance worse: the cluster will lose availability if another node fails regardless. The window is slightly shorter with ReplaceNode since there's only one commit instead of two, but the window is still there. The rack case is more convincing, but I'm not sure there's an advantage over remove/add at the moment. There's still the risk that the replacement node will be misconfigured and will fail to join raft, so it loses add/remove's advantage of ensuring the new node is participating in raft before shutting off the old node. There would have to be non-voting member support before ReplaceNode can be safe.
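To make the window in this argument concrete, here is a small illustration of the quorum sizes involved (a sketch for illustration, not code from etcd):

```go
package main

import "fmt"

// quorum returns the majority size for n voters.
func quorum(n int) int { return n/2 + 1 }

func main() {
	// remove-then-add, starting from 3 voters with one already failed:
	// after the remove commits there are 2 voters (quorum 2), so the
	// group becomes unavailable if either survivor fails before the
	// add commits.
	fmt.Println(quorum(3), quorum(2)) // 2 2

	// An atomic replace keeps 3 voters throughout (quorum stays 2),
	// but the replacement still has to catch up before it can usefully
	// contribute to that quorum.
	fmt.Println(quorum(3)) // 2
}
```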
For TiKV or CockroachDB, which need to schedule multiple raft groups across different DCs: if one DC fails while the scheduler is working (moving one replica from a host to another host in the same DC), the corresponding raft group cannot work until the DC is up again, regardless of whether add + remove or remove + add is used. Of course, we can use 5 replicas across 3 DCs to solve the problem, but I think this may hurt performance and is not necessary. A better way is to implement an atomic Replace operation or joint consensus; both are mentioned in the Raft paper. Supporting joint consensus would change a lot of code in etcd raft; Replace may be simpler here.
@siddontang sure, I understand that. I'm asking whether ReplaceNode will be effective enough at maintaining availability without non-voting member support. The new node could turn out to be incapable of participating in raft, but there's no way of knowing until after it's added, so there's still a risk of going from 3 to 2. Non-voting membership would at least establish that the new node can contact the raft group and accept commits. Or is this not a problem? ReplaceNode would be useful on its own provided the new node can be guaranteed to catch up with the cluster quickly (which is better than nothing, as it is now); I'm just saying that it seems to solve only half of the problem...
@heyitsanthony Yeah, a brand-new node that has just been added to the cluster needs some time to catch up with the leader, and there's still a risk of node failure during this "ReplaceNode" procedure. A better way to perform the member change in this situation would be:
Each of 1 and 2 could be implemented one at a time and then combined to solve the issue. We could do 2 as the first step.
There are cases where you can add a bad member (too slow to catch up, unreachable IP). I think @heyitsanthony feels those risks outweigh the risk of allowing the small window between add/remove. If we really care about safety, we should fix the most significant problem first. The right way to do the replacement is probably:
Starting with ReplaceNode seems fine... it's still one less way the cluster can break (even without non-voting members), and it needs to be done anyway.
No one is actively working on this. I am going to move it to unplanned.
The CockroachDB team is planning to implement joint consensus in the upcoming ~6 month release cycle (to commence shortly). An in-progress RFC is available. Similar to what has been stated in this thread, an atomic "replace" primitive would be enough on our end as well, though unless this allows for some kind of simplification, we will implement support for arbitrary membership changes.

The research done in the RFC so far suggests that the etcd approach of activating configuration changes only when nodes apply them is undesirable (if not outright unsound). The plan is thus to implement joint consensus as outlined in the Raft thesis, where a configuration change becomes active the moment it is appended to the log. Input on this decision is welcome; we're somewhat in the dark on why apply-time activation was chosen originally.

The move to joint consensus presents a challenge for upgrades. For compatibility reasons, we need the version of raft that activates changes at apply time to interoperate with the new append-time behavior during a migration.

I'm curious what expectations the maintainers here have regarding the interface. Personally, I believe that asking users to slightly adapt their interfaces is fine, but I would like to have some clarity from the maintainers going forward.
@tbg Do you want to talk about this at the etcd community meeting?
This is the closest that I can find: https://groups.google.com/d/msg/raft-dev/t4xj6dJTP6E/NMfCgTO90t0J
The Progress maps contain both the active configuration and information about the replication status. By pulling it into its own component, this becomes easier to unit test and also clarifies the code, which will see changes as etcd-io#7625 is addressed. More functionality will move into `prs` in self-contained follow-up commits.
This cleans up the mechanical refactor in the last commit and will help with etcd-io#7625 as well.
This now delegates the quorum computation to r.prs, which will allow it to generalize in a straightforward way when etcd-io#7625 is addressed.
We have already supported joint consensus in the Rust raft-rs (tikv/raft-rs#202), but have not used it in TiKV yet. We can contribute it back to etcd raft.
Thanks @jingyih, I'll be there. Added an item to the agenda.
So, some history here. When we started to implement etcd/raft, we wanted to start with a simpler reconfiguration protocol (both to save time and to explore whether it was actually feasible). As a result, etcd implemented the atomic membership change first; the Raft thesis was published later with a modified version of atomic membership change. The ideas are similar, but they are not exactly the same. I like apply-time reconfiguration since the reconfig command involves both a raft protocol change and an application-level change, and I want the application-level change to be strictly ordered. To implement this with commit-time reconfiguration, the code/execution paths for the raft protocol change and the application-level change of a reconfiguration would have to be separated. From a safety perspective, I remember we added a few barriers to avoid double in-flight reconfigurations. To be honest, I do not remember the details now, and I am not sure if the cases described in your analysis can actually happen. If they can, I guess it is still fixable :P.
I am fine with adding joint membership change as an alternative. Do you really need to replace the atomic one, or can they co-exist (at least so applications can be configured to use one or the other at start time)?
This commit introduces machinery to safely apply joint consensus configuration changes to Raft. The main contribution is the new package, `confchange`, which offers the primitives `Simple`, `EnterJoint`, and `LeaveJoint`. The first two take a list of configuration changes. `Simple` only declares success if these configuration changes (applied atomically) change the set of voters by at most one (i.e. it's fine to add or remove any number of learners, but to change only one voter). `EnterJoint` makes the configuration joint and then applies the changes to it, in preparation of the caller returning later and transitioning out of the joint config into the final desired configuration via `LeaveJoint()`.

This commit streamlines the conversion between voters and learners, which is now generally allowed whenever the above conditions are upheld (i.e. it's not possible to demote a voter and add a new voter in the context of a Simple configuration change, but it is possible via EnterJoint). Previously, we had the artificial restriction that a voter could not be demoted to a learner but had to be removed first. Even though demoting a voter is generally less useful than promoting a learner (the latter is used to catch up future voters), demotions could see use in improved handling of temporary node unavailability, where it is desired to remove voting power from a down node but to preserve its data should it return.

An additional change made in this commit is to prevent the use of empty commit quorums, which was previously possible but for no good reason. This:

Closes etcd-io#10884.

The work left to do in a future PR is to actually expose joint configurations to the applications using Raft. This will entail mostly API design and the addition of suitable testing, which, to be carried out ergonomically, is likely to motivate a larger refactor.

Touches etcd-io#7625.
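For context, here is a minimal sketch of how an application might later propose such a joint change through the `raftpb.ConfChangeV2` message; the import path and node IDs are illustrative, and the V2 API was only exposed in follow-up work:

```go
package main

import "go.etcd.io/etcd/raft/raftpb"

// replaceViaJoint builds a ConfChangeV2 that atomically swaps one
// voter for another by going through a joint configuration. With
// Transition left as Auto, raft transitions out of the joint config
// on its own once the change has been applied.
func replaceViaJoint(remove, add uint64) raftpb.ConfChangeV2 {
	return raftpb.ConfChangeV2{
		Transition: raftpb.ConfChangeTransitionAuto,
		Changes: []raftpb.ConfChangeSingle{
			{Type: raftpb.ConfChangeAddNode, NodeID: add},
			{Type: raftpb.ConfChangeRemoveNode, NodeID: remove},
		},
	}
}
```

The resulting proposal would be handed to the node via `Node.ProposeConfChange`, which accepts both the V1 and V2 message types.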
It turns out that learners must be allowed to cast votes. This seems counterintuitive, but is necessary in the situation in which a learner has been promoted (i.e. is now a voter) but has not learned about this yet. For example, consider a group in which id=1 is a learner and id=2 and id=3 are voters. A configuration change promoting 1 can be committed on the quorum `{2,3}` without the config change being appended to the learner's log. If the leader (say 2) fails, there are de facto two voters remaining. Only 3 can win an election (due to its log containing all committed entries), but to do so it will need 1 to vote. But 1 considers itself a learner and will continue to do so until 3 has stepped up as leader, replicated the conf change to 1, and 1 has applied it. Ultimately, by receiving a request to vote, the learner realizes that the candidate believes it to be a voter, and that it should act accordingly. The candidate's config may be stale, too; but in that case it won't win the election, at least in the absence of the bug discussed in etcd-io#7625 (comment).
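In code terms, the rule this commit arrives at can be sketched as follows (simplified pseudocode, not raft's actual internals; the names are illustrative): whether a vote is granted depends only on vote bookkeeping and log up-to-dateness, never on whether the recipient believes itself to be a learner.

```go
// canGrantVote sketches the vote rule described above. Note what is
// absent: there is no "am I a learner?" check, because the candidate
// may have applied a promotion that this node has not yet seen.
func canGrantVote(votedFor, candidate uint64,
	candLastTerm, candLastIndex, lastTerm, lastIndex uint64) bool {
	// Only vote if we haven't already voted for someone else this term.
	canVote := votedFor == 0 || votedFor == candidate
	// Standard Raft up-to-dateness check on the candidate's log.
	upToDate := candLastTerm > lastTerm ||
		(candLastTerm == lastTerm && candLastIndex >= lastIndex)
	return canVote && upToDate
}
```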
It has often been tedious to test the interactions between multi-member Raft groups, especially when many steps were required to reach a certain scenario. Often, this boilerplate was as boring as it is hard to write and hard to maintain, making it attractive to resort to shortcuts whenever possible, which in turn tended to undercut how meaningful and maintainable the tests ended up being - that is, if the tests were even written, which sometimes they weren't.

This change introduces a datadriven framework specifically for deterministically testing the interaction between multiple members of a raft group, with the goal of reducing the friction for writing these tests to near zero. In the near term, this will be used to add thorough testing for joint consensus (which is already available today, but wildly undertested), but just converting an existing test into this framework has shown that the concise representation and built-in inspection of log messages highlight unexpected behavior much more readily than the previous unit tests did (the test in question is `snapshot_succeed_via_app_resp`; the reader is invited to compare the old and new versions of it).

The main building block is `InteractionEnv`, which holds on to the state of the whole system and exposes various relevant methods for manipulating it, including but not limited to adding nodes, delivering and dropping messages, and proposing configuration changes. All of this is extensible, so that in the future I hope to use it to explore the phenomena discussed in etcd-io#7625 (comment), which requires injecting appropriate "crash points" in the Ready handling loop. Discussions of "what if X happened in state Y" can quickly be made concrete by "scripting up an interaction test".

Additionally, this framework is intentionally not kept internal to the raft package. Though this is in its infancy, a goal is that it should be possible for a suite of interaction tests to allow applications to validate that their Storage implementation behaves accordingly, simply by running a raft-provided interaction suite against their Storage.
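As a flavor of what such a datadriven test could look like, here is a hypothetical script; the command names and output are illustrative, not necessarily the framework's exact vocabulary:

```
# Start a three-node group and make node 1 campaign for leadership.
add-nodes 3 voters=(1,2,3) index=2
----
ok

campaign 1
----
ok

# Deliver all pending messages until the system is quiescent,
# printing messages and state transitions along the way.
stabilize
----
...
```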
Continuing discussion with @hicqu from #11284 here:
My concern is less that this can't be made to work, and more that it would take a production use case to justify the review and maintenance effort involved. That isn't to say that I am not interested in the improvement - I am - just that at this point I can't give any particular commitment to reviewing, improving, or merging the work. However, if this isn't too off-putting, we could scope out the work to get a more concrete picture of what changes would be entailed - for example, how the API would need to change.
FYI, in CockroachDB we are now working around this problem by not removing voters directly. We demote them first (i.e. turn them into learners) and then remove them. That avoids the unavailability when the leader crashes.
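With the V2 conf-change API, that two-step workaround might look roughly like the following sketch; it assumes demotion is expressed as an AddLearnerNode change for an existing voter, and is not CockroachDB's actual code:

```go
package main

import "go.etcd.io/etcd/raft/raftpb"

// demoteThenRemove returns the two configuration changes used to take
// voter id out of the group in two steps: first demote it to a
// learner, then (once that change has applied) remove the learner.
// Removing a learner does not touch the voter set, so a leader crash
// between the two steps cannot cost the group its quorum.
func demoteThenRemove(id uint64) (demote, remove raftpb.ConfChangeV2) {
	demote = raftpb.ConfChangeV2{
		Changes: []raftpb.ConfChangeSingle{
			{Type: raftpb.ConfChangeAddLearnerNode, NodeID: id},
		},
	}
	remove = raftpb.ConfChangeV2{
		Changes: []raftpb.ConfChangeSingle{
			{Type: raftpb.ConfChangeRemoveNode, NodeID: id},
		},
	}
	return demote, remove
}
```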
@tbg thanks for your effort! I believe it's helpful; however, it may not resolve the root of the problem. Suppose there is a Raft group with peers A, B, C, and D.
A has left the joint configuration and responded to the proposer that A is removed, but B, C, and D don't know this yet. I'm also trying to find a solution for this case. Please take a look, thanks!
In your example, A is two configurations ahead of B, C, and D, which should not be possible if only one configuration change is allowed to be in flight at a time.
Suppose the Raft group enters joint at index N and every peer applies that entry; A then applies the LeaveJoint entry at index N+1 before B, C, and D have applied it.
So it seems A is only one configuration ahead of B, C, and D.
Right, but what's your example? A is still around until it gets removed. If you're saying A goes down while another node also goes down, then yes, you could lose quorum, but you started out with four nodes where that is also true, and if you wanted to know when it stopped being true, you could wait for a majority to have applied the latest change (ugly, but it works). The thing I really want to avoid is having anything funky happen when replacing a node, i.e. going from {A, B, C} to {A, B, D}.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@tbg What's left to do here? We've been using joint consensus in CRDB for a while now. Is the only open item to allow removal of a voter directly, without downgrading to learner first?
Hi,
The current member change implementation requires at least two working nodes in a cluster. If one node fails in a three-node cluster, there is a short window during which availability is at risk from another node failure: after the previously failed node has been removed but before the new node has been added.
Another availability issue arises when balancing nodes among racks/data centers. The usual way to do the balancing is to add a new node and then remove an old node in a different rack/data center. After adding a node to a three-node cluster spread across three racks/data centers, there will be two nodes in the same rack/data center. If that rack/data center fails, the cluster is unavailable. This issue is elaborated in tikv/tikv#1468.
Both availability issues are related to the member change implementation. To fix them, I suggest adding a "ReplaceNode" primitive to member change: write and commit a single log entry that removes one existing node and adds a new one.
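A minimal sketch of the proposed primitive, purely for illustration (no such entry type exists in etcd raft; joint consensus via ConfChangeV2 is what eventually covered this use case):

```go
// ReplaceNode is a hypothetical conf-change entry carrying both
// halves of the swap in a single log entry.
type ReplaceNode struct {
	RemoveID uint64 // existing member to drop
	AddID    uint64 // replacement member
}

// applyReplace applies the swap atomically when the entry commits, so
// the voter set goes directly from {A, B, C} to {A, B, D} and never
// passes through a two- or four-voter intermediate configuration.
func applyReplace(voters map[uint64]struct{}, rc ReplaceNode) {
	delete(voters, rc.RemoveID)
	voters[rc.AddID] = struct{}{}
}
```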