
raft: improve the availability related to member change #7625

Closed
hhkbp2 opened this issue Mar 29, 2017 · 39 comments
@hhkbp2
Contributor

hhkbp2 commented Mar 29, 2017

Hi,

The current member change implementation requires at least two working nodes for a cluster to make progress. If one node fails in a three-node cluster, there is a short window, after the previously failed node is removed and before the new node is added, during which availability is at risk from another node failure.

Another availability issue arises when balancing nodes among racks/data centers. The usual way to rebalance is to add a new node and then remove an old node in a different rack/data center. After adding a node to a three-node cluster spread across three racks/data centers, two nodes end up in the same rack/data center; if that rack/data center fails, the cluster is unavailable. This issue is elaborated in tikv/tikv#1468

Both availability issues are related to the member change implementation. To fix them, I suggest adding a "ReplaceNode" primitive to member change: it writes and commits a single log entry to achieve the target of "remove one existing node and add a new node".
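For illustration, here is a minimal sketch of such a primitive, expressed with the ConfChangeV2 API that etcd/raft eventually gained (the import path assumes the current standalone raft module; the helper name and node IDs are made up):

```go
package raftutil

import (
	"context"

	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

// replaceNode proposes "remove oldID, add newID" as one atomic
// configuration change. With the default (automatic) transition, raft
// enters a joint configuration for the duration of the change and
// leaves it again by itself once the change is committed and applied.
func replaceNode(ctx context.Context, n raft.Node, oldID, newID uint64) error {
	cc := raftpb.ConfChangeV2{
		Changes: []raftpb.ConfChangeSingle{
			{Type: raftpb.ConfChangeRemoveNode, NodeID: oldID},
			{Type: raftpb.ConfChangeAddNode, NodeID: newID},
		},
	}
	return n.ProposeConfChange(ctx, cc)
}
```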

@siddontang
Contributor

/cc @xiang90

@heyitsanthony
Contributor

I don't understand what the atomicity buys over remove/add.

For the first case, if the node failed, then removing it from membership doesn't seem to make the fault tolerance worse: the cluster will lose availability if another node fails regardless. The window is slightly shorter with ReplaceNode since there's only one commit instead of two, but the window is still there.

The rack case is more convincing, but I'm not sure there's an advantage over remove/add at the moment. There's still the risk that the replacement node is misconfigured and fails to join raft, so ReplaceNode loses add-then-remove's advantage of ensuring the new node is participating in raft before the old node is shut off. There would have to be non-voting member support before ReplaceNode can be safe.

@siddontang
Contributor

Hi @heyitsanthony

For TiKV or CockroachDB, which need to schedule many raft groups across different DCs: if one DC fails while the scheduler is working (moving one replica from a host to another host in the same DC), the corresponding raft group cannot make progress until the DC is up again, regardless of whether add-then-remove or remove-then-add is used.

Of course, we could use 5 replicas across 3 DCs to solve the problem, but I think that may hurt performance and is not necessary. A better way is to implement an atomic Replace operation or joint consensus, both of which are mentioned in the Raft paper. Supporting joint consensus would change a lot of code in etcd raft; Replace may be simpler here.

@heyitsanthony
Contributor

@siddontang sure, I understand that. I'm asking whether ReplaceNode will be effective enough at maintaining availability without non-voting member support. The new node could turn out to be incapable of participating in raft, and there's no way of knowing until after it's added, so there's still a risk of going from 3 to 2. Non-voting membership would at least establish that the new node can contact the raft group and accept commits. Or is this not a problem?

ReplaceNode would be useful on its own, provided the new node can be guaranteed to catch up with the cluster quickly (which is better than nothing, as things stand); I'm just saying it seems to solve only half of the problem...

@hhkbp2
Contributor Author

hhkbp2 commented Mar 30, 2017

@heyitsanthony Yeah, a brand-new node that has just been added to the cluster needs some time to catch up with the leader, and there's still a risk of node failure during this "ReplaceNode" procedure. A better way to perform the member change in this situation would be:

  1. Initialize a non-voting node in the target rack/data center. Replicate snapshot/log to this non-voting node until it is close enough (i.e., until the remaining log gap can be covered within one election timeout, as described in Raft thesis 4.2.1)
  2. Perform the "ReplaceNode" action to remove one existing node and add the prepared one

Each of 1 and 2 could be implemented one at a time and then combined to solve the issue. We could start with 2.
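A sketch of this two-step plan in the same terms as the earlier sketch (same package, reusing replaceNode from above; caughtUp is a caller-provided readiness check, e.g. the in-sync test shown further down the thread):

```go
import (
	"context"
	"time"

	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

// replaceViaLearner first adds the new node as a non-voting learner so
// it can catch up from a snapshot without affecting quorum, then
// atomically swaps it in for the old voter.
func replaceViaLearner(ctx context.Context, n raft.Node, oldID, newID uint64, caughtUp func() bool) error {
	addLearner := raftpb.ConfChangeV2{
		Changes: []raftpb.ConfChangeSingle{
			{Type: raftpb.ConfChangeAddLearnerNode, NodeID: newID},
		},
	}
	if err := n.ProposeConfChange(ctx, addLearner); err != nil {
		return err
	}
	// Step 1 continues: replicate snapshot/log until the gap is small.
	for !caughtUp() {
		time.Sleep(100 * time.Millisecond) // polling keeps the sketch simple
	}
	// Step 2: the "ReplaceNode" action from the earlier sketch.
	return replaceNode(ctx, n, oldID, newID)
}
```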

@xiang90
Contributor

xiang90 commented Mar 30, 2017

@hhkbp2 @siddontang

Yeah, a brand-new node that has just been added to the cluster needs some time to catch up with the leader.

There are cases where you can add a bad member (too slow to catch up, unreachable IP). I think @heyitsanthony feels those risks outweigh the risk from the small window between add/remove. If we really care about safety, we should fix the most significant problem first.

The right way to do the replacement is probably:

  1. start a non-voting member
  2. wait until the non-voting member is in sync
  3. replace the old member with the non-voting member
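As a sketch of the "in sync" check in step 2: the leader's Status() exposes per-follower progress. The helper name and slack parameter are invented for illustration (same package as the sketches above):

```go
import "go.etcd.io/raft/v3"

// learnerInSync reports, from the leader, whether learnerID's replicated
// log is within slack entries of the leader's own last replicated index.
// A real implementation would derive slack from what can be caught up
// within one election timeout (Raft thesis 4.2.1).
func learnerInSync(n raft.Node, learnerID, slack uint64) bool {
	st := n.Status()
	if st.RaftState != raft.StateLeader {
		return false // Progress is only populated on the leader
	}
	pr, ok := st.Progress[learnerID]
	if !ok {
		return false
	}
	return pr.Match+slack >= st.Progress[st.ID].Match
}
```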

@heyitsanthony
Contributor

Starting with ReplaceNode seems fine... it's still one less way the cluster can break (even without non-voting members), and it needs to be done anyway.

@xiang90
Contributor

xiang90 commented Aug 2, 2017

no one is actively working on this. i am going to move it to unplanned.

@tbg
Contributor

tbg commented Apr 24, 2019

The CockroachDB team is planning to implement joint consensus in the upcoming ~6 month release cycle (to commence shortly). An in-progress RFC is available.

Similar to what has been stated in this thread, an atomic "replace" primitive would be enough on our end as well, though unless this allows for some kind of simplification we will implement support for arbitrary membership changes.

The research done in the RFC so far suggests that the etcd approach of activating configuration changes only when nodes apply them is undesirable (if not outright unsound). The plan is thus to implement joint consensus as outlined in the Raft thesis, where a configuration change becomes active the moment it is appended to the log. Input on this decision is welcome. We're somewhat in the dark on why etcd/raft originally decided to deviate from the thesis. I traced this all the way back to #1100, but there's not much in terms of justifying that decision. It sort of seemed to just have happened.

The move to joint consensus presents a challenge for upgrades. For compatibility reasons, we need the version of etcd/raft that implements joint consensus to at least tolerate but likely also plan "regular" one-at-a-time membership (V1) changes. However, we're not planning to add full support for returning to V1 membership changes after a V2 membership change.

I'm curious what expectations maintainers here have regarding the interface etcd/raft exposes to the application. For example, are we free to modify the Storage interface to accommodate joint consensus? Taking into account the migration constraints we already have (see above), this means that applications which do not wish to use joint consensus can continue not to use it, though they will have to slightly adapt their interfaces (probably by wrapping their existing Storage in an etcd/raft-provided compatibility wrapper). If we are not allowed to "break" existing applications in this way, we'll provide a StorageV2 interface which can be supplied via the Config and which, if specified, enables the use of V2 (joint consensus) replication changes, while continuing to support V1 (one-at-a-time) changes.

Personally, I believe that asking users to slightly adapt their interfaces is fine, but I would like to have some clarity from the maintainers going forward.

cc @xiang90 @gyuho @yichengq

@jingyih
Contributor

jingyih commented Apr 25, 2019

@tbg Do you want to talk about this in etcd community meeting?
https://docs.google.com/document/d/16XEGyPBisZvmmoIHSZzv__LoyOeluC5a4x353CX0SIM/edit#heading=h.tu6uwzxqfg8x

@gyuho
Contributor

gyuho commented Apr 25, 2019

I remember @xiang90 had some discussion with @ongardie... Can't find the link.

@gyuho
Contributor

gyuho commented Apr 25, 2019

@xiang90

This is the closest that I can find:

https://groups.google.com/d/msg/raft-dev/t4xj6dJTP6E/NMfCgTO90t0J

In etcd/raft implementation, we actually used a different but similar
approach. We only applied the configuration change after it is
committed by the old quorum (rather than as soon as it is added). I
thought this is more safe and the configuration change command would
go through the same commit/apply path as normal command.

I think there is another difference that might worth sharing.

In etcd/raft, the newly joined follower would try to get the most
recent snapshot and append the log entries out of raft protocol.

I feel this can also help to simplify the configuration change
protocol a little bit.

Here is the original discussion between etcd and cockroach team:
#2397 (comment)

tbg added a commit to tbg/etcd that referenced this issue Apr 26, 2019
The Progress maps contain both the active configuration and information
about the replication status. By pulling it into its own component, this
becomes easier to unit test and also clarifies the code, which will see
changes as etcd-io#7625 is addressed.

More functionality will move into `prs` in self-contained follow-up commits.
tbg added a commit to tbg/etcd that referenced this issue Apr 26, 2019
This cleans up the mechanical refactor in the last commit and will
help with etcd-io#7625 as well.
tbg added a commit to tbg/etcd that referenced this issue Apr 26, 2019
This now delegates the quorum computation to r.prs, which will allow
it to generalize in a straightforward way when etcd-io#7625 is
addressed.
@siddontang
Contributor

Interesting, @gyuho @xiang90

We have already implemented joint consensus in the Rust raft-rs (tikv/raft-rs#202), but it is not used in TiKV yet. We can contribute it back to etcd Raft.

@tbg
Contributor

tbg commented Apr 30, 2019

Thanks @jingyih, I'll be there. Added an item to the agenda.

@xiang90
Contributor

xiang90 commented May 2, 2019

@tbg

So some history here.

When we started to implement etcd/raft, we wanted to begin with a simpler reconfiguration protocol (both to save time and to explore whether it was actually feasible). As a result, etcd implemented the atomic membership change first, and the Raft thesis was published later with a modified version of atomic membership change. The ideas are similar, but they are not exactly the same.

I like apply-time reconfiguration since the reconfig command involves both a raft protocol change and an application-level change, and I want the application-level change to be strictly ordered. To implement this with commit-time reconfiguration, the code/execution paths for the raft protocol change and the application-level change of a reconfiguration would have to be separated. From a safety perspective, I remember we added a few barriers to avoid double in-flight reconfigurations. To be honest, I do not remember the details now, and I am not sure whether the cases described in your analysis can actually happen. If they can, I guess it is still fixable :P.

@xiang90
Contributor

xiang90 commented May 2, 2019

I am fine with adding joint membership change as an alternative. Do you really need to replace the atomic one, or can they co-exist (at least such that one or the other can be configured at start time)?

tbg added a commit to tbg/etcd that referenced this issue Jul 16, 2019
This commit introduces machinery to safely apply joint consensus
configuration changes to Raft.

The main contribution is the new package, `confchange`, which offers
the primitives `Simple`, `EnterJoint`, and `LeaveJoint`.

The first two take a list of configuration changes. `Simple` only
declares success if these configuration changes (applied atomically)
change the set of voters by at most one (i.e. it's fine to add or
remove any number of learners, but change only one voter). `EnterJoint`
makes the configuration joint and then applies the changes to it, in
preparation of the caller returning later and transitioning out of the
joint config into the final desired configuration via `LeaveJoint()`.

This commit streamlines the conversion between voters and learners, which
is now generally allowed whenever the above conditions are upheld (i.e.
it's not possible to demote a voter and add a new voter in the context
of a Simple configuration change, but it is possible via EnterJoint).
Previously, we had the artificial restriction that a voter could not be
demoted to a learner, but had to be removed first.
Even though demoting a learner is generally less useful than promoting
a learner (the latter is used to catch up future voters), demotions
could see use in improved handling of temporary node unavailability,
where it is desired to remove voting power from a down node, but to
preserve its data should it return.

An additional change that was made in this commit is to prevent the use
of empty commit quorums, which was previously possible but for no good
reason; this:

Closes etcd-io#10884.

The work left to do in a future PR is to actually expose joint
configurations to the applications using Raft. This will entail mostly
API design and the addition of suitable testing, which to be carried
out ergonomically is likely to motivate a larger refactor.

Touches etcd-io#7625.
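As it later landed, the joint flavor of these primitives can also be driven explicitly from the application; a sketch (node IDs hypothetical, same package and imports as the earlier sketches):

```go
// enterJoint proposes a change that explicitly enters a joint
// configuration (EnterJoint); the group then runs with both the old
// and the new voter set until leaveJoint is proposed.
func enterJoint(ctx context.Context, n raft.Node, addID, removeID uint64) error {
	cc := raftpb.ConfChangeV2{
		Transition: raftpb.ConfChangeTransitionJointExplicit,
		Changes: []raftpb.ConfChangeSingle{
			{Type: raftpb.ConfChangeAddNode, NodeID: addID},
			{Type: raftpb.ConfChangeRemoveNode, NodeID: removeID},
		},
	}
	return n.ProposeConfChange(ctx, cc)
}

// leaveJoint proposes the zero-value ConfChangeV2, which etcd/raft
// interprets as the request to leave the joint configuration (LeaveJoint).
func leaveJoint(ctx context.Context, n raft.Node) error {
	return n.ProposeConfChange(ctx, raftpb.ConfChangeV2{})
}
```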
tbg added a commit to tbg/etcd that referenced this issue Aug 7, 2019
It turns out that learners must be allowed to cast votes.

This seems counter-intuitive but is necessary in the situation in which
a learner has been promoted (i.e. is now a voter) but has not learned
about this yet.

For example, consider a group in which id=1 is a learner and id=2 and
id=3 are voters. A configuration change promoting 1 can be committed on
the quorum `{2,3}` without the config change being appended to the
learner's log. If the leader (say 2) fails, there are de facto two
voters remaining. Only 3 can win an election (due to its log containing
all committed entries), but to do so it will need 1 to vote. But 1
considers itself a learner and will continue to do so until 3 has
stepped up as leader, replicates the conf change to 1, and 1 applies it.

Ultimately, by receiving a request to vote, the learner realizes that
the candidate believes it to be a voter, and that it should act
accordingly. The candidate's config may be stale, too; but in that case
it won't win the election, at least in the absence of the bug discussed
in:
etcd-io#7625 (comment).
tbg added a commit to tbg/etcd that referenced this issue Aug 7, 2019
It has often been tedious to test the interactions between multi-member
Raft groups, especially when many steps were required to reach a certain
scenario. Often, this boilerplate was as boring as it is hard to write
and hard to maintain, making it attractive to resort to shortcuts
whenever possible, which in turn tended to undercut how meaningful and
maintainable the tests ended up being - that is, if the tests were even
written, which sometimes they weren't.

This change introduces a datadriven framework specifically for testing
deterministically the interaction between multiple members of a raft group
with the goal of reducing the friction for writing these tests to near
zero.

In the near term, this will be used to add thorough testing for joint
consensus (which is already available today, but wildly undertested),
but just converting an existing test into this framework has shown that
the concise representation and built-in inspection of log messages
highlights unexpected behavior much more readily than the previous unit
tests did (the test in question is `snapshot_succeed_via_app_resp`; the
reader is invited to compare the old and new version of it).

The main building block is `InteractionEnv`, which holds on to the state
of the whole system and exposes various relevant methods for
manipulating it, including but not limited to adding nodes, delivering
and dropping messages, and proposing configuration changes. All of this
is extensible so that in the future I hope to use it to explore the
phenomena discussed in

etcd-io#7625 (comment)

which requires injecting appropriate "crash points" in the Ready
handling loop. Discussions of the "what if X happened in state Y"
can quickly be made concrete by "scripting up an interaction test".

Additionally, this framework is intentionally not kept internal to the
raft package. Though this is in its infancy, a goal is that it should
be possible for a suite of interaction tests to allow applications to
validate that their Storage implementation behaves accordingly, simply
by running a raft-provided interaction suite against their Storage.
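For a flavor of the framework: an interaction test is a plain-text script of directives, each followed by its expected output (which the framework can rewrite). The directives below follow the style of the raft testdata files but are illustrative rather than exact:

```
add-nodes 3 voters=(1,2,3) index=2
----
ok

campaign 1
----
ok

propose-conf-change 1 transition=explicit
v4 r3
----
ok

stabilize
----
... (message traffic and configuration transitions are printed here) ...
```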
@tbg
Contributor

tbg commented Oct 28, 2019

Continuing discussion with @hicqu from #11284 here:

@tbg thanks for your reply! It seems you're worried about rolling updates of a Raft cluster. How about making the new behavior an option? Only if it is enabled do configuration changes become effective on receipt; otherwise they become effective on apply, just like now. And we can add a rolling-update interface on the Storage interface to document how to update a cluster to enable the feature. According to our research, the update is safe; all applications need to do on restart is read all received-but-not-applied Raft log entries and the configuration changes in them.

My concern is less that this can't be made to work, but that it takes a production use case of etcd/raft to work out the kinks. If I understand correctly, you're using "proper" joint consensus in TiKV already (?) and so you are likely aware of most of the problems (you had upgrade issues to worry about as well, correct?). Yet, I just worry that there will end up being more work necessary than you're able to contribute (since you're not using etcd/raft yourself - correct?), and that we're going to get stuck awkwardly in-between. For example, intuitively I feel that we shouldn't try to maintain two mechanisms side by side, simply because I suspect that this will lead to a lot of confusion in the code (this may not be true). But then we force all apps into a more complicated contract (ApplyConfChange may now be called for config changes that end up being rolled back, but existing apps have sometimes attached triggers to these, etc) and in particular we'll be forced to make changes to etcdserver and roll back some weird things they do there. (If "both APIs" can be made to work first, I'd prefer that, to avoid buying into the new mechanism too deeply).

That isn't to say that I am not interested in the improvement - I am - just that at this point I can't give any particular commitment to reviewing, improving, or merging the work. However, if this isn't too off-putting, we could scope out the work to get a more concrete picture of what changes would be entailed. For example, how would the Node interface have to change, which areas in the code would need to change, etc -- all of this could be demonstrated in a prototype. Even if the result is that we won't follow through at the moment, this will be a good exercise to set us up for success once we do want to follow through.

@tbg
Contributor

tbg commented Nov 7, 2019

FYI, in CockroachDB we are now working around this problem by not removing voters directly. We demote them first (i.e. turn them into learners) and then remove them. That avoids the unavailability when the leader crashes.
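In ConfChangeV2 terms the workaround looks roughly like the sketch below (demoting a voter is expressed by re-adding it as a learner; whether an explicit joint transition is needed depends on what else changes at the same time; same package and imports as the earlier sketches):

```go
// demoteThenRemove takes a voter out in two commits: first demote it to
// a learner (it keeps its data and stops counting toward quorum), then
// remove the learner outright.
func demoteThenRemove(ctx context.Context, n raft.Node, id uint64) error {
	demote := raftpb.ConfChangeV2{
		Changes: []raftpb.ConfChangeSingle{
			{Type: raftpb.ConfChangeAddLearnerNode, NodeID: id},
		},
	}
	if err := n.ProposeConfChange(ctx, demote); err != nil {
		return err
	}
	// ...wait for the demotion to commit and apply cluster-wide...
	remove := raftpb.ConfChangeV2{
		Changes: []raftpb.ConfChangeSingle{
			{Type: raftpb.ConfChangeRemoveNode, NodeID: id},
		},
	}
	return n.ProposeConfChange(ctx, remove)
}
```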

@hicqu
Contributor

hicqu commented Nov 7, 2019

@tbg thanks for your effort! I believe it's helpful; however, it may not resolve the problem at the root.

Suppose there is a Raft group with peers A, B, C and D. At some moment their configurations could be:

A: (B, C, D)
B: (A, B, C, D) && (B, C, D), learners_next: (A)
C: (A, B, C, D) && (B, C, D), learners_next: (A)
D: (A, B, C, D) && (B, C, D), learners_next: (A)

A has left the joint configuration and responded to the proposer that A is removed. However, B, C and D don't know this yet.
Then A and B fail, and C and D still can't elect a leader.

I'm also trying to find a solution for this case. My current thought is to make leaving the joint configuration effective on receipt. That resolves the problem perfectly, and it seems the compatibility impact would not be too large.

NOTE: although the leave-joint entry becomes effective on receipt, the old leader responds to the proposer only after the entry is committed, which means a majority of the group's peers have received the entry. So the problem is resolved.

Please take a look, thanks!

@tbg
Contributor

tbg commented Nov 7, 2019

In your example, A is two configurations ahead of B, C, D, which should not be possible if (Append) from #7625 (comment) holds. A quorum of B,C,D will know that they can switch to (B, C, D) + Learner(A) before campaigning, so they can elect a leader. Am I missing something?

@hicqu
Contributor

hicqu commented Nov 7, 2019

Suppose the Raft group enters the joint configuration at index 10 and leaves it at index 11. Then their configurations and states could be:

A: voters: (B, C, D), learners: (A), last_index: 11, committed: 11, applied: 11
B: voters: (A, B, C, D)&&(B, C, D), learners_next: (A), last_index: 11, committed: 10, applied: 10
C: voters: (A, B, C, D)&&(B, C, D), learners_next: (A), last_index: 11, committed: 10, applied: 10
D: voters: (A, B, C, D)&&(B, C, D), learners_next: (A), last_index: 11, committed: 10, applied: 10

It seems A is only one configuration ahead of B, C and D.
And B, C and D may never be able to apply up to 11, because A did not broadcast a commit index of 11 to them before failing.

@tbg
Contributor

tbg commented Nov 7, 2019

Right, but what's your example? A is still around until it gets removed. If you're saying A goes down while another node also goes down, then yes, you could lose quorum, but you started out with four nodes where that was also true; and if you wanted to know when that stopped being true, you could wait for a majority to have applied the latest change (ugly, but it works).

The thing I really want to avoid is having anything funky happen when replacing a node, i.e. going from A B C to A B D. Without the intermediate learner step, an example similar to 2) from your initial post shows that the leaseholder failing at the "right" moment will wedge the raft group, because the removal of C being processed on C can also lead to C going "down", as in not responding to requests any more.

@bdarnell
Contributor

bdarnell commented Apr 7, 2020

@tbg What's left to do here? We've been using joint consensus in CRDB for a while now. Is the only open item to allow removal of a voter directly without downgrading to learner first?
