stability: PreVote can lead to unavailable ranges #18151
Strange, the badness seems to have migrated to
Trying to piece together what happened to
This seems to correspond to this rebalance on
Note that the range already has only 2 replicas at this point.
Ah, the next few log lines on
So that rebalance never succeeded, which is probably why the replica was eventually GC'd be
It's normal that every subsequent rebalance would fail if neither of the two existing replicas is the raft leader. This appears to be a bug in PreVote: the replica on node 7 is ahead of the one on node 5 (by raft log LastIndex), but it has a lower term. Node 7 will never vote (pre-vote or regular) for node 5 because of node 5's lower LastIndex, but node 7's MsgPreVote is getting dropped (early in the raft message pipeline) because its term is too low. Without PreVote, terms are adjusted more often, so I don't think you could get stuck like this. (This is probably also an artifact of the transition from CheckQuorum to PreVote: node 5 wouldn't have increased its term under PreVote, but now that it has, it's persistent.) So before we can re-enable PreVote, we need another change to etcd/raft to allow positive responses to pre-votes with an older term (as long as they're higher than our last log term).
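To make the deadlock concrete, here is a minimal, self-contained Go sketch of the two checks described above. It is a toy model rather than etcd/raft code: the `replica` struct, the helper names, and the term/index numbers for n5 and n7 are all made up for illustration, chosen only so that the node with the longer log (n7) has the lower persisted term.

```go
// Toy model of the pre-vote deadlock (not etcd/raft code): one check drops
// messages from lower terms, the other only grants votes to candidates with
// an up-to-date log. With the numbers below, neither replica can satisfy
// both checks on its peer, so no pre-vote election can ever succeed.
package main

import "fmt"

// replica holds only the state relevant to the deadlock.
type replica struct {
	name      string
	term      uint64 // current persisted term
	lastIndex uint64 // index of the last raft log entry
	lastTerm  uint64 // term of the last raft log entry
}

// droppedForLowerTerm models the early message-pipeline check: incoming
// messages whose term is below the receiver's current term are discarded
// before the vote-granting logic ever sees them.
func droppedForLowerTerm(receiver, candidate replica) bool {
	return candidate.term < receiver.term
}

// logUpToDate models the vote-granting rule: a voter grants a (pre-)vote
// only if the candidate's log is at least as up to date as its own.
func logUpToDate(voter, candidate replica) bool {
	if candidate.lastTerm != voter.lastTerm {
		return candidate.lastTerm > voter.lastTerm
	}
	return candidate.lastIndex >= voter.lastIndex
}

// canWinPreVote reports whether candidate can obtain voter's pre-vote.
func canWinPreVote(candidate, voter replica) bool {
	if droppedForLowerTerm(voter, candidate) {
		return false // MsgPreVote never reaches the voter's vote logic
	}
	return logUpToDate(voter, candidate)
}

func main() {
	// Hypothetical values: n7 has the longer log, n5 the higher term.
	n5 := replica{name: "n5", term: 10, lastIndex: 90, lastTerm: 8}
	n7 := replica{name: "n7", term: 8, lastIndex: 100, lastTerm: 8}

	// n7's MsgPreVote is dropped by n5 (term too low); n5 is refused by
	// n7 (log not up to date). The range is stuck until terms realign.
	fmt.Printf("%s wins %s's pre-vote: %v\n", n7.name, n5.name, canWinPreVote(n7, n5))
	fmt.Printf("%s wins %s's pre-vote: %v\n", n5.name, n7.name, canWinPreVote(n5, n7))
}
```

Under this model, the change proposed above amounts to not applying the lower-term drop to MsgPreVote, so a candidate like n7 can still be judged on its log and win the pre-vote despite its older term.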
Here is when
Along with that removal I see:
@bdarnell Do you think the cluster will recover if I restart with pre-vote disabled?
Yes, this range should recover when pre-vote is disabled.
Bold declaration. I'll restart the cluster with pre-vote disabled to see if you're right.
Not too bold. I specifically said "this range", not the whole cluster :)
Ok, restarting with pre-vote disabled at
Throughput on the cluster is dropping to 0 again due to |
Filed upstream: etcd-io/etcd#8501. |
I've been doing some chaos testing and I haven't seen any persistently-stuck ranges, so I'm tentatively marking this as fixed by etcd-io/etcd#8525. Discussion will continue in #16950.
Possibly related to #17741.
`blue` is running 90840ef. It has been experiencing 0 QPS for most of the last week. The only spikes of throughput are immediately upon a node recovering from chaos. I've now disabled chaos, enabled some additional debugging settings and stopped load on all nodes except `blue-01`.

`./debug/requests` on `blue-01` shows a common pattern:

Every single request is stuck sending to `blue-02`. On `blue-02` we see:

In the above instance, there are now 145 lease attempts. Would be nice to know why they are failing and why another node isn't grabbing the lease.

Node liveness looks reasonable on every node except for `blue-02`:

Hmm, something seems to have unwedged itself, as I just saw a burst of throughput, but now it is falling again.